Development of an Artificial Intelligence-assisted CRISPR-Cas Scan (AIL-Scan) strategy based on an ESM large language model
We hypothesized that, by embedding functional features together with protein primary sequences, we could trace natural evolutionary rules and identify CRISPR-Cas proteins in metagenomic data directly, without sequence alignment. To identify CRISPR-Cas proteins, we developed an Artificial Intelligence-assisted CRISPR-Cas Scan (AIL-Scan) strategy (Fig. 1a). It includes the following steps:
1. CRISPR-Cas training data is created by extracting CRISPR-associated (Cas) proteins from the NCBI database, classifying them by genes, and removing redundant sequences.
2. Supervised fine-tuning of ESM on the CRISPR-Cas training data based on the biological information to predict the Cas protein.
3. Feature analyses of Cas proteins, including cleavage activity, CRISPR-loci type, CRISPR loci-length, direct repeats, spacers, evolutionary analyses, MSA, and structures.
a The ESM language model is trained by Cas proteins, which were collected, classified, and clustered as input sequences. The Cas proteins were embedded and classified with multiple labels. The trans-cleavage activity prediction model was developed based on the ESM and small-scale experimental data of trans-cleavage. The trained model was applied to discover Cas proteins and predict features from the sequences extracted from the metagenome. The protein structures were visualized using Chimera59. The sequence alignment was visualized by Jalview61. b The receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) for 12 Cas proteins and non-Cas proteins. c The test loss and test accuracy curves of AIL-Scan.
We generated our training data using reviewed NCBI gene data. We annotated Cas1, Cas2, Cas3, Cas4, Cas5, Cas6, Cas7, Cas8, Cas9, Cas10, Cas12, and Cas13 proteins. Non-Cas proteins were extracted according to the following rules: only proteins without a Cas annotation were retained, and proteins with sequence similarity over 40% were removed. The Cas protein database was separated into training and validation databases using CD-HIT-2D with a 40% identity threshold to remove redundant sequences and avoid overfitting. We collected 76567 non-redundant positive sequences and 13047 non-Cas proteins, which were deposited in NCBI before July 5, 2023 (Supplementary Fig. 1). The maximal protein length is less than 1764 amino acids. To obtain the best classification, we introduced the “focal loss” in the classification to address the class imbalance of the input data. We obtained the best model during the 13th epoch of model training, with 97.75% accuracy for the ESM 2 model with 650 million (650M) parameters (Supplementary Fig. 2). Using the 15 billion (15B) parameter model, we achieved the best performance in the 9th epoch, with 98.22% accuracy (Supplementary Fig. 2). This model maintained consistent performance, achieving an accuracy of 97.68% on the independent dataset, i.e., TestSet2024, which contains sequences deposited in NCBI from July 6, 2023, to Oct 28, 2024 (Supplementary Tables 1–3). These results indicate robust generalization of this model. The accuracy and prediction speed of AIL-Scan are comparable to those of CRISPRcasIdentifier, which integrates HMMs and machine learning (Table 1 and Supplementary Fig. 3). CASPredict was the fastest of the four tools, although its accuracy is lower than that of the machine learning-based tools, i.e., AIL-Scan and CRISPRcasIdentifier. However, the NCBI data has been partially annotated by the HMM model, so we next validated AIL-Scan’s capability to recognize “unseen proteins”.
We utilized a recent dataset of 3601 Cas12 family protein sequences20, in which 3521 sequences (97.8%) had less than 90% similarity with the training set and 3351 sequences (93.1%) had less than 40% similarity with the training set. This test set, named TestSet2025, is substantially distinct from the training set in sequence space, making it suitable for evaluating generalization ability. AIL-Scan successfully identified 3182 Cas12 proteins, whereas the HMM model identified 1240 sequences, demonstrating the strong generalization capability of AIL-Scan. Considering resource consumption, the 650M model is sufficient for Cas prediction. We used t-SNE to reduce the dimensionality of the ESM embeddings of 77684 sequences and found that ESM can distinguish the various Cas classes. The ROC curves and AUC values are shown for all the Cas and non-Cas proteins (Fig. 1b); the AUC is the probability that a positive sample’s decision value is greater than a negative sample’s decision value. The test loss and test accuracy also indicate that the model generalizes correctly and performs well on unseen data (Fig. 1c). We evaluated the model robustness using 5-fold cross-validation. The average accuracy is 0.9786 and the standard deviation is 0.0013 (Supplementary Table 4).
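The focal loss introduced above to handle class imbalance down-weights samples the classifier already gets right, so that rare or hard classes contribute more to training. The following is a minimal sketch of the standard multi-class form, FL = -alpha * (1 - p_t)^gamma * log(p_t), and is not the training code used in this work. The function name `focal_loss` and the default values of `gamma` and `alpha` are illustrative assumptions, since the values used for AIL-Scan are not stated here.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for a single sample.

    probs  : predicted class probabilities (summing to 1)
    target : index of the true class
    gamma  : focusing parameter (assumed value; gamma=0 gives cross-entropy)
    alpha  : class weight (assumed value)
    """
    p_t = max(probs[target], 1e-12)  # guard against log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct sample is damped far more than a hard one.
easy = focal_loss([0.9, 0.05, 0.05], target=0)
hard = focal_loss([0.2, 0.5, 0.3], target=0)
```

With `gamma=0` the expression reduces to ordinary cross-entropy, which makes the focusing term easy to ablate.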
We used the Global Microbial Gene Catalog (GMGC) metagenomic database for Cas protein discovery21. We selected 50,000 high-quality bins from GMGC and extracted 20,000 MAGs that include CRISPR loci to test the performance of AIL-Scan. The protein sequences were predicted by Prodigal software22. We collected ca. 20,000,000 protein sequences shorter than 1500 amino acids for prediction. In comparison with the established methods, AIL-Scan predicted 1379 Cas12a sequences.
Development of a trans-cleavage activity prediction model
The trans-cleavage activity of Cas12a has been used in various applications. Although many CRISPR-Cas12a proteins have been identified, few of them have been tested in trans-cleavage experiments. Therefore, the main challenge in this study lies in dealing with a small sample size coupled with high-dimensional embeddings, which often leads to convergence issues for most models. A total of 69 labeled Cas12a proteins (including three known Cas12a) were included in our analysis (Supplementary Data 1). Their trans-cleavage activities were assessed by the fluorophore-quencher (FQ) reporter assay. A protein was defined as active in trans-cleavage if it displayed a fluorescence intensity at least twice that of the negative control. Thirty-three proteins were classified as active, and the remaining 36 proteins were categorized as inactive. To evaluate the performance of our predictive model, a test set comprising 13 randomly selected proteins (approximately 20% of the sample) was used, while the remaining 56 proteins were employed for training. Initially, we recorded the last embedding layer of our fine-tuned ESM model for all labeled Cas12a protein sequences. These embeddings (1280 dimensions) were used as covariates to predict trans-cleavage activity.
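The labeling rule and the hold-out split described above can be sketched as follows. This is an illustration only: the helper names `label_active` and `split_train_test`, the random seed, the placeholder protein identifiers, and the reading of the threshold as "at least twice the negative control" are assumptions.

```python
import random

def label_active(signal, negative_control, fold=2.0):
    """True when the reporter signal reaches `fold` times the negative control."""
    return signal >= fold * negative_control

def split_train_test(items, n_test, seed=0):
    """Shuffle a copy of `items` and hold out `n_test` of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# 69 labeled proteins: 13 held out for testing, 56 used for training.
proteins = [f"protein_{i}" for i in range(69)]  # placeholder identifiers
train, test = split_train_test(proteins, n_test=13)
```

Fixing the seed keeps the 56/13 partition reproducible, which matters when a single misclassified test protein shifts the accuracy by several percentage points.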
Different forms of decision tree models were evaluated in this task. Light Gradient Boosting Machine (LightGBM) achieved the highest accuracy among mainstream machine learning models when trained directly on the embeddings, with an accuracy of 69.2% on the test set. To address dimensionality-related challenges, principal component analysis (PCA) was employed to extract essential embeddings, with prediction performance evaluated across 2–15 principal components. Alongside PCA, we compared 31 alternative methods, including t-SNE, UMAP, and raw data. Detailed comparisons, training procedures, and results are provided in Table 2, Supplementary Table 5, and the supplementary notes. LightGBM, CatBoost, and RandomForest achieved an accuracy of 92.3% on the test set (12 out of 13 proteins correctly labeled) with 4, 6, and 8 principal components, respectively. Thus, compared to training models directly on the embeddings, extracting essential dimensions with PCA provides higher accuracy in predicting trans-cleavage activity (Supplementary Table 5). However, this model is still limited by the small dataset; more experimental data would improve its prediction accuracy. Additionally, we tested our prediction model on two unreported Cas12a candidates: ArCas12a_1 (derived from Agathobacter rectale) and LeCas12a_3 (derived from Lachnospira eligens_B). Our model predicted that ArCas12a_1 has trans-cleavage activity but LeCas12a_3 does not. In the experiment, ArCas12a_1 demonstrated significantly stronger trans-cleavage activity than the negative control, while LeCas12a_3 did not (Supplementary Fig. 4). These experimental outcomes were consistent with our model’s predictions, supporting the generalizability and robustness of the prediction model.
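To make the dimensionality-reduction step concrete, the sketch below extracts the leading principal components of a small matrix by power iteration with deflation and projects the samples onto them. It is a dependency-free illustration of PCA and not the pipeline used here. The function names, iteration count, and seed are assumptions, and for 1280-dimensional embeddings a numerical library would be the practical choice.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def principal_axes(rows, k, iters=200, seed=0):
    """Return the top-k unit-length principal axes of `rows`."""
    x = center(rows)
    n, d = len(x), len(x[0])
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in x) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    axes = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:  # no variance left in any direction
                break
            v = [c / norm for c in w]
        # Eigenvalue estimate, then deflate so the next axis is orthogonal.
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                  for a in range(d))
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)]
               for a in range(d)]
        axes.append(v)
    return axes

def project(rows, axes):
    """Coordinates of the centered rows along each principal axis."""
    return [[sum(r[j] * ax[j] for j in range(len(ax))) for ax in axes]
            for r in center(rows)]

# Toy data lying on the line y = 2x in three dimensions.
data = [[float(t), 2.0 * t, 0.0] for t in range(10)]
axes = principal_axes(data, k=1)
reduced = project(data, axes)
```

The reduced coordinates, rather than the raw embeddings, would then be passed to a tree-based classifier, with the number of components treated as a hyperparameter.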
CRISPR-Cas12a loci predicted from the metagenomics
We performed further feature analyses of the Cas12a candidate proteins. Phylogenetic analysis of Cas12 proteins suggests that the identified Cas12a proteins fall into the Cas12a clade (Fig. 2a). The classical CRISPR loci, comprising essential elements such as Cas1, Cas2, and Cas4, play a pivotal role in type classification. To examine these features, we employed AIL-Scan to predict Cas1, Cas2, and Cas4 proteins within the same CRISPR loci adjacent to the Cas12a sequence. Subsequently, we manually verified 300 predicted CRISPR loci to gain deeper insights. Normally, Cas12a is considered to have a unique CRISPR locus, comprising Cas1, Cas2, and Cas4. Intriguingly, the observed count of Cas1, Cas2, and Cas4 proteins was notably lower than that of Cas12a, suggesting the absence of these small Cas proteins in some Cas12a loci (Fig. 2b, c). Further stratification based on the number of integrase proteins led to the classification of CRISPR loci into eight distinct subtypes. The distribution of integrase proteins across these subtypes exhibited a sparse pattern (Fig. 2d). Notably, subtype VIII lacked any integrase proteins, subtype I encompassed Cas1, Cas2, and Cas4, while subtype VI exclusively featured Cas2. This classification sheds light on the diversity within CRISPR loci and underscores the variation in integrase composition among different subtypes. Our observations may provide unreported perspectives on correlations among different CRISPR-Cas systems and integrase proteins. Remarkably, the analysis of 1000 predicted CRISPR-Cas12a loci without manual verification shows a distribution pattern strikingly similar to that of the 300 manually confirmed ones, suggesting that this distribution is a general phenomenon (Supplementary Fig. 5). To provide further insights, we measured the length of CRISPR loci, beginning from the start of the Cas12 protein and concluding at the first spacer.
Subtype VIII emerged as the shortest, spanning a mere 4200 bp, while subtype I was the longest, extending over 6100 bp. Particularly noteworthy were certain subtype I CRISPR loci with lengths of up to 6700 bp, raising the possibility that they harbor uncharacterized protein elements (Fig. 2e). In line with the integrase variation, the numbers of spacers notably decreased in subtypes IV, VI, and VIII, underscoring the pivotal roles of integrases in spacer capture (Fig. 2f). Despite the divergence in spacer numbers, the stem-loop region corresponding to direct repeat sequences remained conserved (Fig. 2g). This conservation hints at a shared structural element, emphasizing the importance of the stem-loop region in CRISPR loci across different subtypes.

a Phylogenetic tree of Cas12 proteins. The identified Cas12a proteins in this work were highlighted in red in the Cas12a family. b Cas12a subtypes with different combinations of accessory proteins, i.e., Cas4, Cas1, and Cas2. c Statistics of Cas12, Cas1, Cas2, and Cas4 from 300 CRISPR-loci, which were verified manually. The features of the first 1000 CRISPR-loci were analyzed in Supplementary Fig. 5. d Statistics of subtypes in the 300 CRISPR-loci. e Sequence length variation in different subtypes. DNA sequence length was calculated from the start codon of the Cas12a gene to the end of the first repeat. f Statistics of spacers in different subtypes. g Sequence alignment of direct repeats in the 300 CRISPR-loci. The sequence corresponding to the stem loop region of crRNA was highlighted with a gray background. h Distribution of Cas proteins in different subtypes and species. The subtypes were colored in the inner circle. The species were labeled in the outer circle. Error bar indicates mean ± s.e.m. measured from three technical replicates. n = 3. Statistical significance was assessed using one-way ANOVA analysis. The symbol ‘#’ indicated that the metagenomes in the corresponding subtypes did not contain spacer sequences. Source data are provided as a Source Data file.
To explore the distribution of the discovered proteins among organisms, we constructed a phylogenetic tree using the 300 manually verified candidate Cas12a proteins, along with three known Cas12a proteins (LbCas12a, FnCas12a, and AsCas12a). A total of 232 Cas12a proteins from the Lachnospiraceae family cluster into one clade. Within this clade, subclade 1 consists of 62 subtype I Cas12a proteins, 81 subtype VII Cas12a proteins, and a modest representation of other subtypes. Subtype I and subtype IV are the principal constituents of subclade 2. Furthermore, subclade 3 is marked by the exclusive presence of 28 subtype VIII Cas12a proteins originating from the Acutalibacteraceae family. Notably, 94.6% of the identified Cas12a proteins originate from enteric microorganisms (Fig. 2h), which may be due to the ease of recovering high-quality genomes from enteric microorganisms. Additionally, the thermostable YmeCas12a (subtype I) is adjacent to subtype I Cas12a proteins (Supplementary Fig. 6).
Cas integrases in CRISPR loci
New insights highlight the structural diversity and functional roles of Cas integrases in CRISPR loci23,24,25,26,27. Cas1, Cas2, and Cas4 are essential for integrating foreign DNA into bacterial CRISPR systems, which generates bacterial immunity26. AlphaFold228 was applied to predict all protein structures in the eight distinct subtypes, providing insights into their variation (Fig. 3 and Supplementary Fig. 7). Cas1 proteins, encompassing 92–331 amino acids, are classified into eight types based on structure and sequence (Fig. 3a, b and Supplementary Fig. 7b). Type 8 is the most prevalent Cas1 type; it resembles AfCas1 (PDB: 4N06)29, and its N-terminal and C-terminal domains (NTD, CTD) contain key catalytic sites in specific helices and loops (Supplementary Fig. 7c). Structural differences across types were analyzed via the Dali server30. The variation in CTD elements does not necessarily hinder foreign DNA acquisition31, emphasizing their structural flexibility. Cas2 proteins, containing 70–146 amino acids, also fall into eight subtypes, with type 8 showing notable structural similarities to E. coli Cas2 (PDB: 5DQT)32 but with unique N-terminal helices (Fig. 3c, d and Supplementary Fig. 7d–f). Other subtypes exhibit varied structural deficiencies, such as missing β-sheets or helices, affecting dimer interfaces and potentially altering DNA binding. This diversity underlines Cas2’s adaptability within Cas1–Cas2 complexes (Supplementary Fig. 7f)33. Cas4 proteins, comprising 79–206 amino acids, exhibit eight types (Fig. 3e, f and Supplementary Fig. 7g, h), with type 8 resembling I-C Cas4 (PDB: 8D3Q)24 but lacking specific helices critical for protospacer cleavage. Structural differences across subtypes, such as missing helices or β-sheets, impact spacer insertion and integration within CRISPR systems (Supplementary Fig. 7i).
These findings broaden our understanding of Cas4 structural variations and their functional implications in bacterial immunity. The detailed structural features of integrases are analyzed in the Supplementary Note.

a, c, e The RMSD matrix of Cas1, Cas2, and Cas4 structure models constructed by AlphaFold2. Colors within the heatmap, ranging from dark blue to white, represent the RMSD values ranging from high to low. The protein names were colored based on their structure type classification. The color of each protein name corresponds to the protein structure type displayed in the right panel. b, d, f Typical structure models of Cas1, Cas2, and Cas4, which were classified into different types. Secondary structures were annotated for all protein types. Type 1–7 structures of Cas1, Cas2, and Cas4 were superposed onto each full-length type 8 structure, and secondary structures were labeled. The “αX” in type 1 of (f) indicates that it does not appear in other Cas4 structure types.
Cas12a proteins in the subtypes
The differences in the Cas12a structures are key features of the Cas12a subtypes. We analyzed the motifs of the Cas12a sequences and discovered conserved and distinct motifs in the different subtypes, which are key for Cas12a function (Supplementary Fig. 8). The analysis revealed that the catalytic residues within the RuvC and Nuc domains are highly conserved among all subtypes, reflecting their critical roles in enzymatic function. Specifically, the first catalytic aspartate in the triad resides within the conserved motif IGIFRGEERN. The second catalytic glutamate displays subtype-specific distributions, appearing as MED in subtypes I, IV, V, and VI, as M/LEN/D in subtype II, and as MEK/D in subtype VIII. The third catalytic aspartate is consistently located in the motif DADANG, specifically at the second “D”. Additionally, a highly conserved TSKIDP motif was identified across all subtypes, indicating a shared functional mechanism. Other conserved motifs showed variability among subtypes, suggesting distinct sequence characteristics while maintaining overall catalytic and structural integrity. We also built structure models of the 300 Cas12a proteins using AlphaFold2, excluding those whose construction failed, and calculated the root mean square fluctuation (RMSF) for all candidate Cas12a proteins within one subtype (Supplementary Fig. 9). The detailed analyses are provided in the Supplementary Notes. The RMSF reflects the residue-wise structural difference within one subtype. The results suggested that, despite an overall conserved structural architecture, specific regions within the proteins exhibit variability that may reflect structural adaptations specific to each subtype.
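As an illustration of the residue-wise RMSF described above, the sketch below computes, for each residue position, the root mean square deviation of its coordinates from the mean position across a set of models. It assumes that the models have already been superposed and share the same residue numbering (for example, one C-alpha coordinate per aligned position). How the models were aligned in this work is not specified here, and the function name `rmsf` is hypothetical.

```python
import math

def rmsf(models):
    """Per-residue RMSF across pre-superposed structure models.

    models : list of models; each model is a list of (x, y, z) tuples,
             one per residue, in the same order and of the same length.
    """
    n_models = len(models)
    n_residues = len(models[0])
    values = []
    for i in range(n_residues):
        # Mean position of residue i over all models.
        mean = [sum(m[i][k] for m in models) / n_models for k in range(3)]
        # Mean squared distance of residue i from that mean position.
        msd = sum(
            sum((m[i][k] - mean[k]) ** 2 for k in range(3)) for m in models
        ) / n_models
        values.append(math.sqrt(msd))
    return values

# Two toy models: residue 0 differs by 2 units, residue 1 is identical.
toy = [[(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)],
       [(2.0, 0.0, 0.0), (5.0, 5.0, 5.0)]]
profile = rmsf(toy)
```

High values in the resulting profile mark positions where models within a subtype diverge, while values near zero mark a structurally conserved core.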
Cas12a proteins have distinct cis- and trans-cleavage activities
Cas12a processes pre-crRNA transcripts into mature crRNA through its endoribonuclease activity. The Cas12a–crRNA complex then efficiently cis-cleaves double-stranded DNA (dsDNA), initiated by recognition of a PAM motif. The cleaved DNA segment that remains bound then induces non-specific degradation of single-stranded DNA (ssDNA) (Fig. 4a).

a Scheme of Cas12a activation, cis-, and trans-cleavage. The Cas12a from different subtypes was labeled with different colors. b Binding of Cas12a with crRNAs investigated by electrophoretic mobility shift assay (EMSA). c Binding of Cas12a with DNAs investigated by EMSA. d Scheme of PAM analyses using a double-strand DNA (dsDNA) array. Normalized PAM heatmaps for EvCas12a_2 (e), AmCas12a (f), RspCas12a_2 (g), CAGCas12a (h), and RbrCas12a_1 (i). Each heatmap was normalized from 6 targets, including the endogenous genes EMX1, DNMT1, and FANCF, 2 sites from eGFP, and 1 site from MERS virus genes. The individual maps were shown in Supplementary Fig. 12. The DNA sequences were listed in Supplementary Table 8. The sequence logos of the PAM sequences for each Cas12a variant are shown below the heatmap. Colors within the heatmap range from dark blue to white, illustrating the normalized intensity of each PAM sequence. Source data are provided as a Source Data file.
Therefore, we evaluated the RNA binding efficiency, DNA binding efficiency, and cis- and trans-acting DNase activities of sixteen Cas12a proteins from eight subtypes, derived from Anaeroglobus micronuciformis (AmCas12a), Eubacterium_G ventriosum (EvCas12a_1 and EvCas12a_2), Erysipelatoclostridium sp. (EspCas12a), Ruminococcus_E sp. (RspCas12a_1 and RspCas12a_2), Agathobacter rectale (ArCas12a), Lachnospira eligens (LeCas12a_1 and LeCas12a_2), UBA3388 sp. (UBACas12a), RC9 sp. (RCCas12a), CAG-127 sp. (CAGCas12a), and Ruminococcus_E bromii_B (RbrCas12a_1, RbrCas12a_2, RbrCas12a_3, and RbrCas12a_4) (Fig. 4, Supplementary Fig. 10 and Supplementary Table 6). Remarkably, the direct repeat sequence of these candidate Cas12a proteins is conserved with that of their well-characterized counterparts, i.e., LbCas12a (Fig. 2g and Supplementary Fig. 11). Therefore, we chose LbCas12a as the positive control in the following assays and used its crRNA scaffold in the screening step. All the Cas12a proteins show RNA and DNA binding ability as expected (Fig. 4b, c, Supplementary Fig. 10c, d, and Supplementary Table 7). However, the DNA binding ability of subtype I and subtype VIII proteins is higher than that of the other Cas12a proteins. Based on the inherent trans-DNase activity of Cas12a and the 4 bp PAM length, we developed a simple and efficient PAM detection method. We constructed 6 short dsDNA target arrays by annealing 256 kinds of PAM sequence primer pairs in each well, which target EMX1 site 1, DNMT1 site 1, FANCF site 1, MERS site 1, eGFP site 1, and eGFP site 3 (Supplementary Table 8). Each dsDNA target was incubated with a candidate Cas12a protein, crRNA, and FAM-BHQ reporter, and the fluorescence of each reaction was measured (Fig. 4d). Using this assay, we determined the PAM preferences of EvCas12a_2, AmCas12a, RspCas12a_2, CAGCas12a, and RbrCas12a_1. EvCas12a_2, RspCas12a_2, and CAGCas12a recognize T-rich PAMs, whereas AmCas12a prefers PAMs starting with G and RbrCas12a_1 recognizes a 5′-GTV-3′ PAM (Fig. 4e–i and Supplementary Figs. 11, 12).
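The figure of 256 PAM variants follows from the 4 bp PAM length: four possible bases at each of four positions give 4^4 = 256 sequences. The sketch below enumerates them and checks which ones match a degenerate pattern such as the canonical TTTV. The helper names `all_pams` and `matches` are illustrative, and only the IUPAC codes needed for this example are included.

```python
from itertools import product

# Subset of IUPAC nucleotide codes used in PAM notation.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "V": "ACG",   # any base except T
    "N": "ACGT",  # any base
}

def all_pams(length=4):
    """Every DNA sequence of the given length, in alphabetical order."""
    return ["".join(bases) for bases in product("ACGT", repeat=length)]

def matches(pam, pattern):
    """True when `pam` is compatible with the degenerate `pattern`."""
    return len(pam) == len(pattern) and all(
        base in IUPAC[code] for base, code in zip(pam, pattern)
    )

pams = all_pams()                                 # 4**4 sequences
tttv = [p for p in pams if matches(p, "TTTV")]    # canonical Cas12a PAM
```

Each enumerated sequence corresponds to one well of a target array, so a measured fluorescence value can be mapped back to its PAM by index.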
To corroborate the cis-acting DNase activity of the candidate Cas12a proteins, we incubated them with a crRNA and a linearized plasmid dsDNA. All linearized dsDNA substrates were degraded by the candidate Cas12a proteins with efficiency comparable to that of LbCas12a at 37 °C, with the exception of RCCas12a (Fig. 5a and Supplementary Fig. 13a). Sanger sequencing of the cleaved DNA ends revealed that AmCas12a introduced INDELs at 18 in NTS and 23 in TS, consistent with other Cas12a orthologs (Supplementary Fig. 13e, f). However, most Cas12a variants exhibited diminished DNase activity at room temperature (RT), leaving uncleaved DNA, except for subtype VIII Cas12a proteins, which lack integrases (Fig. 5b and Supplementary Fig. 13b). Subtype II Cas12a variants are slightly less active than LbCas12a in single-stranded DNA (ssDNA) degradation, while EspCas12a, EvCas12a_1, EvCas12a_2, and ArCas12a exhibited moderate activity. In contrast, the other Cas12a variants displayed notably lower activity (Fig. 5c and Supplementary Fig. 13c). Most of these Cas12a proteins display considerable cis-cleavage activity but differ somewhat from LbCas12a in trans-cleavage activity. The ion preference assay reveals that these Cas12a proteins can be activated by Mn2+, similar to LbCas12a34. Divalent Mg ions are ineffective in activating the trans ssDNA cleavage activity of low-activity Cas12a variants, whereas Mn2+ activates their trans DNase activity (Fig. 5d and Supplementary Figs. 13d and 14). To investigate the genome-editing ability of the candidate Cas12a proteins in eukaryotic cells, we selected 6 target sites with canonical PAMs, which can be recognized by all the tested Cas12a proteins (Fig. 5e and Supplementary Table 9). AmCas12a exhibits an average editing efficiency of 49.6% across six sites, with remarkable peaks at sites 3 (85.4%) and 6 (84.9%).
In contrast, EvCas12a_2 displays an average editing efficiency of 20.3%, with its highest performance observed at site 1 (25.8%). RspCas12a_2 and RbrCas12a_2, which lack integrases in their loci, yield modest average editing efficiencies of 14.3% and 17.8%, respectively, with notable peaks at site 3 (26.3% and 37.3%, respectively). ArCas12a shows an average editing efficiency (45.4%) comparable to that of AmCas12a, with a notable peak at site 3 (75.8%). LeCas12a_1 shows an average editing efficiency of 6.2% and a maximum efficiency of 25.7% at site 2. UBACas12a exhibits nearly negligible editing efficiency, with the highest activity reaching 2.1%. At site 4, CAGCas12a and LeCas12a_2 demonstrate peak genome-editing efficiencies of 81.7% and 73.8%, respectively, with mean editing efficiencies of 28.8% and 26%. AsCpf1 attains an average editing efficiency of 65.5%, with its maximum at site 6 (84.7%). Finally, LbCas12a shows an average editing efficiency of 25.6% and a maximum efficiency of 53.5% at site 6.

a, b Cleavage of dsDNA by Cas12a subtypes at 37 °C (a) and 25 °C (b). c Trans-cleavage of ssDNA by Cas12 subtypes using fluorescence-labeled ssDNA reporter. d Divalent cation preference of the Cas12a variants. Colors within the heatmap, ranging from dark blue to white, indicated the trans-cleavage activity from high to low. Time-course kinetic analyses are shown in Supplementary Fig. 14. e Cellular gene editing efficiency on targeting sites. Two sites were selected from each of FANCF, EMX1, and DNMT1. The statistical significance was calculated using LbCas12a as a reference at each site. The detailed sequences were listed in Supplementary Table 9. Error bar indicates mean ± s.e.m. measured from three technical replicates. n = 3. Statistical significance was assessed using a two-tailed unpaired t-test. Source data are provided as a Source Data file.
The AmCas12a–crRNA binary complex
The protein sequence identity of the 16 candidate Cas12a proteins to AsCas12a, FnCas12a, and LbCas12a is low, ranging from 30% to 46% (Fig. 6a and Supplementary Fig. 15). In the three-dimensional structural landscape, Cas12a proteins within the same subclade exhibit a high degree of structural similarity. However, AmCas12a deviates subtly from its subclade I Cas12a counterparts (Fig. 6d, f and Supplementary Fig. 15).

a Domain organization of the AmCas12a protein. Detailed protein sequences and alignments are provided in Supplementary Fig. 19. The REC1, REC2, PI, WED, BH, RuvC, and Nuc domains were highlighted with distinct colors. b The cartoon representation of the structure of the AmCas12a–crRNA and schematic of the crRNA used for structural analysis. The nucleotides of crRNA are labeled with numbers. c The structure of AmCas12a revealed by cryo-EM (PDB: 8KGF, EMDB: EMD-37219). The structural comparison with known Cas12a and other variants is shown in Supplementary Fig. 17. The structural domains were distinguished according to the color codes at the bottom. d The RMSD matrix of Cas12 structure models constructed by AlphaFold2. Colors within the heatmap from dark blue to white represent the RMSD values from high to low. e Interaction network of crRNA with residues in AmCas12a. The detailed interactions of crRNA seed regions with AmCas12a were shown in Supplementary Fig. 18. f The AlphaFold2 structure models of the Cas12a proteins used in this paper. g Mismatch analyses of AmCas12a. Error bar indicates mean ± s.e.m. measured from three technical replicates. n = 3. Source data are provided as a Source Data file.
To understand the molecular details underlying the RNA binding behavior of AmCas12a, we obtained a cryo-EM map of the crRNA-bound complex, which consists of AmCas12a and a 44-nt crRNA, at 2.9 Å resolution (Fig. 6b, c, Supplementary Figs. 16 and 17, and Supplementary Table 10). The AmCas12a–crRNA structure maintains a bilobed architecture (Fig. 6c), similar to other Cas12a structures35,36. Nonetheless, the AmCas12a–crRNA complex adopts a distinct conformation compared with its counterparts. Specifically, the REC domain of AmCas12a is rotated relative to that in the LbCas12a–crRNA and FnCas12a–crRNA complexes. Relative to LbCas12a and FnCas12a, the REC1 domain of AmCas12a deviates by 7.3° and 9.4°, respectively, and the REC2 domain by 4.8° and 6.2°, respectively (Supplementary Fig. 17d, e).
As observed in the LbCas12a and FnCas12a crRNA binary structures, the repeat-derived pseudoknot in the 5′ handle of the crRNA is ordered. However, the crRNA conformation is markedly different from that of the crRNA bound by LbCas12a or FnCas12a. Due to its flexibility, the spacer-derived part of the crRNA is largely unresolved in the Cas12a–crRNA binary complexes35,36. Notably, an extra RNA stem formed by A(1)–A(5) and U(18)–U(22) within the crRNA spacer region makes part of the spacer region, including the seed sequence, well defined in the central cavity of AmCas12a, where it adopts an A-form-like helical conformation, whereas nucleotides A(−10)–G(−6) and G(6)–A(15) of the crRNA are unresolved (Fig. 6b and Supplementary Fig. 18). To accommodate the double RNA stem substrate, the REC lobe of AmCas12a rotates away from the NUC lobe. Unsurprisingly, docking the crRNA onto the AlphaFold2-generated AmCas12a model causes a severe clash in the REC domain (Supplementary Fig. 15c). The conformation of the extra RNA stem is stabilized by multiple interactions between the ribose and phosphate moieties of the crRNA backbone and specific residues within the WED, REC1, and RuvC domains of AmCas12a (Fig. 6e). These include residues T19, H751, K522, and H861 from the WED domain, Y50 and R168 from the REC1 domain, and Q1003 from the RuvC domain, all of which are conserved among Cas12a orthologs except Q1003, which forms a hydrogen bond with the phosphate of U(18) (Supplementary Fig. 18). Distinct from the FnCas12a–crRNA complex, the spacer segment of the crRNA mainly interacts with the WED domain of AmCas12a.
The divalent Mg ions occupy the same location as in the LbCas12a–crRNA and FnCas12a–crRNA complexes (Supplementary Fig. 17a–c). Consistent with a seed sequence-dependent mechanism of DNA targeting and in broad agreement with previous analyses of AsCas12a and LbCas12a activities in vivo and FnCas12a activities in vitro35,37,38, cleavage of DNA substrates with single-nucleotide mismatches in the seed segment was almost completely impaired, while mismatches in the PAM-distal region of the DNA target were mostly tolerated (Fig. 6g).
Specific detection of single-nucleotide mutation by AmCas12a
Cas12a is a promising tool for next-generation molecular diagnosis; however, it suffers from the PAM limitation39. An oncogene SNP offers only a small sequence window to probe, and the traditional PAM, TTTV, cannot cover all SNPs. Therefore, we tested whether AmCas12a can distinguish SNPs without a traditional PAM (Fig. 7a). The oncogene mutant KRAS c.34G>T (G12C) does not contain an available TTTV in the adjacent sequences (Fig. 7b). Among the Cas12a proteins that had undergone PAM preference testing, AmCas12a, EvCas12a_2, CAGCas12a, and RbrCas12a_1 showed potential for recognizing the G12C mutation. The results revealed that AmCas12a exhibited the best performance (Supplementary Fig. 20). We designed crRNAs targeting the SNP (Fig. 7b). According to the fluorescence intensity, we selected the crRNA inducing the strongest signal, i.e., crRNA 1 for the KRAS mutant (Fig. 7c). AmCas12a can detect ten copies of the KRAS mutant (Fig. 7d). Furthermore, we diluted the target mutant and evaluated the sensitivity of detection. AmCas12a can even distinguish 0.1% KRAS mutant in the wild-type gene background, which is more sensitive than Sanger sequencing (Fig. 7e, f).

a Scheme of single-nucleotide mutant detection by Cas12a. b Synthetic crRNA for single-nucleotide KRAS mutation based on the PAM preference of AmCas12a. The single-nucleotide polymorphism (SNP) site was highlighted in red. c AmCas12a detection of KRAS G12C with various crRNAs and Mn2+. d Detection limit of KRAS mutant using recombinase polymerase amplification (RPA) integrated with Cas12a. The fluorescent images and fluorescence intensity of the 15-min reaction were shown. The copy numbers of the target DNA were shown on the x-axis. e Sensitivity of the AmCas12a detection. KRAS mutant DNA was spiked in the wild type sequences with various ratios, which were shown in the x-axis. f Sanger sequencing results of wild-type KRAS and mutant with different ratios. NC represented the negative control without target DNA. Error bar indicates mean ± s.e.m. measured from three technical replicates. n = 3. Statistical significance was assessed using a two-tailed unpaired t-test. Source data are provided as a Source Data file.