Isoform analysis of heterozygous putative splicing variants at the allele level using nanopore long-read sequencing

A concise pipeline for allele-discriminative isoform evaluation

To analyze isoforms at the allele level using nanopore long reads, we developed a pipeline (Fig. 1a and b) that utilizes a recently developed bioinformatics tool called FLAIR (Full-Length Alternative Isoform analysis of RNA). FLAIR calculates and depicts isoforms from nanopore long reads aligned to the reference genome and gene models4. Our pipeline consists of two main parts.

Fig. 1

Pipeline for isoform analysis of heterozygous putative splicing variants at the allele level using nanopore long-read sequencing. (a) The principle of processing splicing variants and long-reads with respect to alleles in this pipeline. We predicted putative splicing variants based on short-read whole-genome sequencing data. We divided nanopore long reads obtained through direct RNA or cDNA sequencing into two alleles (“Allele 1” or “Allele 2”) based on allele-informative SNVs (orange square pinned to Allele 1). Depending on whether the splicing variant is on the reads of transcripts (for example, exonic splicing variant (red circle pinned to Allele 1)) or not (intronic splicing variant, blue circle, which does not have information on which allele it belongs to), allele assignment (to define the allele which the splicing variant is associated with) or allele separation (to divide long-reads into two alleles based on allele-informative SNV), respectively, was conducted. Subsequently, allelic difference and statistics of isoforms were calculated per splicing variant and allele-informative SNV pair. (b) The structure of the pipeline. The Inputs to the pipeline are long-reads, splicing variants, and allele-informative SNVs (green blocks in “Input data” section). In the first part of the pipeline, for each putative splicing variant, we identified nanopore long reads spanning the variant and searched for candidate allele-informative SNVs that provide allelic information to the long reads. After removing reads with low base quality at the allele-informative SNVs and the SNVs with low coverage, we collected sets of nanopore reads separated by allele-informative SNVs, the associated putative splicing variant, and the allele-informative SNVs. We permitted long reads that satisfy one of two criteria: (1) they cover a single allele‑informative SNV—including splicing variants that simultaneously act as allele‑informative SNVs (denoted by the orange square and red circle at the same coordinate)—or (2) they cover multiple allele‑informative SNVs. In the second part of the pipeline, FLAIR is implemented for each set to calculate and visualize isoform lists, and to derive comparative statistics between alleles.

The first part of the pipeline processes input data into a summary table and allele-separated .bam (read) files after several filtering steps. Specifically, this part of the pipeline requires three input files: (1) A list of heterozygous putative splicing variants to be evaluated, annotated with snpEff13 on whole-genome sequencing data and added with the result of splicing site predictions by SpliceAI14(2) A list of heterozygous variants used for discriminating between the two alleles, which we refer to as “candidate allele-informative single nucleotide variants” (candidate allele-informative SNVs or simply allele-informative SNVs), and (3) Long-read cDNA or direct RNA sequencing data mapped against the reference genome (GRCh38). The lists of heterozygous putative splicing variants in this study were prepared with snpEff annotation and SpliceAI predictions. Because snpEff can be used to markedly narrow down variants to be evaluated by SpliceAI, it is useful especially for a quick check. For more comprehensive listing of splicing variants, annotation with snpEff was not used in narrowing down and only delta score by SpliceAI was used. Delta score in the results of SpliceAI is defined as the maximal probability of splicing alteration within the window (default set as 50 bp from the variant in the version used) by acceptor gain, acceptor loss, donor gain, and donor loss, and ranges from 0 to 1. For example, a variant chr8:129750591 C > T (ENST00000276708.9, c.944-21G > A) in GSDMC, was predicted to have delta scores such as (acceptor gain (at -2 bp position) 0.95 | acceptor loss (-21 bp) 0.92 | donor gain (4 bp) 0.00 | donor loss (-43 bp) 0.00), showing an example of high delta score in a splicing acceptor site. In contrast, another variant chr5:168468948T > C (ENST00000265293.9, c.3276-3T > C) in WWC1, was predicted to have delta scores such as (acceptor gain 0.27 | acceptor loss 0.00 | donor gain 0.00 | donor loss 0.00) at positions (3|-2|37|-38), showing an example of lower delta score as a splicing acceptor site and score 0 as a splicing donor site. Thus, each variant within the defined gene regions in the genome has delta scores as a result of the calculation. We applied different delta score thresholds for SpliceAI (0.2, 0.5, and 0.8) to generate lists of splicing variants with high recall (low precision), moderate recall/precision, and high precision (low recall), respectively. For each splicing variant, we filtered long reads that spanned the coordinate of the splicing variant and checked whether 10 or more long reads contained a base (A/T/G/C) and had base quality ≥ 10 (90.0% reliability) at each candidate allele-informative SNV using jvarkit15.

Our pipeline automatically counts available numbers of allele-informative SNVs, including splicing variant itself in case of exonic splicing variant or intronic variant expressed when intron retention occurred. If two or more variants are available and covered by long reads, then whatshap16a bioinformatics tool for haplotype calculation, assigns haplotypes. A result table “PN_minpvalue005_spvar_corrected.tsv” contains information on haplotypes. When the haplotype is calculated for the splicing variant and allele-informative SNVs, result folder contains phased variant call file (splicing variant-associated phased.vcf.gz), in which genotype is added by whatshap, such as “0|1” or “1|0” for each variant included. By looking at this information, one can see phases of each variant within this haplotype block. When haplotype information is not created or does not harbor the splicing variant (such as in case of intronic splicing variant) in the phased.vcf.gz file, one can still see separation of long-reads and how the two alleles of the allele-informative SNV are associated with separated isoforms (Fig. 1a, right panel). In the case of Patient2 later described, independent genomic analysis enabled haplotyping of a novel intronic splicing variant with the known missense variant outside of the cDNA amplicon. Interpreting the result of our pipeline in combination with genomic analysis thus can be used to help understand allele-specific isoform changes. The output of this first part of the pipeline consists of paired .bam files for each allele, splicing variant, and allele-informative SNV or haplotype, which are preserved for further calculation in the second part.

In the second part of the pipeline, these paired .bam files for alleles and associated gene names are used to calculate isoform differences between alleles using FLAIR. Both parts of the pipeline employ parallelization for faster computation (details are provided in the Methods section). The final outputs include PN_minpvalue005_spvar_corrected.tsv (a table with each line consisting of splicing variant, considered allele-informative SNV (or haplotype), long-read coverage for Ref and Alt allele of allele-informative SNV, lowest p-value for isoform changes by FLAIR, and haplotype information) (the bottom blue block in Fig. 1b), isoform classification diagrams, and isoform frequency data for each .bam file pair corresponding to each splicing variant and allele-informative SNV set. The detailed information for preparation of input files and output folders and files are described in Supplementary Note 5 and 6.

Isoform evaluation at the allele level for nanopore direct RNA sequencing

We first tested the pipeline on previously published transcriptome data from nanopore direct RNA sequencing of GM128786 (a HapMap project sample widely used as a standard human lymphoblastoid cell line), combined with short-read whole-genome sequencing data. From the whole-genome sequencing data, we predicted 401 heterozygous putative splicing variants using SpliceAI with a delta score threshold of 0.5 (moderate sensitivity and precision) without narrowing down variants by snpEff annotation. When this list of heterozygous putative splicing variants was used as input for the pipeline, we filtered 144 splicing variants which had at least one associated allele-informative SNV or haplotype and proceeded with FLAIR isoform analysis in the second part of the pipeline.

As a result, we identified 14 heterozygous putative splicing variants with statistically significant (p-value, Bonferroni corrected < 0.05) isoform differences between alleles, including IFIH1, a gene noted for having markedly different isoform composition between alleles in the original study6 (Fig. 2a and e, and Supplementary Table 1). In addition to the splicing variant in IFIH1, another heterozygous splicing variant in MMAB, which was also among the 34 genes found to have a unique isoform per allele in the published study, was shown to be associated with isoform changes (Fig. 2b and f, and Supplementary Table 1).

Fig. 2
figure 2

Identified splicing variants and allelic isoform differences in nanopore direct RNA sequencing and 5’ cap-trapping full-length cDNA sequencing (CTR-seq). (a-d) Diagrams of calculated isoforms, and bar-graph their respective isoform usage (e-h). Each isoform model, signified as a track in diagram, was supported by multiple long-reads along the parameters in FLAIR implementation. Isoforms labeled with Ensemble id (“ENST + number”) were those identified by FLAIR as registered transcript. Other isoforms without Ensemble id represent those not identified as registered in the gene model but exist in the sample. IFIH1 (a and e) and MMAB (b and f) are heterozygous splicing variants with associated allelic isoform differences, as calculated in the pipeline from direct RNA sequencing data. These variants were among previously reported loci with allelic isoform differences in the original article. The allelic isoform frequency differences are notable at the loci (e and f). The same colors between the diagram and corresponding isoform usage bar graph represent the same isoform. For instance, exon 8 is absent (highlighted by the dashed line in a) in the blue-colored isoform (ENST00000648433.1), which is almost exclusively expressed in one allele (e). (c, d, g, and h) Identified splicing variants using CTR-seq, chr6:26368051G > A (c.-6 + 1G > A; NM_007047.5; BTN3A2) in Control1 (c and g) and chr4:15722935G > A (c.851 + 1G > A; NM_004334.3; BST1) in Patient2 (d and h), which were both associated with skipping of the neighboring exon in one of the alleles, were shown (dark-brown colored transcript in (c) and (g), and blue colored transcript in (d) and (h), respectively).

We then tested the pipeline on the Nanopore long-read direct RNA sequencing data from cancer cell lines K562 (leukemia) and HepG2 (hepatocellular carcinoma) from The Singapore Oxford Nanopore Expression Project (SG-NEx)7combined with short-read whole-genome sequencing data (see details in the Method section). We predicted 70 and 101 heterozygous putative splicing variants, respectively in K562 and HepG2, using SpliceAI with a delta score threshold of 0.5 after narrowing down variants with snpEff annotation to only variants associated with splicing. As a result of our pipeline implemented with the same parameters as in GM12878, we identified two and two heterozygous putative splicing variants with statistically significant isoform differences between alleles, respectively in K562 and HepG2 (Supplementary Tables 2 and 3). The two genes, GIPC1 and RIPK2, where the two heterozygous splicing variants with isoform changes in K562 were found, of interest, are involved in tumor growth or invasion17,18 while the two genes, ULK3 and XRCC4, where the two heterozygous splicing variants with isoform changes in HepG2 were found, are known to be elevated in squamous cell carcinoma and associated with cancer susceptibility19,20.

Isoform evaluation at the allele level for full-length nanopore cDNA transcriptome

To further test our pipeline, we conducted full-length 5’ cap-trapped and poly-A captured whole transcriptome nanopore long-read sequencing (CTR-seq). This approach, proven effective in a previous study8 was applied to total blood RNA from three subjects: a healthy control (Control1), a patient with undiagnosed cerebellar ataxia (Patient1), and a patient with biopsy-proven McArdle disease who harbors both a novel splicing variant and a known missense variant (Patient2, the proband (II-1) in Supplementary Fig. 1a, b, and c). The clinical characteristics of the three subjects, along with detailed information about Patient2, are provided in Supplementary Note 1 and 3 and discussed further in the next section.

The sequencing statistics, including read number and quality, were comparable across the samples (Supplementary Table 4), but the average read length was relatively shorter in Patient1 (median read length: Control1, 1608 bp; Patient1, 1317 bp; Patient2, 1721 bp), possibly due to differences in sample quality. The bioinformatics pipeline was implemented as described earlier. The number of heterozygous putative splicing variants as inputs, predicted by SpliceAI with delta score 0.5 or higher, were 420 (Control1), 389 (Patient1), and 457 (Patient2). Of these, 143, 100, and 152 cumulative combinations of filtered splicing variants and associated allele-informative SNV(s) or haplotypes were further analyzed for isoform differences.

As a result, 25, 7, and 15 unique splicing variants, respectively, were found to have significantly distinct isoform compositions between alleles (p-value, Bonferroni correction < 0.05) within each subject (Supplementary Tables 5, 6, and 7). For example, heterozygous splicing variants such as chr6:26368051G > A (c.-6 + 1G > A; NM_007047.5; BTN3A2) in Control1 and chr4:15722935G > A (c.851 + 1G > A; NM_004334.3; BST1) in Patient2 were associated with skipping of the neighboring exon in one of the alleles (Fig. 2c, d, g, and h).

As for the 7 splicing variants identified in Patient1 with cerebellar ataxia (Supplementary Table 6), there is not a gene which is directly linked to neurodegeneration other than XRCC4, which was reported to cause microcephaly and progressive ataxia21. However, the allele frequency of the splicing variant in XRCC4 is high (Mean Allele Frequency of 0.7011) in East Asian population in gnomAD22 and is not considered to be causative for a rare disease. Although there might be another unidentified genetic variant that is responsible for Patient1, other underlying causes, such as epigenetic modifications, might also be possible.

And among 15 splicing variants identified in Patient2 with McArdle disease (Supplementary Table 7), no variants were found in PYGM, which is the causative gene for McArdle disease. A variant (chr1:179889309G > A) in TOR1AIP1, which is known to cause limb-girdle muscular dystrophy23 was found, but it has very high allele frequency (0.6240) in global population and unlikely to cause such rare disease. No other variant that can account for muscular pathology in this patient was found.

Parameter consideration for the pipeline

To further optimize the parameters of the pipeline, we examined how they affect the analysis results for nanopore direct RNA sequencing (GM12878) and CTR-seq (Control1, Patient1, and Patient2). First, we explored how the delta score thresholds for SpliceAI, ranging from 0.2 to 0.8, influenced the number of predicted heterozygous splicing variants for each subject. As described by the SpliceAI developers, the confidence in splicing variant prediction increases as the delta score increases: 0.2 for high recall (low precision), 0.5 (recommended threshold) for moderate sensitivity and precision, and 0.8 for high precision (low recall). We observed an approximately five-fold reduction in the number of predicted heterozygous splicing variants when increasing the delta score threshold from 0.2 to 0.5, with a milder reduction when increasing the threshold from 0.5 to 0.8 (Fig. 3a). This means that more splicing variants are put in the pipeline if the lower stringent delta score threshold is used. We observed that more heterozygous splicing variants which were finally found to be associated with statistically significant changes (corrected p-value) remained when lower delta score threshold was used (Fig. 3b).

Fig. 3
figure 3

Parameter consideration for the pipeline. (a) The number of heterozygous splicing variants predicted by SpliceAI. (b) The number of splicing variants calculated to be with statistically significant isoform changes. (c) The ratio of splicing variants with isoform changes to all splicing variants analyzed in the first part of the pipeline (Fig. 1). (d) The ratio of splicing variants with isoform changes to all splicing variants analyzed by FLAIR in the second part of the pipeline. Each data series for GM12878, Patient1, Patient2, and Control1 is represented by red, green, cyan, and orange dots, respectively.

Next, we investigated how the delta score threshold affects the likelihood that the tested splicing variants are associated with statistically significant isoform changes between alleles. We found that heterozygous splicing variants predicted with higher delta score thresholds tended to be more likely to be associated with statistically significant changes (corrected p-value < 0.05) in isoform content, as determined by FLAIR analysis (Fig. 3c). This trend was consistent between the delta scores of 0.2 and 0.5 across all samples, but was not observed between 0.5 and 0.8 in Patient1. We observed this trend both in the ratio of splicing variants with statistically significant isoform changes to all splicing variants analyzed in the first part of the pipeline (Fig. 1), as well as in the ratio of splicing variants with significant isoform changes to those filtered and analyzed by FLAIR in the second part of the pipeline (Fig. 3d). Thus, more predicted heterozygous splicing variants are to be subjected to evaluation in the pipeline when lower threshold for delta score is used, and result in finding increasing number of splicing variants with significant isoform changes, but at a lower rate.

We also examined the distribution of read lengths covering splicing variants with significant isoform changes between alleles, comparing it to the distribution of read lengths covering splicing variants that were not filtered out in the first part but did not show significant isoform changes in the FLAIR analysis. No consistent or clear pattern was observed across four samples (GM12878, Control1, Patient1, and Patient2) regarding whether longer reads were more likely to contribute to splicing variants with significant isoform changes (Supplementary Fig. 2). Given that the read length is inherently limited by the cDNA length, the trivial differences between the reads covering the splicing variants with significant isoform changes and those without may not be impactful (Supplementary Fig. 2), as long as the read spans the splicing variant and the altered exon-intron junction.

Isoform evaluation of PYGM at the allele level by nanopore cDNA amplicon sequencing in a patient with McArdle disease

Sanger sequencing of PYGM in Patient2 revealed a novel intronic variant (PYGM, NM_005609.3; c.1519-3T > G, chr11:64752507 A > C (GRCh38)) within intron 12–13, located 3 bp upstream of the start of exon 13, as well as a known pathogenic missense variant (c.347T > C, chr11:64758514 A > G (GRCh38), p.Leu116Pro) in exon 3 (Supplementary Fig. 1b). These two variants were confirmed to be in a compound heterozygous state, using CRISPR/Cas9 enriched targeted nanopore sequencing12 (Supplementary Fig. 1d and e). SpliceAI predicted that the novel intronic variant c.1519-3T > G was highly likely to affect splicing (delta scores: acceptor gain 0.98, acceptor loss 0.98, donor gain 0, donor loss 0), and it also had a CADD PHRED score of 23.6, above the commonly used threshold of 20 (top1%)24. A muscle biopsy specimen analyzed via reverse transcription-polymerase chain reaction (RT-PCR) and subsequent cloning further supported the presence of mis-spliced transcripts (Supplementary Fig. 3).

The PYGM variants in Patient2 did not pass the criteria of the pipeline for CTR-seq data, due to insufficient read coverage for the two variants. There were only two reads in the novel splicing variant of PYGM, and even when we implemented the pipeline with the lowest parameters (1 read for allele-informative SNV, and 1 read for FLAIR collapse step), it still could not catch the isoform changes associated with the novel splicing variant, because of too few reads. This limitation prompted us to further analyze PYGM transcripts using cDNA amplicon sequencing to detect isoform changes more sensitively. To overcome the limited availability of muscle specimen, we analyzed whole blood RNA samples using nanopore long-read sequencing of a 2 kb cDNA amplicon encompassing exons 9–20 of PYGM. The sashimi plot revealed exon 13 skipping and intron 12 retention, but the overall picture of aberrant splicing and isoform changes remained unclear (Fig. 4a).

Fig. 4
figure 4

Isoform evaluation of PYGM at the allele level by nanopore cDNA amplicon sequencing in a patient with McArdle disease. (a) Sashimi plot of nanopore cDNA amplicon sequencing in Patient2 (with a putative splicing variant and a missense variant) and three controls. While splicing abnormalities, such as intron 12 retention, were observed in some reads, the complete isoform content in Patient2 remains unclear. (b and d) Comparison of isoform contents between samples without allele separation or assignment, showing the exon 13 skipping isoform (marked by an asterisk in d) in Patient 2. The letter C denotes the canonical transcript. (c and e) Allele-assigned isoform analysis showed that the allele with the c.1519-3T > C splicing variant led to various splicing abnormalities, including exon 13 skipping (asterisk), intron 12 retention (double asterisks), and retention of both introns 12 and 13 (triple asterisks). (f) Predicted molecular consequences of the c.1519-3T > C splicing variant in Patient2. Exon 13 skipping is predicted to produce a truncated protein, while retention of intron 12 or both introns 12 and 13 is expected to result in premature termination (asterisk signifying a stop codon which terminates intron12 translation). (g) Pairwise structural alignment of AlphaFold2-predicted myophosphorylase protein, which lacks the R507_Q540 region (34 amino acids) due to exon 13 skipping caused by the novel splicing variant, compared to the wild-type protein. Left: Cartoon structure showing the wild-type protein in blue and the mutant protein in brown. Right: Surface structure showing the wild-type protein in orange and the mutant protein in green. Yellow rectangles indicate the R507_Q540 portion of the wild-type protein (in silver) and the truncated region in the mutant protein (arrows).

When comparing PYGM isoforms between Patient2 and three controls (clinical details in Supplementary Note 2) without separating/assigning alleles, exon13-skipped isoform (asterisk in Fig. 4b and d) was detected in 10.4% of the reads in the patient. Further evaluation of isoforms at the allele level within Patient2 more clearly showed detailed isoform changes: A part of cDNA amplicon reads covered the novel splicing variant by intron retention and other two allele-informative SNVs, and made it possible to assign the reads into the two alleles. Transcripts from the allele harboring the novel splicing variant (allele1_c.1519-3T > G in Fig. 4e) tended to produce exon13 skipping (22.3% of the reads, shown as single asterisk in Fig. 4c and e) or retention of intron12 (7.9%, double-asterisk), or both introns 12 and 13 (2.4%, triple-asterisk). The latter two cases were predicted to result in out-of-frame insertions of 568–899 bp, leading to premature termination of translation (Fig. 4c, e, and f). Exon13 skipping was predicted to affect the superficial structure of the PYGM protein, as modeled by alphafold225 (Fig. 4g).

Calculation time

As for calculation time on the pipeline, it was 160.4, 22.6, 3.9 h, respectively, for delta score thresholds of 0.2, 0.5, and 0.8 on the direct RNAseq data of GM12878. Also, when viewed from user + sys CPU time, they were 1690.9, 260.2, 36.5 h, respectively (Supplementary Table 8). Thus, the resulting computation time increased as the number of input variants increased in lower delta score threshold.

Therefore, the real time was achieved by parallelization using GNU-parallel26 in the first half of the pipeline and GNU-parallel plus FLAIR (internal parallelization) in the second half, on average about 10 cores working in parallel, when we set number of GNU-parallel threads as 10 in the first half and 6 in the second half.

Continue Reading