Variable efficiency of nonsense-mediated mRNA decay across human tissues, tumors and individuals | Genome Biology

Individual-level quantification of NMD efficiency

We quantified the individual-level NMD efficiency (iNMDeff) by utilizing matched RNA-seq and whole-exome sequencing (WES) data from two cohorts: (i) tumor tissues in The Cancer Genome Atlas (TCGA), comprising 9,766 tumor samples from 33 different cancer types, and (ii) normal tissues from the Genotype-Tissue Expression (GTex) consortium, consisting of 979 individuals across 56 tissues, totaling 17,382 samples (with a median of 19 tissues per donor).

We devised two orthogonal methodologies to estimate iNMDeff in each tumor or normal sample (Additional File 1: Fig. S1A). Both are implemented using a negative binomial regression (see Methods) to model either transcript counts or allele expression counts, comparing two groups: NMD targets versus controls. Each method was subjected to rigorous filters to ensure that the set of transcripts/variants and individuals used were less confounded by other sources of variability (see Methods).

Endogenous target gene (ETG) method

We started with a set of genes that had been experimentally identified as endogenous NMD targets, collated from various studies that perturbed NMD activity in human cell lines [5, 6, 13, 14, 51] (see Methods). For each gene, we categorized its transcripts into NMD targets and control transcripts based on containing predicted NMD-triggering features (see Methods and Additional File 1: Fig. S1B). Here, we relied on two known features: (i) uORF at the 5’ UTR, and (ii) at least one splice site (or EJC) within the 3’ UTR. Next, for each NMD target gene, we selected two transcripts, one NMD target and the other NMD non-target, serving as an internal control. Our iNMDeff estimate is the regression coefficient of the indicator variable stating whether the pooled transcripts within an individual are either NMD target (1) or control (0), from a negative binomial regression (see Methods); the iNMDeff can be interpreted as a negative log (base e) ratio of the expression levels of the NMD target transcripts divided by the expression levels of the control transcripts.

We calculated the iNMDeff for the different NMD gene sets from each study separately (see Methods): “NMD Karousis” [13], “NMD Colombo” [5], “NMD Tani” [6], and “NMD Courtney” [51], and we further used the transcripts tagged as “nonsense_mediated_decay” from the Ensembl gene annotation as an “NMD Ensembl” set. Additionally, genes found in at least 2 independent studies constituted a “NMD Consensus” gene set, and the union of genes across all studies was the “NMD All” gene set. For our control groups, we selected two sets of genes at random (excluding genes reported as NMD targets in any of the experimental studies mentioned above), selecting from genes with and without NMD-triggering features in their transcripts: “RandomGenes with NMD features,” and “RandomGenes without NMD features.” The former should behave similarly as the experimental NMD gene sets but is based only on computational prediction, while the latter is a negative control gene set.

Allele-specific expression (ASE) method

We used exonic coding germline variants (population minor allele frequency, MAF, < 20%) to define three variant sets: (i) “NMD-triggering PTCs” resulting from nonsense variants and from frameshifting indels; (ii) “NMD-evading PTCs” also resulting from nonsense and indels; (iii) synonymous variants. The latter two were used as negative control sets. For each variant, the mutated allele counts observed in RNA-seq reads were used as NMD target levels and the wild-type allele counts at the same locus were used as control levels. We calculated the iNMDeff for each individual using the three aforementioned NMD variant sets separately. Similarly as in the ETG method, all variants within the individual were pooled to estimate the genome-wide iNMDeff, obtained from the regression coefficient of the indicator variable stating whether the variants are either mutated (1) or wild-type (0) (see Methods). The iNMDeff interpretation is that it is a negative log (base e) ratio of the raw RNA-seq allele counts of the mutant versus wild-type variants.

In summary, we computed the NMD efficiency for each individual—iNMDeff—multiple times, varying the NMD gene/variant set used, in TCGA (Additional File 2: Table S1) and in GTex (Additional File 2: Table S2) cohorts; including the negative controls, this amounted to 11 NMD gene sets for the ETG method, and 3 NMD variant sets for the ASE method. Henceforth, the ETG gene sets and the ASE variant sets will be collectively referred to as “NMD sets”. 

Robustness of the individual-level NMD efficiency estimates

We focused on the ETG sets generated by combining genes from multiple studies—the stringent “NMD Consensus” and the permissive “NMD All” gene sets—while the ETG iNMDeff for the rest of the NMD gene sets are detailed in Additional File 1: Fig. S2A (TCGA) and Additional File 1: Fig. S2B-C (GTex). In TCGA and GTex, the iNMDeff estimates from the ETG method (Fig. 2A and Additional File 1: Fig. S2A-B) and from the ASE method (Fig. 2B and Additional File 1: Fig. S2C) show various trends that imply they are reliable. For instance, the higher-confidence “NMD Consensus” ETG gene set displayed a readout of slightly higher efficiency (i.e. the log fold-difference in expression between the NMD target transcript and its non-NMD counterpart in the same gene) than the more permissive “NMD All” set (p < 2e − 16, Mann–Whitney U test), which presumably contains weaker NMD target genes. The negative control gene set “RandomGenes without NMD features” shows negligible NMD efficiency (p < 2e − 16, when compared to “NMD Consensus”), with the median close to zero (Fig. 2A). See Additional File 3: Text S1 for more details.

Fig. 2

Individual-level and tissue-level quantification of variable NMD efficiency. A, Estimation of individual-level NMD efficiency (iNMDeff) using the ETG method across 9,766 TCGA tumor samples, showcasing three NMD gene sets: NMD All, NMD Consensus, a random gene set with NMD-triggering features (“RandomGenes w/ NMD feat.”); and as a negative control, a random gene set without NMD features (“RandomGenes w/o NMD feat.”). B, iNMDeff estimations using the ASE method for two NMD variant sets: NMD-evading and NMD-triggering PTCs, alongside a non-NMD variant set as a negative control (Synonymous). ****p < 0.0001, by two-sided Mann–Whitney U test for A and B. C, Gene expression levels (TPM) of two well-known NMD targets (RP9P and GAS5) compared against the ETG iNMDeff (using the “NMD Consensus” gene set) sorted from lowest to highest along the X-axis in GTex. D, Cell line ETG NMD efficiency (cNMDeff) estimated in HeLa (n = 8), HepG2 (n = 2), and K562 (n = 2) cell lines using the “NMD Consensus” gene set (n = 130). cNMDeff was calculated using negative binomial regression, similar to our iNMDeff main analysis (see Methods). Comparison between UPF1 knockdown (KD) and wild-type (WT) conditions showed consistent reduction in NMD efficiency across all cell lines. P-values were calculated using paired t-tests. Barplots and 95% confidence intervals represent the mean across cells. E, Variability in ASE iNMDeff between two example individuals from TCGA is depicted by the RNA variant allele frequencies (VAF) of coding germline PTCs, estimated as the proportion of mutant (MUT) allele RNA-seq counts over the total RNA-seq counts (MUT + wild-type (WT) counts) at the specific PTC loci. The individual with high iNMDeff (orange) has lower RNA VAF or due to NMD degradation of the MUT allele, and vice versa for the low iNMDeff individual (magenta). F, Spearman correlation between tissue-level NMD efficiency rankings based on median ETG iNMDeff and median ASE iNMDeff values per tissue (rank 1 denotes the tissue with highest median iNMDeff), for the GTex cohort. Tissues are grouped based on cell-of-origin: Nervous system-related tissues (Pan-nervous), Kidney-related tissues (Pan-kidney), Reproductive system tissues (Pan-reproductive), Gastrointestinal tissues (Pan-GI), and those originating from Squamous cells (Pan-squamous)

To further support the concept of ETG method for estimating iNMDeff, we compared the iNMDeff estimates (from “NMD Consensus” set) with the expression of two well-known NMD target genes, RP9P and GAS5 [5, 13]; of note, we excluded them from our methodology above to avoid circularity in this analysis. Encouragingly, individuals with high iNMDeff individuals with high RP9P and GAS5 in tumor samples from TCGA (Additional File 1: Fig. S2D) and healthy tissue samples from GTex (Fig. 2C). Additionally, the gene expression of key NMD factors were analyzed. For NMD genes UPF1, UPF2, UPF3A, UPF3B, SMG1, SMG5, SMG6, SMG7, and SMG9, but not SMG8, a similar pattern to RP9P and GAS5 was observed in both cohorts (Additional File 1: Fig. S2D-E). This is consistent with the negative feedback loop and cell-type-specific inherent in NMD factors, where inhibition of NMD by siRNA was reported to upregulate the mRNAs of all key NMD genes except for UPF3A (SMG8, and SMG9 were not tested) [52]. The autoregulation is maintained through internal NMD-triggering features in the genes for the various NMD factors [52]. This correlation was less notable or even absent for EJC genes MAGOH, RBM8A, EIF4A3, and CASC3 (Additional File 1: Fig. S2F-G).

To validate our ETG method for iNMDeff estimation, we sourced RNA-seq data from three cell lines (HeLa, HepG2, K562) from different studies [5, 53, 54] and processed it using the same methodology as our main analysis (see Methods). Using our “NMD Consensus” set of NMD target genes, we applied negative binomial regression to estimate cell line NMD efficiency (cNMDeff), paralleling the iNMDeff in our main analysis (see Methods, Additional File 2: Table S3). We observed a remarkable separation of cNMDeff scores between UPF1 knock-down (KD) and wild-type conditions (Fig. 2D). As predicted, UPF1 KD significantly reduced NMD efficiency compared to wild-type cells in HeLa cells (p = 1.4e − 06, n = 8 replicates), with consistent effects in K562 (p = 2.2e − 02, n = 2) and HepG2 (p = 0.11, n = 2). Notably, every UPF1 KD cell showed lower cNMDeff than any wild-type cell. As an additional analysis to confirm the robustness of the ETG Consensus NMD gene set, we performed leave-one-out resampling analysis on our 130 NMD Consensus genes, consistently reproducing the UPF1 KD versus wild-type differences (Additional File 1: Fig. S3A), demonstrating that our results are not driven by an outlying gene but rather stem from a consistent property of the NMD target gene set.

PRO-seq analysis of nascent RNA transcripts from three cell lines [55, 56] showed no significant differences in transcription initiation rates between NMD and non-NMD target transcripts, including when stratified by brain expression levels (Additional File 1: Fig. S3B), confirming that our ETG measurements reflect NMD activity rather than transcription rates (see Additional File 3: Text S1).

Next, as a test of reliability of the ASE iNMDeff estimates, we considered the differences in distribution of efficiency across the 3 sets of variants with different expected NMD efficiency (Fig. 2B). While ASE-derived individual level NMD efficiency is observable when calculated using “NMD-triggering PTCs”, when these ASE estimates are derived from negative control variant sets, such as “NMD-evading PTCs” or “Synonymous”, the NMD efficiency drops to almost negligible levels (p < 2e − 16 for both variant sets, Mann–Whitney U test, when comparing to “NMD-triggering PTCs,” Fig. 2B).

To illustrate the ASE iNMDeff estimation, we present examples of raw allele-specific mRNA expression counts at PTC variant loci in two representative colon adenocarcinoma MSI (COAD_MSI) tumors from TCGA (Fig. 2E). After applying our ASE filtering criteria for germline PTC variants (see Methods), we retain four PTC variants with notably low RNA VAF or RNA-seq counts of their alternative alleles, relative to the reference allele, for the individual “TCGA-F4-6856,” who was classified as having a high iNMDeff. This is consistent with the rapid degradation characteristic of an active NMD pathway. In contrast, the individual “TCGA-A6-6781,” who exhibits a low iNMDeff, bears ten PTC variants, in 8 of which the mRNA counts of reference and alternative alleles are proportionally balanced, consistent with a heterozygous variant. This broadly consistent lack of ASE across 10 PTCs implies a global reduction in the NMD pathway activity in that individual (at least in the sampled tissue), affecting many PTC variants therein.

Agreement between the two methods to estimate NMD efficiency

If both ETG and ASE methods are reflecting the true NMD activity of the individual, then a correlation between the two iNMDeff estimates would be anticipated. We proceeded to estimate pan-cancer and pan-tissue correlations between iNMDeff for ASE (using “NMD-triggering PTCs” variant set) and ETG (using “NMD Consensus” set) (Additional File 1: Fig. S4A-B), revealing a statistically significant correlation (Pearson p < 2–16 in both cases). The same trend was observed when correlations were calculated for each tissue or cancer stratified into subtypes, with positive correlations in 76 out of the 101 tested tissues/cancers (Additional File 1: Fig. S4C-D).

That ETG and ASE method agree well is seen upon stratifying the various samples by bins of iNMDeff. The top decile (samples with lowest ASE iNMDeff) of TCGA patients has a median scaled ETG iNMDeff of − 0.19, while the bottom decile (samples with highest ASE iNMDeff) has a median scaled ETG iNMDeff of 0.20 (difference at p = 4.12e − 13 by Mann–Whitney U test, Additional File 1: Fig. S4E). Similarly, in GTex the top ASE decile has a median ETG iNMDeff of − 0.25 while the bottom ASE decile has ETG iNMDeff = 0.34 (p < 2e − 16, Additional File 1: Fig. S4F). Correlations between the ETG and ASE estimates of iNMDeff across individuals are increasing as more stringent filtering criteria are applied to the ASE variants (from ~ 0.2 up to ~ 0.7 on GTex, see Additional File 1: Fig. S5A-B), presumably due to reduced noise due to low counts of PTC variants in some samples (see Additional File 3: Text S2). When pooling samples by tissue, we observe the correlation between ETG and ASE methods of R = 0.69 (GTex tissues, Fig. 2F) and R = 0.49 (TCGA tumor types, Additional File 1: Fig. S6A), supporting the concordance between the two NMD efficiency estimation methods, and suggesting systematic differences between tissues, which will be further tested below.

We also tested correlations between ASE iNMDeff and ETG iNMDeff and other biological covariates or technical variables, including considering separately our NMD gene/variant sets. The “NMD Consensus” ETG gene set correlates the most to the ASE iNMDeff (Additional File 1: Fig. S4A-B and S6B-C), and thus we used it as the default ETG method gene set (List of transcripts in Additional File 2: Table S4).

In summary, we developed two genomic methodologies for quantifying individual-level NMD efficiency—the ASE and ETG methods—which rely on RNA-seq signal in germline NMD-triggering PTCs variants, and in the “NMD Consensus” experimentally determined set of NMD target genes, for ASE and ETG methods, respectively. We demonstrated the robustness of these methods by various approaches, allowing us to use the ASE and the ETG method to investigate variation in iNMDeff across different human tissues and cancer types.

Significant variability in NMD efficiency across human tissues and individuals

Assessing inter-tissue variability of NMD efficiency

Next, we applied our data set of iNMDeff estimates for 27,148 different tumoral and normal tissue samples to rigorously test the hypothesis that NMD efficiency varies across human tissues more than expected at chance. We grouped the iNMDeff estimates by cancer types in TCGA (Additional File 1: Fig. S7A), and by type of normal tissue in GTex (Additional File 1: Fig. S7B), and tested for differences between the tissues using a randomization test (see Methods for details and schematic from Fig. 3A). The test statistic we used, termed “Inter-Tissue iNMDeff Variability Deviation” (ITNVD), is a measure of how much the iNMDeff differences between tissue medians deviate from expectation if the iNMDeff values were randomly distributed across samples from different tissues. If there is no tissue-specific variability in iNMDeff, the ITNVD value should be close to 0.

Fig. 3
figure 3

Significant variability in NMD efficiency across human tumors, normal tissues, and individuals. A Schematic of the randomization tests, illustrating the methodologies for assessing NMD efficiency variability: Tissue iNMDeff Deviation (TND) and Inter-tissue iNMDeff Variability Deviation (ITNVD) (see Methods). B Displays ITNVD test scores for each NMD gene set in the ETG method (bottom panel) and NMD variant set in the ASE method (top panel). Scores for non-NMD negative control sets are also shown for comparison. Data are presented for both GTex (left) and TCGA (right) cohorts. Positive scores indicate variability in iNMDeff across tissues or cancer types, with stars (*) denoting statistical significance at p < 0.05, by randomization test. C,D show TND test scores for each cancer type (or subtype) within TCGA (C) and each normal tissue within GTex (D). Positive TND values indicate that a tissue exhibits higher iNMDeff than what would be expected by chance, and vice versa for negative values. Tissues are grouped based on cell-of-origin: Nervous system-related tissues (Pan-nervous), Kidney-related tissues (Pan-kidney), Reproductive system tissues (Pan-reproductive), Gastrointestinal tissues (Pan-GI), and those originating from Squamous cells (Pan-squamous). The groups of tissues were ordered based on the median TND scores, arranging them from the highest to the lowest median scores, top to bottom. Complete names of tissue acronyms can be found in Additional File 2: Table S5. E Scaled ETG and ASE iNMDeff estimates in low-grade glioma (LGG) and glioblastoma (GBM) brain tumors from TCGA, categorized by genetic or histological subtypes (first two panels, ordered by median ETG iNMDeff). It also displays the neuron cell type proportions derived from bulk RNA-seq deconvolution (third panel) and their pan-cancer Pearson correlation (R) with ETG iNMDeff (fourth panel). F Mirrors the analysis in E but focuses on GTex brain tissues, stratified by different subregions, showcasing the variability of NMD efficiency in normal brain tissue contexts and its correlation with neuron cell type proportions

The test was applied across both ASE and ETG estimated iNMDeff in the TCGA and GTex cohorts. We observed a significant, moderate variability deviation across both cancerous tissues in TCGA, and across normal tissues in GTex (Fig. 3B and Additional File 1: Fig. S7C). Encouragingly, the NMD gene/variant sets exhibited a much higher variability deviation than the negative control gene/variant sets, with a mean ITNVD of 0.27 versus 0.01 (p = 3.5e–3, t-test). Specifically, the ETG “NMD Consensus” set demonstrated the largest deviation (ITNVD = 0.62 in GTex and 0.26 in TCGA, both with p = 5e − 4), and this was replicated in the ASE “NMD-triggering PTCs” variant set (ITNVD = 0.15 in GTex with p = 3e–3, and ITNVD = 0.07 in TCGA with p = 5.3e − 3). In contrast, the variability deviation in the negative control sets was negligible: the ASE of “Synonymous” variant set showed a ITNVD of 0.01 in GTex and 0.03 in TCGA, and the ETG “RandomGenes without NMD-triggering features” set had a ITNVD of 0.02 in GTex and 0.03 in TCGA. These results suggest there is an important non-random iNMDeff variability associated with tissue and/or cell type identity, observed across both cancers and normal tissues.

A recent study suggested no differences in NMD efficiency between human tissues [57]. To reconcile this with our results we compared our ranking of tissues, ordered by the tissue median iNMDeff, with the ranking published by Teran et al. [57], who used ASE of PTCs from GTex data (Additional File 1: Fig. S8A). We found a strong correlation between their tissue rankings and our results, using either our ASE (R = 0.75, p = 7e − 10) or our ETG estimates (R = 0.70, p = 1.62e − 8). Our analysis demonstrates this differential NMD efficiency across tissues is significant.

Significant differences between tissues in NMD efficiency

To identify which tissues and cancer types have the most and least efficient NMD and if their differences from other tissues are significant (Additional File 1: Fig. S7A-B), we compared the variation in iNMDeff within each specific tissue to the inter-tissue variation. We devised a statistic of Tissue-wise iNMDeff Deviation (TND, Additional File 2: Table S6-S7), with a baseline and a statistical significance derived from a randomization test (see Methods and schematic in Fig. 3A). The resulting distribution of deviations (Fig. 3C,D) suggests the tissue variability in NMD efficiency, where positive TND values indicate that a tissue exhibits higher iNMDeff than what would be expected by chance, and vice versa for negative values. Notably, 38 out of 54 tissues, and 19 out of 32 cancer types show a significant TND (p < 0.05 by randomization test) by the ETG method (for brevity, a selection is shown in Fig. 3C,D, and the complete List is in Additional File 1: Fig. S9A-B). The test for significant tissue-specificity of NMD efficiency replicates for 43% of these tissues in the ASE method as well. Next, we have categorized these into five broad primary groups for ease of interpretation: Nervous system-related tissues (Pan-nervous), Kidney-related tissues (Pan-kidney), Reproductive system tissues (Pan-reproductive), Gastrointestinal tissues (Pan-GI), and those originating from Squamous cells (Pan-squamous) as denoted in a previous study [58].

The Pan-GI tissues displayed the highest iNMDeff in comparison to randomized values, i.e. positive TND, replicated across both the ETG and ASE NMD methods and in both TCGA and GTex cohorts. Specifically in TCGA (Fig. 3C), colon, stomach, and rectum adenocarcinomas (COAD, STAD, READ, respectively) exhibited significantly higher ETG iNMDeff than random (TND varying 0.03–0.16 for the 3 GI tissues, p = 3.5e − 2 to 1.2e − 3). A similar pattern was observed in the GTex cohort (Fig. 3D) for normal colon (TND = 0.09, p = 9.7e − 4) and stomach (TND = 0.08, p = 9.7e − 4) tissues. Notably, MSI cancer subtypes, such as COAD_MSI and STAD_MSI, demonstrated higher iNMDeff compared to their non-MSI tumor counterparts (Additional File 1: Fig. S9C-D), placing them among the cancers with the highest ETG iNMDeff (TND = 0.18, p = 1.2e − 3 and TND = 0.12, p = 2.3e − 3, respectively). Esophagus mucosa (ESPMCS) normal tissue displayed significantly higher iNMDeff in both ETG (TND = 0.17, p = 9.7e − 4) and ASE methods (TND = 0.08, p = 2.7e − 3); however, esophageal cancer showed a lower iNMDeff for both squamous cell carcinoma and adenocarcinoma (ESCA_scc and ESCA_ac, respectively; Fig. 3C) types, suggesting that some cell types might undergo a change in NMD efficiency during transformation from normal tissue into cancer.

In contrast to the Pan-GI, tissues classified as Pan-reproductive generally exhibited lower iNMDeff (Fig. 3C,D and Additional File 1: Fig. S9A-B) than expected based on randomization, i.e., negative TND. This trend was consistent across both TCGA and GTex cohorts and observed in both ETG and ASE methods. For example, in GTex’s Pan-reproductive normal tissues: ovary, cervix ectocervix (CVXECT), and cervix endocervix (CVSEND) all showed lower iNMDeff in ETG (TND = − 0.16, − 0.19, and − 0.20; p = 9.7e − 4, 8.7e − 3, and 8.7e − 3, respectively) and in ASE this replicated for ovary (TND = − 0.09, p = 2.2e − 2). A similar trend was observed in TCGA, with ovarian serous cystadenocarcinoma (OV, ETG TND = − 0.05, p = 1.2e − 3, ASE TND = − 0.02, p = 4.7e − 1) and cervical squamous cell carcinoma and endocervical adenocarcinoma also exhibiting lower iNMDeff in both methods (CESC, ETG TND = − 0.04, p = 1.4e − 2, ASE TND = − 0.08, p = 0.12).

For breast normal tissue, a slightly lower trend towards lower iNMDeff was observed in ETG (non-significant in ASE), aligning with trends in TCGA for breast invasive carcinoma in the basal subtype and the normal-like subtype (Fig. 3C), but not in the other subtypes (BRCA_LumA, BRCA_LumB, BRCA_Her2). This suggests that cancer types with very diverse subtypes, such as breast cancer, may display differences in NMD efficiency between subtypes. For other normal Pan-reproductive tissues/cancers see Additional File 3: Text S3.

Analysis of Pan-squamous and Pan-kidney tissues overall did not yield significant and/or consistent results across the two NMD methods, in either TCGA tumors or GTex tissues (Fig. 3C-D). See Additional File 3: Text S3 for these and the rest of tissues/cancer types.

As an additional approach to assess if NMD efficiency varies across tissues, we created a linear model predicting iNMDeff, including tissue or cancer type, and other covariates (see Methods). In this model, the removal of the cancer type/tissue variable significantly reduced the explained variability (R2), more so than removal of any of the other variables (Additional File 2: Table S8). This was evident in both ETG and ASE methods, where the full model explained 17.1 and 58.7% of variability in TCGA and GTex cohorts, respectively, dropping to 6.3 and 13%, respectively, upon the tissue variable removal.

Lower NMD efficiency in the tissues of the nervous system

Notably, the Pan-nervous group of tissues exhibited lower observed iNMDeff (Additional File 1: Fig. S7A-B) compared to randomized expectations (Fig. 3C,D), and this result remained robust when equalizing the number of samples across tissues (Additional File 1: Fig. S10). In the GTex cohort, almost all brain subregions showed a significant reduction in iNMDeff, many with large effect sizes, as observed in the TND statistic in one or both NMD methods. Specific examples replicating in both iNMDeff methods include the cerebellum (BRNCHA, ETG TND = − 0.48, p = 9.7e − 4; ASE TND = − 0.17, p = 2.8e − 3), the cerebellar hemisphere (BRNCHB, − 0.49, 9.7e − 4; − 0.15, 2.8e − 3), along with the cortex (BRNCTXA, − 0.12, 9.7e − 4; − 0.08, 1.4e − 2), and further the frontal cortex BA9, hippocampus, and the substantia nigra show significance in one of the NMD estimation methods (ASE or ETG) and a consistent trend in the other method. Additionally, nervous-related tissues such as the pituitary (PTTARY) and nerve/tibial (NERVET) displayed reduced iNMDeff in both methods (Fig. 3D). In TCGA data, two brain cancer types that arise from glial cells are available for comparison, and the same significantly reduced iNMDeff was evident in low grade glioma (LGG, − 0.14, p = 1.2e − 3, − 0.14, p = 8e − 3). This was further seen in the nervous system-associated pheochromocytoma and paraganglioma tumors (PCPG, − 0.06, p = 1.2e − 3, − 0.11, p = 0.12), for both ETG and ASE methods.

Next, we hypothesized that different cell types within the nervous system might have different NMD efficiencies. To determine if the strong reduction in iNMDeff observed in normal and cancerous brain tissues is specifically linked to neuronal cells or to glial cell types, we utilized cellular composition deconvolutions from GTex tissues bulk RNA-seq data, as estimated by Donovan et al. [59]. Analyzing the neuron cell type proportions in each brain tissue type, ranked by the median ETG iNMDeff, revealed that tissues with lower ETG iNMDeff had a higher proportion of neurons (Fig. 3E). Notably, the two brain samples with the lowest iNMDeff, the cerebellum regions BRNCHA and BRNCHB, had the highest neuron proportions (medians of 65 and 77%, respectively). The correlation between neuron proportion and ETG iNMDeff across all brain subregions was negative (R = − 0.57, Fig. 3E).

In TCGA tumor samples, we performed a deconvolution using the UCDBase method [60] and analyzed glioblastoma (GBM) and LGG tumor samples (see Methods). We stratified these cancer types into subtypes based on cell of origin or mutation type information[61] (Fig. 3F). The cancer subtype with the highest estimated neuron cell type proportion (median of 16%), LGG_Codel (characterized by the co-deletion of chr 1p/19q), exhibited the lowest iNMDeff (median ETG − 0.64), while at the other end, GBM_LGm6 (characterized by the DNA methylation cluster LGm6) had a median neuron proportion of 2% and a median ETG iNMDeff of 0.21. The ETG iNMDeff correlation with estimated fractions of neurons across these cancer subtypes was also negative (R = − 0.30, Fig. 3F).

As an additional analysis, we assessed the expression of the neuron-specific marker RBFOX3 [62], which was negatively correlated with ETG iNMDeff (TCGA: R = − 0.31, p = 4.4e − 15; GTex: R = − 0.30, p < 2.2e − 16), whereas the glia-specific marker AQP4 [63] was positively correlated with iNMDeff (TCGA: R = 0.15, p = 2.2e − 4; GTex: R = 0.32, p < 2.2e − 16) supporting that neural cells in specific, rather than glia or other cell types infiltrating these tumor samples, are the cell types with lowered NMD activity in the central nervous system (Additional File 1: Fig. S11A-B). This provides an explanation why between the two major brain cancer types in TCGA, the LGG has lower iNMDeff compared to the GBM (median neuron proportion of 10% vs 3%, and median ETG iNMDeff of − 0.52 vs 0.04).

In sum, our analysis underscores the significant variability of NMD efficiency across tissues, with high iNMDeff in digestive tract tissues, and low iNMDeff in the reproductive and nervous system tissues, the latter being associated with neuron content of the particular tissue.

Extensive inter-individual variation of NMD efficiency

Having established there exist significant differences in NMD efficiency between human tissues, we next asked whether NMD efficiency is variable between different individuals, compared to a baseline variation between different variants within the same individual. For this, we derived the measure of NMD efficiency for a set of ASE PTCs (termed pNMDeff for PTC NMD efficiency, as detailed in Methods) in the TCGA separately for somatic and germline variants, and additionally for GTex germline variants. Next, we compared the intra-individual variation (different PTCs in the same individual) with a randomized baseline, and then the inter-individual variation (same PTC across individuals) with a control randomization (Additional File 1: Fig. S11C).

The randomization controls display higher variances than either inter-individual pNMDeff (excess variance = 0.83–1.01 across the three datasets, p = 2.6e − 7 to 1.3e − 83) or intra-individual pNMDeff (excess variance = 0.41–0.78, p = 1.2e − 1 to 1.7e − 36). The two tests for non-random variation suggest that each PTC has some inherent property determining its pNMDeff, and that this property can be shared across different PTCs within an individual but is different between individuals. The former test result is not unexpected given the various known rules of NMD efficiency depending on PTC placement [1, 8]; however, we note that here we selected those PTCs that should trigger NMD by the known rules, yet there is still systematic variation, therefore suggesting that additional NMD rules may exist. The latter test result means that the PTCs between individuals are more consistent (less variability) in pNMDeff than expected at chance, suggesting mechanisms governing individual-level NMD efficiency. Compared to the randomized controls, we observe a significant reduction in variance for both tests, reflecting the underlying rules that govern NMD efficiency. This finding underscores two key points: first, that there exists unexplored systematic variation in NMD efficiency between different PTCs even after accounting for known NMD rules, and second—more relevant to the focus of our study—that there is substantial variation in NMD efficiency between individuals.

As an additional statistic to support this, when examining the Spearman correlations of pNMDeff between PTCs (Additional File 1: Fig. S11D): the randomly selected pairs of samples for the same PTC (inter-individual) had a notably higher correlation than randomized controls, and similarly so for the randomly selected pairs of different PTCs within individuals (intra-individual) (p < = 2.2e − 16 for both, Additional File 1: Fig. S11D). It bears mentioning that, notwithstanding this systematic variability in NMD efficiency both across PTCs and across individuals, much of the NMD efficiency across different contexts is preserved, compatible with the broadly conserved activity of the RNA surveillance via NMD. These findings motivated us to explore mechanisms that might underpin the inter-individual variability in NMD efficiency.

Somatic pan-cancer CNA signatures are associated with NMD efficiency

A comprehensive analysis of genetic associations between somatic mutations and iNMDeff across TCGA tumors, analyzing over 6.4 million tests across various cancer and NMD-related genes (see List in Additional File 2: Table S9) and cancer types, revealed only two significant associations: a missense variant in the TLX1 gene in lung adenocarcinoma (LUAD; linked with enhanced iNMDeff) and a truncating variant in the CDH1 gene in breast invasive carcinoma (BRCA; linked with decreasing iNMDeff). See Additional File 3: Text S4, Additional File 1: Fig. S12 and Additional File 2: Table S10 for details.

We next proceeded to test associations of iNMDeff with somatic copy number alterations (CNAs). Given that CNAs often manifest as broader alterations affecting many neighboring genes up to an entire chromosome arm, we devised a custom method for somatic CNA association analysis. Instead of testing the gene-level CNA associations, which displayed inflation in our QQ plots (Additional File 1: Fig. S13A-B), possibly because of extensive genetic linkage between CNA of neighboring genes, our method considers recurrent large-scale, multi-gene CNAs as monolithic units.

To do so, we performed a sparse principal component analysis (sparse-PCA) on the pan-cancer gene-level CNA data (see Methods, Additional File 1: Fig. S14A), identifying 86 principal components, termed CNA principal component signatures (CNA-PCs, Additional File 2: Table S11). Notably, the leading (i.e. higher variance explained) 47 CNA-PCs represented recurrent larger-scale alterations, often spanning approximately arm-level changes, while the subsequent CNA-PCs pinpointed more localized recurrent CNA events (Additional File 1: Fig. S14B-C). Thereafter, we tested associations between the individual’s weights of these CNA-PC signatures and our iNMDeff, at the pan-cancer and cancer type level (see Methods). Utilizing ASE method for discovery and ETG for validation, we identified 6 pan-cancer and 3 cancer-type specific (in UCEC and PCPG) CNA-PCs significantly associated with iNMDeff at FDR < 10% (List in Additional File 2: Table S12). We further retained cases where direction of effect was consistent between ETG and ASE iNMDeff methods, resulting in 3 robust pan-cancer associations with iNMDeff (Fig. 4A): CNA-PC3 (ASE FDR = 1.3e − 2; ETG FDR = 1.2e − 37), CNA-PC52 (2.2e − 2; 3.1e − 9), and CNA-PC86 (8.7e − 2; 8.6e − 37). Encouragingly, these recurrent CNA signatures exhibited consistent direction in their iNMDeff associations across various cancer types (Additional File 1: Fig. S15). Of note, the CNA-PCs 3 and 86 exhibited overlapping genome-wide patterns of CNA gains across chromosome segments (Pearson’s R of 0.79).

Fig. 4
figure 4

Somatic pan-cancer CNA signatures are associated with NMD efficiency. A Pan-cancer CNA principal component signatures (CNA-PCs) associations with ASE iNMDeff across pan-cancer and individual cancer types, with the effect sizes shown as beta coefficients from linear models (see Methods). The vertical dashed red Line indicates a coefficient of zero, and the horizontal dashed red Line marks the significance threshold at 10% FDR for pan-cancer analysis. Three CNA-PCs (3, 52, and 86) showing significant and replicated pan-cancer associations are color-highlighted across cancer-types and labeled in the pan-cancer analysis. B CNA-PC3 reflects gene amplifications across chromosome 1, ordered by genome location, plotted along the X-axis, with amplifications assessed by averaging GISTIC CNA scores for each gene across TCGA samples (Y-axis). Samples are categorized into three bins based on scores from the pan-cancer CNA-PC3 signature: “High” showing complete chromosome 1q arm-level gain (gene dosage scores ~ 1), “Mid” with more localized gain in the 1q21.1–23.1 region (gene dosage ~ 0.8, with remaining chromosome at ~ 0.4), and “Low” showing no notable alterations in chromosome 1q (gene dosage scores ~ 0). C The left panel presents a scaled ETG iNMDeff of individuals classified by CNA-PC3 groups as “High,” “Mid,” and “Low.” The right panel provides a validation in cell lines from the CCLE dataset (Achilles project), categorized by 1q CNA status: gain, neutral, and deletion; NMD was assessed with the proxy model for ETG cNMDeff based on gene expression data (see Methods). P-values are calculated through a two-sided Mann–Whitney U test. D Gene-level iNMDeff scores are calculated as the median iNMDeff across individuals with focal CNA amplifications for each gene on chromosome 1, arranged by genomic location. The top panel shows pan-cancer CNA frequencies per gene in TCGA. The middle panel displays gene-wise iNMDeff scores, with red intensity indicating co-amplification frequency with NMD factor gene SMG5. The bottom panel zooms into the 1q21.1–23.1 region, highlighting a reduction in iNMDeff when amplified, with some selected genes categorized as: Other, Candidates, NMD-related (NMD genes outside the 1q21.1–23.1 region) and Candidates NMD (NMD genes within the 1q21.1–23.1 region). E Candidate gene prioritization from chromosome 1q using two scoring criteria: Correlation between gene expression and ETG iNMDeff (Y-axis) or CNA amplification (X-axis). Candidate genes are expected in the bottom-right quadrant, indicating sufficient correlation between expression and iNMDeff (negative) and between expression and CNA amplification (positive). Vertical and horizontal dashed black Lines denote the thresholds for selecting candidates. Genes within 1q21.1–23.1 are shown in orange, with NMD genes (SMG5, RBM8A, SF3B4, INTS3) highlighted in red, and remaining chromosome 1q genes in black. Of 30 genes meeting thresholds, 20 reside within the 1q21.1–23.1 region (18 “Candidates” and 2 “Candidates NMD” genes: SMG5 and INTS3), and 10 are from other 1q regions. Example candidate gene PMF1 and known NMD-factor SMG5 are displayed in the bottom panels, showing gene expression vs. ETG iNMDeff (bottom left) or CNA amplification (bottom right). F Simplified clustering heatmap of CRISPR KO co-dependency scores for 6 “Candidates,” 2 “Candidates NMD” (SMG5 and RBM8A plus SF3B4 and INTS3), and 21 “NMD-related” genes from chromosome 1q (complete clustering found in Additional File 1: Fig. S20). Of note, SF3B4 and RBM8A, despite not meeting ETG iNMDeff thresholds, were included as an additional “Candidates NMD” based on its related functions in NMD (spliceosome and EJC, respectively) and its location within 1q21.1–23.1. The analysis focuses on the 1q21.1–23.1 region, revealing a distinct cluster of candidate genes alongside NMD factor genes. Co-dependency scores have been normalized using the Robust PCA (RPCO) onion method [64], where generally, higher scores denote stronger gene functional associations

Chromosome 1q somatic copy number gain associates with reduced NMD efficiency

To further examine these three recurrent CNA signatures, we categorized individuals based on their scores for each CNA-PC. For CNA-PC3, individuals in group “High” predominantly exhibited an arm-level gain of chromosome 1q (gene dosage GISTIC scores ~ 1, Fig. 4B). In contrast, group “Mid,” with intermediate CNA-PC3 scores, displayed a more localized gain around the 1q21-23.1 region (gene dosage GISTIC ~ 0.8, with ~ 0.4 towards the 1q end). The group “Low,” with the lowest, ~ 0 scores in the sparse PCA, lacked notable CNA alterations at 1q (gene dosage change ~ 0). Interestingly, CNA-PC86 mirrored the pattern observed in CNA-PC3 but with a narrower focal gain at 1q21-23.1 (Additional File 1: Fig. S16A), thus both PC3 and PC86 recurrent CNA signatures reflect chr 1q gains, albeit at different scales. The individuals in groups “High” and “Mid” (by CNA-PC3 scores) who exhibit chr 1q gains, demonstrate a decrease in both ETG-estimated iNMDeff (Fig. 4C) and ASE (only “High” group, Additional File 1: Fig. S16B) and compared to those in group “Low,” who lack any notable CNA alterations (median ETG iNMDeff = 2.07 vs 2.12, p = 5.1e − 10; median ASE iNMDeff = 0.49 vs 0.53, p = 4.7e − 2, both two-sided Mann-Whitney tests comparing “High” against “Low”). A similar pattern was observed for individuals classified based on CNA-PC86 scores (Additional File 1: Fig. S16C).

To validate this 1q somatic CNA gain association with NMD efficiency in an additional data set, the GTex is not Suitable since it does not contain somatic alterations. Instead, we used the transcriptomic and CNA data of 1450 human cell lines from the Cancer Cell Line Encyclopedia (CCLE) [65]. We built proxy models of our ETG iNMDeff (see Methods) using global gene-level expression data to predict the NMD efficiency for every cell line (cNMDeff, Additional File 2: Table S13). The proxy models were trained in either TCGA or GTex cohorts (cross validation R2 = 0.4 and 0.82 for TCGA and GTex, respectively, Additional File 1: Fig. S17A). We replicated our findings above (Fig. 4C and Additional File 1: Fig. S17B): cell Lines with 1q gains exhibited a decreased cNMDeff in comparison to diploid cells (p = 0.02 and non-significant, for TCGA-based and GTex-based proxy models of cNMDeff, respectively). This reduction was observed more clearly when comparing cells with 1q gains to those with 1q deletions (p = 6.9e − 4 and p = 0.02, for TCGA and GTex models respectively).

We next asked which cancer types were most influenced by the iNMDeff changes Linked with chromosome 1q CNA gains. To study this, we analyzed the distribution of iNMDeff in individuals across the Low/Mid/High groups, for the identified CNA-PC signatures, considering each cancer type separately (Additional File 1: Fig. S18A). BRCA, liver hepatocellular carcinoma (LIHC), cholangiocarcinoma (CHOL), and LUAD had over 50% of their samples falling into group “High” for CNA-PC3, indicating that 1q gains are common in breast, Liver, and lung cancers. This is also reflected in their iNMDeff. For instance, for 62% of LUAD samples that had a 1q gain, two-thirds (67%) of those also had a lower ETG iNMDeff than the median ETG iNMDeff across the LUAD samples with no gain. These 1q gains were proposed to be selected because they increase dosage of the MDM4 oncogene [66], whose product downregulates the p53 protein post-translationally [67], hence phenocopying the loss of the tumor suppressor TP53 [68]. However, there could be other oncogenes in chr 1q that are driving this gain.

When examining the more localized 1q21-23.1 amplification represented by CNA-PC86, 10–13% of the tumors from BRCA, LUAD, and LIHC were found in Group “High” (Additional File 1: Fig. S18B), and associated with reduced iNMDeff. Furthermore, other cancers, including uterine carcinosarcoma (UCS), lung squamous cell carcinoma (LUSC), OV, and skin cutaneous melanoma (SKCM), also exhibited high incidences of both these recurrent 1q gain CNA-PC3 and PC86 (Additional File 1: Fig. S18A-B). We note this narrower CNA segment does not include the MDM4 oncogene, which is located at 1q32.1, thus other genes must be driving this recurrent genetic change in CNA-PC86, with possible candidates being known cancer genes BCL9 (1q21.2), ARNT (1q21.3), SETDB1 (1q21.3), or SF3B4 (1q21.2).

Identifying NMD-associated genes via focal CNA analysis

While CNA gains can affect numerous dosage-sensitive genes in one event, only a subset of the genes or a single gene within the CNA segment are expected to be directly implicated in the observed phenotype, here the NMD deficiency. Within the 1q arm reside three known NMD factors and two NMD-related genes identified in experimental and computational studies [69, 70]: RBM8A (located at q21.1), a core component of the EJC; SMG5 (q22) and SMG7 (q25.3), both core NMD factors; SF3B4 (q21.2), a known splicing factor and oncogene; and INTS3 (q21.3), involved in snRNA transcription. To prioritize candidate genes responsible for the NMD deficiency associated with 1q gains, we computed the median iNMDeff of individuals having a focal copy number amplification for each gene separately; considering only focal CNA events for this analysis helps distinguish between the Linked genes. Mapping these gene-wise scores along chr 1, we observed that tumor samples with CNA gains in genes proximal to the 1q21-23.1 region (Fig. 4D, top panel) have a pronounced reduction in ETG iNMDeff scores (Fig. 4D, bottom panel) and similarly in the ASE iNMDeff scores (Additional File 1: Fig. S19). Notably, the NMD-related genes mentioned above, except for SMG7, are situated within this narrower 1q21-23.1 region.

To narrow down potential candidate NMD-modulating genes within this gene-dense region, we employed two scoring criteria across all genes from chr 1q. First, we calculated the correlation between gene expression—presumably the downstream effect of the CNA relevant to the phenotype—and ETG/ASE iNMDeff. Second, as a supporting test for prioritization, we assessed the correlation between gene expression and CNA. The correlation between these two scores (Fig. 4E and Additional File 1: Fig. S20A for ASE) is stronger within our target region (R = −0.41) than outside (R = − 0.2), supporting that this segment harbors causal genes for reduced iNMDeff. We established thresholds for selecting candidates based on the two criteria (Fig. 4E), and among the 30 genes that passed the thresholds, 20 were from within the 1q21-23.1 region (“Candidates”), including INTS3 and SMG5 (“Candidates NMD” genes). Parsimoniously, the CNA of the core NMD factor SMG5 might generate the effect on NMD efficiency; however, we considered the hypothesis that CNA of other genes within this segment might affect NMD efficiency.

To pinpoint the potential causal gene(s), we considered genetic interactions inferred from gene-level CRISPR screening data obtained from the Achilles Project [71], allowing to infer functional links between genes by studies of codependency profiles across cell lines. Here, we used the de-biased data (onion method, Robust PCA (RPCO)), with improved power for inferring gene function [64] (see Methods). Our hypothesis posits that if these candidate genes from 1q21-23.1 are indeed implicated in NMD efficiency, their CRISPR knockout (KO) fitness effects should correlate with established NMD factor genes like UPF1, UPF2, UPF3B, SMG1, and others; the codependency implies a functional link.

Therefore, we clustered and visualized these codependency CRISPR scores involving our set of candidate genes (18 “Candidates” and 4 “Candidates NMD”), 21 “NMD-related” genes, and 18 random negative “Control” genes on chr 1q but outside the 1q21.1–23.1 region (see Methods; note that KIAA0907 was excluded from the analysis due to lack of CRISPR data). SF3B4 and RBM8A, despite not meeting ETG iNMDeff thresholds, were included as an additional “Candidates NMD” based on their related functions to NMD (spliceosome and EJC, respectively) and its location within 1q21.1–23.1. This revealed a prominent cluster of functional interactions (Additional File 1: Fig. S20B) comprising NMD factor genes such as SMG5, SMG6, SMG7, UPF1, and UPF3B, suggesting that the CRISPR RPCO scores [64] are powerful for identifying NMD factors. A simplified visualization (Fig. 4F) focusing on 6 candidate genes co-clustering with NMD factors, and excluding controls, showed two genes from 1q21.1–23.1, PMF1 and GON4L, showing notable CRISPR codependency scores with SMG5 (0.060 and 0.006, for PMF1 and GON4L, respectively), SMG6 (0.020 and 0.006), and SMG7 (0.019 and 0.012). Full set of genes is visualized in Additional File 1: Fig. S20B. For scale, the CRISPR codependency score between SMG5 and SMG7 is 0.077, and the majority of scores with control genes are 0. Furthermore, VPS72 and PRPF3 exhibited codependency with SMG5 (0.080 and 0.011, respectively), and so might have relevance.

In a more detailed quantitative analysis, we examined all 274 genes within 1q21.1–23.1. For each, we computed the mean CRISPR codependency scores with 10 core NMD factor genes and compared these with the mean CRISPR score for the same gene with the previous 18 random negative control genes on 1q. We identified 14 genes with FDR < 5% (one-sided Mann–Whitney U test). Notably and as expected, 3 of these genes are directly involved in NMD (SMG5, RBM8A, and SF3B4) (Additional File 1: Fig. S20C). The 4 genes identified above by a visual inspection of the clustering—PRPF3 (FDR = 4.6%, one-side Mann–Whitney U test), GON4L (4.6%), VPS72 (3.1%), and PMF1 (2.5%)—also emerged as significant in this analysis. Additionally, 7 genes not considered as candidates due to not meeting our initial threshold criteria (correlations with gene expression and CNA) were significant here. Interestingly, 10 of these 14 genes are related to a RNA metabolic process (see Discussion). Overall, our analyses of gene overexpression and genetic interactions underscore that the 1q21.1–23.1 genomic region harbors multiple genes with potential to modulate NMD activity upon CNA gains, some of which have not been previously identified as NMD factors, with prime candidates being PMF1 and GON4L.

Additionally, we analyzed the CNA-PC52 signature, also associated with reduced NMD efficiency and characterized by low-level gains in chromosome 2q31.1-2q36.3. This region, commonly altered in several cancer types (especially testicular germ cell tumors), contains four NMD-related genes (CWC22, SF3B1, NOP58, and FARSB). Further analysis supported the possible role of genes in this region in modulating NMD efficiency (see Additional File 3: Text S5 for details).

Rare germline variant association analysis identifies genes associated with NMD efficiency

We hypothesized that an additional cause for inter-individual variation in NMD efficiency could be germline genetic variation across individuals, affecting NMD-relevant genes. To test this, we conducted a gene-based rare variant association study (RVAS, see schematic from Fig. 5A), relying on the SKAT-O test [72] (see Methods).

Fig. 5
figure 5

Rare deleterious germline variants are associated with NMD efficiency. A Overview of our rare variant association analysis (RVAS), illustrating the SKAT-O framework for identifying associations between rare germline putative loss-of-function (pLoF) variants and individual NMD efficiency. We used either TCGA or GTex as discovery cohorts with the other serving as validation, and vice versa. The analysis spans 18 matched cancer-normal tissues, three sets of pLoF variants, and employs two distinct iNMDeff methods for gene-level associations. iNMDeff values were randomized as a control measure. B Number of replicated gene-tissue pairs across various tissues against the range of FDR thresholds from 1 to 10%. C shows replicated gene-tissue pairs at a 2% FDR threshold, categorized by pLoF variant set, NMD method (ASE or ETG), and the cohort where the association was identified (either discovery or validation). The values indicate the total count of individuals harboring a rare pLoF variant within the successfully replicated genes. Replication did not require the use of the same iNMDeff method as in the discovery cohort. For instance, for LAMC1 association, it was discovered in normal brain tissues (GTex) using ETG method and was further validated in brain tumors (TCGA) using the ASE method. D Gene burden test for the association between rare pLoFs within the KDM6B gene and iNMDeff. Effect sizes (beta coefficient) from linear model associations are shown for cancer types (TCGA, left panel) or normal tissues (GTex, right panel), segmented by NMD method, including the 95% confidence intervals. An asterisk highlights statistically significant associations of the original SKAT-O analysis at FDR < 5% (black) or FDR < 20% (red). Tissues with no values did not have samples with rare pLoFs in the gene

We defined three stringency thresholds for determining putative loss-of-function (pLoF) for the rare germline variants, i.e. those with a MAF < 0.1%. One threshold involved only NMD-triggering PTC variants, and the other two additionally included the predicted deleterious missense variants using CADD scores [73] at two different cutoffs: ≥ 25 (more stringent) and ≥ 15 (permissive).

We tested all genes in which at least 2 individuals carried a rare pLoF variant, for the three pLoF stringency levels, using the two iNMDeff estimation methods, ASE and ETG, and correcting by potential confounder variables (see Methods). Testing was performed in each cohort, TCGA and GTex, contributing cancer and normal tissue, respectively. Significant associations identified in one cohort (TCGA or GTex) were subsequently re-tested requiring replication in the other cohort (GTex or TCGA, respectively), relying on pairing 18 TCGA tumor types to corresponding GTex normal tissues (Additional File 2: Table S5). Replication did not require the use of the same iNMDeff method as in the discovery cohort, thus a significant TCGA association detected using the ASE iNMDeff method could be replicated in GTex using either ASE or ETG iNMDeff methods.

A total of 850,721 tests associating iNMDeff with rare germline variants were conducted in GTex and 221,613 in TCGA, after excluding dataset-tissue pairs to prevent inflation (lambda ≥ 1.5). The number of tests in GTex was higher due to multiple tissue samples analyzed per individual, allowing separate testing of the same gene across various tissues. Across the three variant sets, two NMD methods, and two cohorts, considering all directions of validation, we identified 48 tissue-specific significant associations (nominal FDR ≤ 2%) that were successfully replicated. This is significantly more compared to the 24 significant associations found by randomizing the iNMDeff estimates (p = 7e − 3, Additional File 1: Fig. S23A), supporting the presence of non-random associations between iNMDeff and germline rare variants.

Out of these replicated associations, 27 were discovered in GTex and validated in TCGA. These hits span 10 unique individual gene-tissue type pairs clustered in 2 matched-tissues: thyroid (4) and brain (6) (Fig. 5B,C). Expanding the nominal FDR threshold to a more permissive 10% revealed additional replicated genes (List in Additional File 2: Table S14) across four more tissues: blood (1), colon (1), muscle (1), and skin (2), suggesting that future analyses with larger sample sizes may identify additional NMD-modulating genes and that NMD-modulating variants may manifest in various tissues (Fig. 5B).

We further calibrated these nominal FDRs by comparing against a distribution of FDRs obtained from running SKAT-O on a randomization of the iNMDeff estimates. This resulted in estimates of empirical FDR (Additional File 1: Fig. S23A), where the nominal FDR of 2% corresponded to a conservatively estimated empirical FDR threshold of ~30%, calculated as 3 replicated hits from randomized iNMDeff data, versus 10 hits obtained from our observed iNMDeff values. The 10 gene-tissue pair associations that met our criteria are detailed in Fig. 5C, categorized by pLoF stringency variant set and NMD estimation method. The same table for an FDR of 5% is also available (Additional File 1: Fig. S23B).

To assess the effect size of the associations for 10 replicated genes and determine their direction of effect on iNMDeff, we utilized a gene-burden test through general linear regression (see Methods) to complement the SKAT-O test results. We conducted this estimation separately for each NMD method in both TCGA and GTex, across all tissues (Additional File 1: Fig. S24A-B). Among the genes studied, the chromatin modifier KDM6B stood out in that it demonstrated a consistently significant positive effect size in the thyroid (0.46, SKAT-O FDR = 2%, in ASE), our GTex-discovery tissue, and in the TCGA-matched cancer type THCA (0.98, SKAT-O FDR = 0.5%, in ASE) (Fig. 5D and Additional File 1: Fig. S24A-B). The effect sizes in the ETG method were consistent with the ASE method (0.07 in GTex thyroid, 0.98 in TCGA THCA). The coefficient direction is positive, indicating that individuals with a pLoF variant in KDM6B have a higher iNMDeff than those without (Additional File 1: Fig. S25). Moreover, a positive effect size is also observed in the hypothalamus brain region (BRNHPT in GTex: 2.25 in ASE; 0.63 in ETG), and in the adipose subcutaneous tissue (ADPSBQ: 0.32 in ASE; 0.14 in ETG), with the ASE SKAT-O FDR = 0.55% and 17%, for brain and adipose, respectively. The other tissues and cancer types, while not significant with SKAT-O, usually exhibited a similar positive trend (Additional File 1: Fig. S24A-B), suggesting this KDM6B association with NMD efficiency may not be thyroid-specific but likely impacts other tissues as well.

From the 6 brain-identified replicated genes (Fig. 5C), the consistency of effect sizes (pLoF burden) across the two cohorts (GTex tissues and TCGA tumors) was observed only for COL19A1 (see ASE in Additional File 1: Fig. S24 and S25). For NUP153, consistent effects were observed across various brain tissue samples within GTex but did not extend to TCGA brain tumors, possibly due to normal-cancer tissue differences and/or due to that glioma brain cancers originate from glia rather than neurons (Additional File 1: Fig. S24 and S25). Overall, our analysis provides evidence that deleterious germline variants in diverse genes may alter NMD efficiency in individuals, with the histone demethylase KDM6B being a prime candidate for follow-up.

NMD efficiency impacts somatic selection and cancer patient survival

Previous research [11, 45] reported that somatic nonsense mutations (producing PTCs) in tumor suppressor genes (TSGs) are positively selected specifically in NMD-triggering gene segments. We investigated whether the global NMD efficiency of an individual influences the selection of these mutations in tumors, using a dN/dS analysis across TCGA tumor samples (see Methods).

For TSGs, we first checked missense mutations, which normally do not trigger NMD. There was no significant difference in driver gene selection between high and low iNMDeff individual groups, defined either by the ETG method (Fig. 6A, right panel) or the ASE method (Additional File 1: Fig. S26A, right panel) for iNMDeff (missense mutations located in NMD-triggering regions ETG dN/dS = 1.19 vs 1.20, p = 0.53, and NMD-evading regions dN/dS = 1.03 vs 1.10, p = 0.61, one-sided Wilcoxon paired tests). However, in the case of nonsense (PTCs) mutations, we observed a difference with contrasting directions: genes with NMD-triggering PTC variants exhibited a significant trend towards higher positive selection in tumor samples with high iNMDeff, whereas tumors with low iNMDeff demonstrated less positive selection (dN/dS = 1.64 vs 1.33, p = 4.7e − 02 for ETG iNMDeff, Fig. 6A, left panel), thereby suggesting that individual-level, global NMD efficiency can indeed shape selection on cancer genes (consistent trend in ASE iNMDeff: p = 0.13, Additional File 1: Fig. S26A, left panel). We also observed increased positive selection for genes with NMD-evading PTC variants in tumors with low iNMDeff (Fig. 6A), as expected from selective advantages through NMD-independent mechanisms, likely via complete protein loss-of-function in driver genes or dominant-negative effects of some PTCs [74, 75].

Fig. 6
figure 6

Individual-level NMD efficiency modulates somatic selection in tumor suppressor genes and impacts progression-free survival (PFS) and response to immunotherapy. A The dN/dS ratios (via dNdScv method) of tumor suppressor genes (TSGs) for both NMD-triggering and NMD-evading nonsense (left panel) and missense (right panel) somatic mutations are compared between two groups of individuals with high and low iNMDeff, as determined by the median ETG iNMDeff. Statistical significance is assessed using one-sided Wilcoxon tests on paired data. PTCs were classified as NMD-triggering if they met all criteria: (1) > 250nt from Transcription Start Site (TSS), (2) in exons ≤ 500nt, and (3) not in the last exon or last 55nt of the penultimate exon. Conversely, PTCs were classified as NMD-evading if they met any criteria: (1) ≤ 250nt from TSS, (2) in exons ≥ 1000nt, or (3) in the last exon or last 55nt of the penultimate exon. B Proportion of CD8 + cytotoxic T-cell infiltration in TCGA cohort patients for melanoma (SKCM), pan-kidney cancers (KIRC and KIRP), and breast cancer (BRCA), treated with either immunotherapy or chemotherapy. Immune infiltration was inferred from RNA-seq data using the quanTIseq TIL10 signature from The Cancer Immunome Atlas (TCIA). Tumor samples were classified as high or low NMD efficiency based on top 30% vs bottom 30% ETG iNMDeff estimates, respectively. P-values were calculated using a one-sided Mann–Whitney U test. C–E Kaplan–Meier (KM) survival curves comparing PFS outcomes of the top 20% of individuals with the highest iNMDeff against the lowest 20% in TCGA SKCM. The analysis is divided into three patient categories: no treatment data available (B), those treated with immunotherapy at least once (C), and those treated exclusively with chemotherapy, excluding immunotherapy patients (D). Log-rank test p-values quantify the statistical significance of the survival differences. F,H Validation of the KM curves for PFS in immunotherapy treated patient cohorts from Liu et al. [76] (SKCM, E), Carrol et al. 2023 (EAC—esophageal adenocarcinoma, F), and Motzer et al. 2020 (RCC—renal cell carcinoma, G). I Performance of logistic regression models predicting immunotherapy response based on: (i) iNMDeff alone; (ii) TMB alone; (iii) a combination of iNMDeff, TMB, and their interaction term iNMDeff*TMB. iNMDeff was derived either from the ASE (left panel) or ETG (right panel) methods. Each model’s predictive power is indicated by the area under the curve (AUC) on the Y-axis, with the dataset and sample size (n) specified on the X-axis. A black dashed Line at an AUC of 0.5 represents the threshold for random prediction accuracy and an asterisk denotes when the AUC is significantly different from it at a p-value < 0.05 via one-side DeLong test

We identified strong iNMDeff-associated PTC selection in some individual genes particularly SMAD4 and TP53 and some degree of effect on various other genes (see Additional File 3: Text S6, Additional File 1: Fig. 26B-C and Additional File 2: Table S15 for details). Additionally, we noted tentative associations with higher mutation burden cancer types, including MSI, POLE-mutant, and lung cancers (details in Additional File 3: Text S6).

NMD efficiency is associated with overall survival of cancer patients

In light of our previous findings that PTC-level NMD efficiency can modulate phenotype severity of genetic diseases [11], we explored, here focusing on cancer, whether individual-level NMD efficiency influences cancer survival outcomes. We considered Kaplan–Meier (KM) curves and Cox proportional hazards regression models with overall survival (OS) as our outcome, binarized into “High” and “Low” groups at varying thresholds, across 47 various TCGA cancer types stratified into subtypes, including also pan-cancer, and employing randomized data as a negative control (see Methods).

After applying FDR correction to all 960 tests (480 FDR adjusted meta-p combining ETG and ASE estimates of iNMDeff), we identified 17 in 960 significant tests at a nominal FDR of 5%, compared to 23 of the 9600 randomized tests, yielding an empirical FDR of 16% by randomization (results for individual percentiles given in Additional File 2: Table S16). The significant results spanned four cancer types: SKCM (median FDR meta-p = 0.78%, median hazard ratio (HR) = 1.61), ESCA_ac (FDR = 3.2%, HR = 2.22), PRAD (FDR = 2.8%, HR = 0.11), and LUAD (FDR = 0.75%, HR = 3.49). Of note, in ESCA_ac and LUAD, we were unable to calculate the Cox model for ASE at due to insufficient data; thus, the reported meta-p reflects only the ETG method. In PRAD, higher iNMDeff correlates with increased survival (Additional File 1: Fig. S27A-B). Roles of NMD in carcinogenesis are two-fold, where NMD can boost some tumor-promoting but also some tumor-suppressive mechanisms [30, 77]. We suggest that the balance tips towards the latter option for prostate cancer, where a more active NMD implies a less aggressive tumor. In contrast to the lowly mutated PRAD, in the often high mutation burden SKCM (Additional File 1: Fig. S27C-D), and also ESCA_ac (Additional File 1: Fig. S27E-F) and LUAD (Additional File 1: Fig. S27G-H), higher tumor iNMDeff is associated with reduced patient survival. This finding aligns with proposed approaches to potentiate immune response to tumors by NMD modulation [1, 46, 78, 79]; we note a certain number of SKCM samples in TCGA were from patients who received immunotherapy (73 out of 469; however only 5 LUAD samples in TCGA, and none from ESCA_ac and PRAD, were treated with immunotherapy). We posit that patients with lower iNMDeff might experience better treatment outcomes, as higher NMD efficiency should result in the silencing of frameshifted and/or truncated neoantigenic peptides, relevant to the efficacy of immunotherapy. Consistent with this notion, when assessing the data describing CD8 + cytotoxic T-cell infiltration (inferred from RNA-Seq in the TCGA cohort patients with treatment data), we noted positive trends towards higher CD8 + infiltration in those tumors having lower NMD efficiency. This was observed in cancer types commonly considered immunogenic—melanoma and kidney cancers (KIRC + KIRP)—and interestingly also in breast cancer (Fig. 6B). These findings were further validated in the independent Hartwig Medical Foundation cohort, where we observed significant increases in content (estimated via RNA-seq levels) of multiple CD8 + T-cell subsets in the lower iNMDeff tumors, across breast, kidney, lung, and skin cancers (Additional File 1: Fig. S28).

NMD efficiency predicts progression-free survival in treated patients

As another test of the impact of NMD efficiency on patient outcomes, we stratified patients based on the type of treatment and focused on progression-free survival (PFS), considering treatments separately and comparing the top 20% of individuals with the highest iNMDeff against the lowest 20%. For SKCM patients lacking available treatment data (Fig. 6C and Additional File 1: Fig. S29A for ASE using median iNMDeff), KM curves showed no significant difference (log-rank test) in PFS between the high and low iNMDeff groups (ETG p = 0.73, ASE p = 0.77).

However, a difference emerged among patients who received immunotherapy (along with a maximum of one chemotherapy treatment; Fig. 6D and Additional File 1: Fig. S29B for ASE), noting reduced PFS in the high iNMDeff group compared to the low iNMDeff group (ETG p = 9.2e − 2 and ASE non-significant). The same trend was observed when discretizing at different thresholds (percentiles: 20 to 30, HR = 1.77–2.02, FDR = 32%). The borderline significance here might be attributed to the small number of SKCM patients treated with immunotherapy (n = 73). The TCGA pan-cancer analysis (includes all immunotherapy treated patients, n = 184) revealed a trend at the 20–25th percentiles (HR = 1.31–1.88, FDR = 19%) (Additional File 2: Table S17).

Next, we asked if this effect of NMD efficiency on survival is particular to immunotherapy, or it extends also to other types of therapies. Among SKCM patients treated with chemotherapy, but not immunotherapy (Fig. 6E, Additional File 1: Fig. S29C), a similar but stronger trend was observed (ETG p = 0.17 and ASE p = 0.16; ETG HR = 2.81, p = 5.1e-2). Efficient NMD is more strongly associated with poorer tumor response to chemotherapies compared to immunotherapies in melanoma.

We next asked if other cancer types apart from SKCM show associations between iNMDeff and response to chemotherapy in terms of PFS. Again, from within the patients receiving chemotherapy, we compared the high iNMDeff against the low iNMDeff group. Indeed, we identified four additional cancer types at an FDR-adjusted meta-p < 10% (results for individual percentiles given in Additional File 2: Table S18): pancreatic adenocarcinoma (PAAD, median HR = 2.16, median FDR = 3.8%), LGG (HR = 1.57, FDR = 4.5%), bladder cancer (BLCA_Basal_scc subtype, ETG: HR = 3.99, FDR = 3.4%), and ESCA_scc (ETG: HR = 2.5, FDR = 0.047%). No other cancer types were found with an opposite-direction trend for the same FDR threshold. The positive HR, highlighting the worse PFS in chemotherapy-treated patients whose tumors have higher NMD efficiency, extends to other cancer types. At a more relaxed FDR threshold of 25%, caution must be taken in interpreting hits as we identified additional cancer types, some with an opposite direction (i.e. better PFS, Additional File 2: Table S18).

Association of iNMDeff with immunotherapy response in diverse cancer types

To validate the association of NMD efficiency with immunotherapy response, we analyzed independent datasets that had both RNA-seq and WGS/WES available: (i) Liu et al. [76], comprising 144 melanoma patients treated with anti-PD-1 immune checkpoint inhibitors (ICI); ii) Carroll et al. [80], involving 35 esophageal adenocarcinoma (EAC) patients who underwent first-line immunochemotherapy followed by chemotherapy; (iii) Motzer et al. [81], consisting of 886 patients with advanced renal cell carcinoma (RCC) treated with anti-PD-L1 or anti-PD-1. We applied our abovementioned iNMDeff proxy model (originally developed to predict cell-line NMD efficiency using global gene-level expression, see Methods) to estimate iNMDeff for each patient in these cohorts, yielding iNMDeff for 120 patients in Liu-SKCM dataset, 33 in Carroll-EAC dataset, and 353 in Motzer-RCC dataset.

The KM curves for PFS (Fig. 6F–H and Additional File 1: Fig. S28D-F) demonstrated a consistent trend across the three immunotherapy studies for iNMDeff. In specific, this trend was evident in melanoma (Fig. 6F and Additional File 1: Fig. S29D; ETG p = 0.16 at 20th percentile, ASE p = 0.21, at 40th percentile), esophagus (Fig. 6G and Additional File 1: Fig. S29E; discretized at percentile 25th, ETG p = 0.27, ASE p = 0.7), and renal cancer (Fig. 6H and Additional File 1: Fig. S29F; ETG p = 0.17 at the 30th percentile, ASE p = 0.15 at the percentile 45th). The combined meta-analysis p-value (by Fisher’s method) from the 3 studies would be p = 0.13 and p = 0.27, for ETG and ASE, respectively. When analyzing the data using the Cox proportional hazards model, including covariates such as sex, age, and TMB (see Methods) (Additional File 2: Table S19), the HR from all three studies trended in the same direction for both iNMDeff estimation methods in Liu-SKCM (ETG: HR = 1.68; ASE: HR = 1.37) and Motzer-RCC (1.29 and 1.23), and for the ETG method in Carroll-EAC (1.50; not supported in ASE).

Tumor mutation burden (TMB) is a well-established predictor of immunotherapy response, likely due to its association with neo-antigen content. In a predictive model, we combined TMB with iNMDeff, aiming to classify immunotherapy responses into responders versus non-responders (see Methods). This model was applied across the previous datasets TCGA-SKCM, Carrol-EAC, and Liu-SKCM, excluding Motzer-RCC due to missing TMB and treatment response data. Additionally, we included new datasets of immunotherapy-treated patients from Hartwig Medical Foundation[82] (urinary tract, n = 23, lung, n = 26 and melanoma, n = 119) and Riaz et al. (melanoma, n = 48) [83], totaling seven datasets for this analysis.

The TMB model’s predictive accuracy was significantly enhanced by including iNMDeff, showing a 7.4% mean increase (for both ASE and ETG) in correctly classified patients compared to a TMB-only model, averaging across the datasets (Fig. 6I and Additional File 2: Table. S20). The area under the precision-recall curve (AUC) increased notably, from 0.704 (TMB-only) to 0.778 (TMB + iNMDeff), averaged across the datasets. Notably, this improvement manifested consistently but at variable strength across the different datasets: 1–9% in Riaz-SKCM, 11–27% in TCGA-SKCM, 1–2% in Hartwig-SKCM, 11–25% in Hartwig-Urinary, 3–5% in Hartwig-Lung and 1–10% in Carrol-EAC. Liu-SKCM did not exhibit improvement. Regarding other performance metrics, averaged across datasets: recall (sensitivity) increased from 0.89 to a range of 0.85–0.91 (for iNMDeff by ETG and ASE, respectively), and precision improved from 0.72 (TMB-only) to a range of 0.74–0.78 by including iNMDeff.

In conclusion, individual-level NMD efficiency can impact cancer patient survival: there was a trend of reduced PFS in immunotherapy-treated patients with high NMD efficiency, observed consistently across 3 cancer types. Furthermore, including iNMDeff improves predictive models of immunotherapy response drawing on TMB. For context, we note there is a strong prior that a globally more efficient NMD in a patient would hinder immunotherapy, based on recent work associating NMD inactivity upon individual indel mutations in a tumor and immunotherapy response [11, 44].

Continue Reading