Ethical regulations
This study was conducted under approved UKB application no. 56844. Clinical samples were obtained with informed written consent from the Cambridge Blood and Stem Cell Biobank with approval by the Cambridge East Research Ethics Committee (REC) (REC 18/EE/0199 and 24/EE/0116), from the SardiNIA longitudinal study of immune senescence (REC 15/EE/0327) with approval by the East of England (Essex) REC, or from the Manchester Cancer Research Centre Biobank with approval by the South Manchester REC (REC 07/H1003/161+5; HTA license 30004).
Statistics and reproducibility
In this project, we included 454,340 UKB participants with somatic variant call data from our previous study19. From this group, participants who had withdrawn consent, had a mismatch between genetic and self-reported sex or had differences in the dates of attending the assessment center and the blood sample collection, were excluded from the study resulting in n = 454,098 participants. Power calculations were conducted to determine the minimum number of cases for inclusion (‘Mutation Calling’). These analyses were not randomized, and the investigators were not blinded to allocation during experiments and outcome assessment.
Mutation calling
Mutations in 41 CH driver genes (Supplementary Table 13) were called using Mutect2 GATK v.4.1.3.0 from whole-exome sequencing (WES) data of peripheral blood DNA from 454,340 UKB participants and filtered as described previously19 (Methods). A specific VAF cutoff was not used to define participants with CH. Mutations in DNMT3A at the hotspot R882 were grouped as ‘DNMT3A_R882’ and the rest as ‘DNMT3A_other.’ U2AF1 mutations were identified using Samtools mpileup (v.1.15.1) and the variants with at least three alternate allele reads and a VAF ≥ 0.05 were included in the analysis. Participants who were diagnosed with hematological malignancy before recruitment were removed from all analyses involving LTL. Participants harboring mutations in several genes, or mutations in less frequently mutated genes (<100 cases), were excluded from LTL and LTL-PRS analyses. The threshold of 100 cases was chosen following power calculations performed using the ‘samplesizelogisticcasecontrol’ package (v.2.0.2) in R (Methods). An exception to this threshold was made for mutations in the splicing factor gene U2AF1 (n = 82) in light of its recently reported association with CH in TBD23. We also excluded ATM, BRCC3 and STAT3 from downstream analysis as these are not widely recognized as drivers of myeloid CH54. Somatic mutations in the ‘All of Us’ cohort55 were identified as described previously56.
Mosaic chromosomal alterations
mCA calls were obtained from Loh et al.57. Before analyses, participants carrying several mCAs or any CH driver gene mutations or mCAs of unknown copy number change/cell fraction were filtered out. Based on the chromosome and the type of copy number change, mCAs were grouped into autosomal mCAs (any type of copy number change), LOX and LOY.
PRS calculation
We used PRSice-2 (v.2.3.5) to compute PRS associated with telomere length based on the 131 SNPs identified in the GWAS by Codd et al.24 with beta coefficients from the same study serving as weights in the PRS computation. Imputed genotypes available in the UKB were used for this analysis. Calculated PRS were Z-normalized. Participants with a prevalent hematological diagnosis were not excluded for LTL-PRS analyses, as those people would have developed CH at some stage before development of malignancy.
Myeloid malignancy phenotypes
UKB participants with a prevalent diagnosis of hematological malignancy were defined using ICD codes (Supplementary Table 14). If a participant had several myeloid neoplasms, only the first diagnosed disease was considered for analysis. People who had chemotherapy before diagnosing myeloid malignancies were excluded from the association analysis with LTL and LTL-PRS.
Regression analyses
All linear and logistic regression analyses were performed using the Python (v.3.9.7) module statmodels (v.0.12.2). First, the association between the presence of a CH mutation and LTL was investigated using a linear regression model on LTL with binary predictor variables representing presence/absence (1/0) of mutations in each of the CH driver genes and covariates. For quantifying the variation in telomere length with respect to VAF, a linear regression model for predicting telomere length was built with the variables (Gene + Gene VAF)for all genes where (Gene + Gene VAF)for all genes = DNMT3A + DNMT3A VAF + ASXL1 + ASXL1 VAF + TET2 + TET2 VAF and so on for all genes and covariates. DNMT3A, ASXL1 and so on are variables that represent whether a mutation is present (1) or not (0) in the specific gene. The covariates used were sex, age, smoking status, genetic principal components from one to ten, white blood cell counts and percentages of types of white blood cell. Blood-count-related parameters were winsorized to 99% before regression. Similar analysis was performed for mCAs using cell fraction instead of VAF. Correction for multiple testing was performed using the Benjamini–Hochberg procedure and applying a threshold of FDR < 0.05.
Logistic regression analyses were performed to quantify the association between polygenic risk scores and CH/mCA. Age, sex, smoking status and first ten genetic principal components were used as the covariates in the regression. Correction for multiple testing was performed using the Benjamini–Hochberg procedure and applying a threshold of FDR < 0.05.
Mendelian randomization
The same set of variants as used in PRS calculation were employed as genetic instruments in the MR analyses to identify causal associations between telomere length and various types of CH. Coefficients quantifying the association between each of the genetic instruments and each of CH types were obtained by Firth’s logistic regression analysis performed using the logistf function in R (v.4.2.1). MR analyses were performed using the TwoSampleMR package (v.0.5.7) in R (v.4.3.0) using these coefficients along with the coefficient estimates for association between genetic instruments and telomere length from Codd et al.24 and the results were reported for the inverse-variance-weighted method. Correction for multiple testing was performed using the Benjamini–Hochberg procedure and applying a threshold of FDR < 0.05.
Analysis of TERTp mutations in the UKB
TERT promoter mutations we identified from WGS of blood DNA from 488,364 UKB participants as this region is not captured adequately by the UKB WES panel. We used samtools mpileup (v.1.15.1) to identify single nucleotide variants (SNVs) across the entire TERT promoter (chr5:129489–1295157) with high sensitivity and then applied several manual filters (depth ≥ 15 bp, at least three supporting reads, VAF ≥ 30%). This approach was used in place of somatic variant calling pipelines due to the low depth of WGS across the promoter (median 34×). We then focused our subsequent analysis on three mutational hotspots identified previously as somatic rescue mutations in TBD (chr5:1295046:T:G, chr5:1295113:G:A and chr5:1295135:G:A)22,23. To benchmark our approach for calling TERTp hotspot mutations from WGS data, we used the same approach to call hotspot mutations in SF3B1 (R625, K666 and K700) and SRSF2 (P95) from WGS, filtered them as described above, and compared their age-related prevalence to TERTp-CH, as well as SF3B1/SRSF2-CH identified from WES. A detailed outline of the approach used to call TERTp mutations and subsequent benchmarking is contained in Supplementary Note 2.
Construction of phylogenetic trees from WGS of hematopoietic cell colonies
We analyzed data from a man aged 83.8 years with SF-CH detected in blood DNA (PD34493: U2AF1-Q157R 10.3%, SF3B1-K666N 8.7%, NOTCH1-L441L 0.3%), studied previously by phylogenetic analysis using WGS of single-HSPC-derived colonies7. Specifically, for this study, we also performed colony WGS and phylogenetic analyses on samples from a woman aged 73.9 years with SF-CH (PD41082: TET2-Q1825X 33.8%, SF3B1-K666N 7.1%, TET2-S315fs 3.2%, GNB1-K57E 1.5%, TET2-L1322Q 1.3%, TET2-H435fs 1.2%, TET2-Q1274R 1.1%, TET2-Q1542X 0.8%). Both were participants in the SardiNIA study58 and were studied because they harbored SF-CH with sizeable clones7. Ninety-six colonies per person were picked from methylcellulose-based medium previously plated with peripheral blood mononuclear cells (PBMCs) and used for WGS as described previously7,51. To investigate trends in clonal expansion and telomere length over time, heterochronous peripheral blood samples were taken from a man with SF3B1-CCUS (PD48499) aged 50.2 years (n = 24 colonies, SF3B1-K700E 42.4% on clinical NGS of bone marrow DNA) and SF3B1-MDS at age of 53.8 years (n = 72 colonies, SF3B1-K700E 42.8% on clinical NGS of bone marrow DNA). This man was selected because of the presence of SF-CH and availability of longitudinal blood samples.
Phylogenetic relationships were derived from colony WGS data as described previously7,59,60. Briefly, reads were aligned to the human reference genome (GRCh38) using BWA-MEM (https://github.com/lh3/bwa). Variant calling was performed using CaVEMAN61 (SNV) and Pindel62 (indels) against an in silico generated unmatched normal. Colonies with low sequencing depth (<6×) or low clonality (median VAF < 0.4) were removed from downstream analyses. Filtering was performed to remove germline variants and artefacts arising from low DNA input, using pooled information across per-person colonies as outlined previously59,60. For all mutations passing quality filters in at least one colony, matrices were generated of mutant and normal reads at each site for every colony from the same person, using vafCorrect (https://github.com/cancerit/vafCorrect) to correct for reference bias arising during alignment of reads containing indels. Genotype matrices of SNVs were used as input to MPBoot63 to infer the phylogenetic relationships between colonies using a maximum parsimony approach with bootstrap approximation. The treeMut package (https://github.com/nangalialab/treemut) was then used to assign mutations (SNVs and indels) to branches and estimate branch lengths. To convert the x axis of each phylogenetic tree from number of mutations to chronological age, where the tips of the tree are the age of the person at sampling, we used the package Rtreefit59 (https://github.com/nangalialab/rtreefit) to scale branch lengths, accounting for differences in mutation rate across the human lifespan and intersample variation in the sensitivity of detecting somatic variants.
Telomere length estimation from WGS data
Telomere length estimates were estimated from the NovaSeq-sequenced colony WGS data described above using Telomerecat64. Novaseq’s two-dye technology interprets the absence of signal from a failed cluster as a run of ‘G’ base calls that can confound Telomerecat due to its resemblance to the telomere sequence (TTAGGG). The likelihood of cluster failure increases with read length; hence, we ran Telomerecat with the ‘-trim 75’ argument to estimate telomere lengths from the first 75 bp of each read and avoid the higher error regions towards the end of the read. Phylogenetic trees were then annotated with telomere length estimates using the ggtree (v.3.8.2) package in R65.
Pairwise comparison of telomere length estimates were performed using the Wilcoxon rank sum test. Alongside this, we also fitted a linear mixed effects model using the lme4 package (v.1.1) in R66 to model colony telomere length with sequencing batch as a random effect and genotype and age as fixed effects (Supplementary Note 1):
$${rm{Colony}; telomere; length} sim {rm{Age}}+{rm{Genotype}}+(1{|rm{Batch}})$$
This model was fitted on all colonies passing filters and included in the final phylogenetic trees (n = 248). To test the hypothesis that genotype (splicing factor driver mutation/other driver mutation/driverless) is associated with colony telomere length at a cohort level, we compared linear mixed effects models with and without genotype as a fixed effect and compared both models using one-way analysis of variance (ANOVA). Confidence intervals (CIs) for fixed effect coefficients were estimated using bootstrap resampling with 10,000 resamples and calculating the 95% CI for each coefficient based on the first 5,000 converged models.
Cell-line culture
K562 were cultured in IMDM (Gibco, cat no. 12440053) supplemented with 10% FBS (Gibco, catalogue number SH30071.03), 2 mM l-glutamine and 1% penicillin/streptomycin. OCI-AML2 were cultured in α-MEM (Gibco, catalogue number 12571063) supplemented with 20% FBS, 2 mM l-glutamine and 1% penicillin/streptomycin. HEK293FT were cultured in DMEM (Gibco, catalogue number 11960085) supplemented with 10% FBS, 2 mM l-glutamine and 1% penicillin/streptomycin and passaged using trypsin. Cells were maintained at 37 °C and 5% CO2 in a humidified incubator and passaged every 2–3 days. Cas9-expressing cell lines were generated using lentivirus generated from pKLV2-EF1aBsd2ACas9-W plasmid (Addgene, catalogue number 67978) as described below.
Lentivirus generation and transduction
Tissue culture plates (15 cm2) were coated in 0.1% gelatin for 37 °C for 30 min. Plates were washed with PBS (Sigma, catalogue number D8537-500) and seeded with 8 × 106 HEK293FT cells. Vector plasmid (7.5 μg) was mixed with 18.5 μg of pPAX2 (Addgene, catalogue number 12260), 4 μg of pMD2.G (Addgene, catalogue number 12259), 30 μl of PLUS reagent and 7.5 ml of Opti-MEM (Gibco, catalogue number 51985026) and incubated at room temperature for 5 min. Lipofectamine LTX (180 μl; Invitrogen, cat no. 15338030) was added, and the mixture was incubated at room temperature for an additional 30 min. After this, the transfection mixture was added dropwise to cells followed by 20 ml of HEK293FT medium (prepared as above) and placed in a humidified incubator overnight. Medium was changed the following morning. On day 2, viral supernatant was filtered with 0.45 μM low-protein binding filter (Nalgene, catalogue number 190-2545), mixed with Lenti-X (Takara Bio, catalogue number 631232) and kept at 4 °C overnight. Viral supernatant was then spun at 1,500g for 45 mins at 4 °C and the pellet was resuspended in 300 μl of ice-cold PBS.
Concentrated virus (15 μl) was added to 1 × 105 cells in 1 ml of medium supplemented with 6.7 μg ml−1 polybrene. Cells were centrifuged at 870g and 37 °C for 1 h and returned to the incubator. Following 2 days in culture, transduced cells were selected by supplementing medium with 10 μg ml−1 blasticidin or 1 μg ml−1 puromycin for 5 days.
TERT knockout and validation
Two gRNAs targeting TERT exon 2 (Supplementary Table 15) were cloned into the pKLV2-U6gRNA5(BbsI)-PGKpuro2ABFP-W vector (Addgene, catalogue number 67974) and lentivirus was generated and transduced as described above. Transduced cells were selected using 1 μg ml−1 puromycin and maintained in culture for a total of 14 days. Cells (1 × 106) cells were transferred to a 1.5 ml tube and centrifuged at 300g for 5 min and supernatant was discarded. Genomic DNA was extracted from the pellet using the DNeasy Blood and Tissue Kit (Qiagen, catalogue number 69504). DNA was quantified and diluted in UltraPure DNase/RNase-Free Distilled Water (Invitrogen, catalogue number 11538646). TERT gRNA activity was validated using PCR with primers spanning the region of interest (Supplementary Table 15) followed by Sanger sequencing.
PCR was performed on 1 ng of diluted gDNA using HiFi HotStart ReadyMix (Kapa, catalogue number 07958927001) and primers spanning the TERT region of interest using the following reaction conditions: 95 °C for 3 min, 35 cycles of (98 °C for 20 s, 60 °C for 15 s, 72 °C for 30 s) and 72 °C for 5 min. PCR product was purified using QIA quick PCR Purification Kit (Qiagen, catalogue number 28104) and submitted for Sanger sequencing with the forward primer using GeneWiz. Sequencing traces were analyzed in SnapGene to confirm TERT gRNA activity.
Clinical samples
Peripheral blood was collected into lithium heparin tubes (Sarstedt, catalogue number 02.1065.001) and bone marrow aspirate was collected in RPMI (Gibco, catalogue number 21875034) supplemented with 1% penicillin/streptomycin and 10 IU ml−1 sodium heparin (Merck, catalogue number H3149-10KU). Samples were processed using Ficoll (Merck, catalogue number GE17-1440-02) and/or PharmLyse (catalogue number BD 555899) to isolate MNCs, leukocytes or granulocytes. Cells were used immediately in experiments or cryopreserved in FBS supplemented with 50% human serum albumin and 10% dimethylsulfoxide and stored for future use.
Colony-derived WGS
Samples were plated to form colonies and prepared for WGS as described previously7,51. Briefly, peripheral blood or bone marrow MNCs were plated at 3 × 106cells ml−1 in MethoCult H4034 (Stemcell Technologies, catalogue number 04034) and cultured in a humidified incubator at 37 °C and 5% CO2 for 14 days. Colonies were picked and resuspended in RLT (Qiagen, catalogue number 79216). Libraries were prepared using a low-input pipeline and 150 bp paired-end sequencing was performed on a NovaSeq 6000 at 15× coverage.
DNA extraction and quantification
For cell lines, genomic DNA was isolated using DNeasy Blood and Tissue Kit and quantified using the Qubit dsDNA HS Kit (Invitrogen, catalogue number Q32851). Telomere qPCR (Supplementary Methods) on colonies lysed in RLT was attempted but yielded poor and inconsistent results, particularly at higher RLT concentrations and low DNA input (Supplementary Fig. 2). Instead, cells were plated as described above and picked after 14 days into 17 μl of PicoPure (Applied Biosystems, catalogue number KIT0103) buffer supplemented with Proteinase K according to the manufacturer’s instructions, lysing the cells. Lysate was placed in a thermocycler under the following conditions: 65 °C for 6 h, 75 °C for 30 min, 4 °C hold. Volume was made up to 50 μl with UltraPure H2O and DNA was quantified using the Quant-iT PicoGreen dsDNA Assay Kit (Invitrogen, catalogue number P7589).
Flow-FISH
Cryopreserved cells were thawed and washed twice in warmed RPMI supplemented with 10% FBS. Cells were centrifuged at 300g for 5 min and resuspended in FACS buffer (PBS supplemented with 0.1% BSA (Fisher, catalogue number BP9702-100)). Cells were counted and 1–3 × 106 cells were aliquoted into 1.5 ml tubes. Cells were centrifuged at 300g for 5 min and resuspended in 1 ml PBS containing 1:1,000 Fixable Viability Dye eFluor 780 (eBioscience, catalogue number 65-0865-14) and incubated at 4 °C in the dark for 20 min. Following this, cells were washed twice in FACS buffer. For the CLL sample only, cells were then centrifuged at 300g for 5 min and resuspended in FACS buffer supplemented with the following antibodies: 1:100 CD3-BUV395, 1:160 CD19-BV421, 1:160 CD11b-PE, 1:100 CD33-BV510, 1:50 CD5-FITC (Supplementary Table 16). Cells were incubated at 4 °C in the dark for 20 min and washed twice with FACS buffer and sorted as described below.
Following sorting (CLL sample) or viability staining (remaining samples), cells were centrifuged at 300g for 5 min at resuspended in 250 μl of hybridization buffer (70% formamide (Thermo Scientific, catalogue number 17899), 20 mM Tris (Thermo Scientific, catalogue number AM9850G) and 0.1% BSA in water) containing 0.3 μg ml−1 TelC-Alexa647 (PNA Bio F1013) and 0.3 μg ml−1 CENPB-Alexa488 (PNA Bio, catalogue number F3004) PNA probes which had been heated briefly at 55 °C for 5 min and vortexed before addition. Cells were heated at 80 °C for 10 min and incubated overnight at room temperature in the dark.
The following morning, cells were centrifuged at 300g for 7 min at 16 °C and resuspended gently in 1 ml of formamide wash buffer (70% formamide, 10 mM Tris, 0.1% Tween 20 (Sigma, catalogue number P1379) and 0.1% BSA in water). This step was repeated once more. After this, cells were centrifuged at 300g for 7 min at 16 °C and resuspended gently in 1 ml of PBS wash buffer (PBS supplemented with 0.1% Tween 20 and 0.1% BSA). Finally, cells were centrifuged at 300g for 5 min at 16 °C, resuspended in 500 ml of FACS buffer supplemented with 10 mg ml−1 RNase A (Invitrogen, catalogue number 12091021) and transferred to FACS tubes through a 40-μm cell strainer (Fisher, catalogue number 22363547). Cells were sorted using a BD FACSAria Fusion flow cytometer. For each sample, a small proportion of cells were analyzed to give the distribution of telomere lengths within that sample and then sorting gates were set by the specified percentile telomere length ranges.
DNA was extracted from sorted populations using PicoPure and quantified as described above. Purified DNA was prepared for Sanger sequencing (as described above, PCR annealing temperature optimized for each primer pair) alongside targeted amplicon sequencing (Supplementary Table 17; Supplementary Methods).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.