A stem cell differentiation model reveals two alternative fates in CBFA2T3::GLIS2-driven acute megakaryoblastic leukemia initiation

Human induced pluripotent stem cell culture and in vitro hematopoietic differentiation

A commercially available female hiPSC line was used (Gibco A18945; https://hpscreg.eu/cell-line/TMOi001-A), which has been episomally reprogrammed from normal cord blood. To reduce mutation rates, hiPSCs were routinely cultured under hypoxic conditions (37 °C, 3% O2, 5% CO2) on plates coated with hESC-qualified Matrigel (Corning) in TeSR E8 or mTeSR-Plus medium (STEMCELL Technologies). Cells were passaged every 3–4 days mostly at a split ratio of 1:6 using StemPro Accutase (Thermo Fisher Scientific), and the medium was exchanged every 1–2 days. Ten µM Rho-associated coiled-coil-containing protein kinase inhibitor Y-27632 (ROCK inhibitor; STEMCELL Technologies) was added for splitting, transfecting, freezing (CryoStor CS10 medium; Biolife Solutions), and thawing. Mycoplasm contamination was excluded by regular testing using a luminescent detection kit (Lonza MycoAlert). The STEMdiff Hematopoietic Kit (STEMCELL Technologies) was used for differentiation according to the manufacturer’s protocol (Document #29768 v1_2_0) with minor changes27,31,79.

Briefly, 2000 hiPSCs were seeded as clumps per well of a Matrigel-coated 12-well plate. Differentiation was started 3 days later under normoxic cell culture conditions with STEMdiff hematopoietic differentiation basal medium with supplement A (containing BMP4, FGF2, VEGFA) for the first day with 3 µM CHIR99021 glycogen synthase kinase 3 inhibitor (Sigma-Aldrich) and another 2 days without. Next, cells were cultured for 9–10 days in differentiation medium with supplement B (containing BMP4, FGF2, VEGFA, SCF, FLT3L, TPO) with half media changes every 2–3 days. Dimethylsulfoxide only (DMSO; 140 µM final concentration; Serva; CG-positive vehicle control) or for CG suppression dTAG-13 compound (100 nM final; kindly provided by Nathanael Gray, Dana-Farber Cancer Institute, Boston, MA, USA; CG-negative CBFA2T3 heterozygous isogenic control) was added from differentiation day 2 onwards. Suspension cells enriched for hematopoietic stem and progenitor cells (HSPCs) were harvested from supernatants on differentiation day 12 or 13 (on average 1.5 × 105 per well of a 12-well plate; Supplementary Fig. 4g) while the remaining adherent cell fraction was used for RT-qPCR, Western blotting, and immunofluorescence. Suspension cells were used for clonogenicity assays or further cultured in StemPro-34 serum-free medium (Gibco) containing 50 ng/ml SCF, 20 ng/ml IL3, 20 ng/ml IL6, 20 ng/ml GM-CSF, 20 ng/ml G-CSF, 20 ng/ml TPO, and 10 ng/ml EPO (all from PeproTech). Additional cell samples were collected on day 22 or 25 of differentiation or after long-term liquid culture (days 72–115). M-07e (ACC 104) and MOLT-4 (ACC 362) cell lines were obtained from the German Collection of Microorganisms and Cell Cultures GmbH (DSMZ) and cultured in RPMI 1640 GlutaMax containing 10% fetal bovine serum, 1x Penicillin/Streptomycin (all from Thermo Fisher Scientific), and M-07e additionally with 10 ng/ml IL3.

Genome editing

In trans paired nicking27,28,80,81 was used to introduce into the CBFA2T3 locus (exon 12 of transcript NM_005187.6) the GLIS2 coding part of exons 3–6 (Ensembl ENST00000433375.1; corresponding to exons 4–7 of GenBank transcript NM_032575.3) fused to C-terminal degradation (dTAG; 3xGGGGS linker plus FKBP1A F36V mutant) and tandem hemagglutinin epitope (2x HA) tags. In addition, subsequent internal ribosome entry site-driven monomeric Kusabira Orange 2 (IRES-mKO2) and floxed Pgk1 promoter-driven puromycin resistance (PuroR) cassettes were inserted. For the knock-in (floxed allele), 250 pmol of ribonucleoprotein complex consisting of crRNA targeting the last CBFA2T3 exon (protospacer sequence 5’-TGCAACGCGGCACGCTACTG-3’; oligonucleotide sequences listed in Table 1), tracrRNA and Streptococcus pyogenes Cas9 D10A recombinant protein (Alt-R reagents from Integrated DNA Technologies) were transfected along with 5 µg CG Nick2 knock-in donor vector (sequence in Supplementary Information) into 106 hiPSCs using Nucleofector 2b, program A-023 and Human Stem Cell Kit 2 (Lonza). Following puromycin selection (Sigma; two 0.5 µg/ml pulses for 1 day each), Cre recombinase was transiently expressed by transfection with pCAGGS-Cre vector82 to remove PuroR (resulting in the delta allele). Single clones were grown from the cell bulk on Synthemax II-SC (Corning) coated 10 cm diameter dishes in mTeSR-Plus medium containing 10% CloneR (STEMCELL Technologies), picked and split into two replicate 96-well plates. After expansion of the cells, crude cell lysates were prepared from one plate and used as templates for genotyping PCRs. Positive clones were expanded from the other replicate plate and further validated.

Table 1 Oligonucleotides used in this study

Magnetic live cell isolation and clonogenicity assays

Hematopoietic differentiation culture supernatants were harvested on day 12 or 13, filtered through a 70 µm strainer, and live cells were purified using the MACS dead cell removal kit (Miltenyi). After Trypan Blue exclusion cell counting in a Bürker-Türk chamber, 1000 HSPC-containing suspension cells were seeded into 6-well plates with 3.3 ml methylcellulose matrix (enriched MethoCult H4435; STEMCELL Technologies) containing cytokines (SCF, IL3, IL6, EPO, G-CSF, GM-CSF) and fetal bovine serum to determine myeloid clonogenicity. Colony-forming unit (CFU) numbers and types (-M: macrophage, -G: granulocyte, -E: erythrocytes, -GM: granulocyte/macrophage, -GEMM: granulocyte/erythrocyte/macrophage/megakaryocyte) and counts of harvested total live cells were determined after 14 days of semisolid culture. For megakaryocyte clonogenicity assays, 4000 suspension cells were seeded into a 2-well chamber slide with 1.6 ml MegaCult matrix (STEMCELL Technologies) containing cytokines (IL3, IL6, TPO) and collagen. After 10–12 days of semisolid culture, slides were dehydrated, fixed, and stained for CD41 MK marker (red) as well as cell nuclei (Evans blue). Colony number, type (CFU-Mk, non-Mk and mixed) and size were enumerated by two independent operators, each from randomly chosen 10% of the covered slide area and then averaged.

Genotyping

Genotyping PCRs for individual clones were performed using 500 nM specific primers, approximately 100 ng genomic DNA, and HotStarTaq DNA polymerase (Qiagen). DNA was isolated first by crude cell lysis in genotyping buffer (10 mM Tris-HCl pH 8.5, 50 mM KCl, 2 mM MgCl2, 0.45% Tween 20, 0.45% Nonidet P40 substitute, 1 mg/ml Proteinase K; all from Sigma-Aldrich) for 3 h at 65 °C and 10 min at 94 °C, and later, after cell expansion, with the QIAamp DNA Blood Mini Kit (Qiagen). PCR was performed with HotStarTaq, for GC-rich regions 20% Q-solution was added (Qiagen). For the 5′ flanking CG KI PCR CBFA2T3in11-F1 and GLIS2ex6-R1, for the 3’ delta CG KI PCR mKO2-mid-F1 and CBFA2T3ex12-R6, for the 3′ floxed CG KI PCR PuroRmidF2 and CBFA2T3ex12-R6, and for the CBFA2T3 wildtype (WT) PCR CBFA2T3in11-F1 and CBFA2T3ex12-R3 primers were used (oligonucleotide sequences are listed in Table 1). 5’ KI and WT PCR were also conducted simultaneously in a 3-primer reaction (CBFA2T3in11-F1, CBFA2T3ex12-R3 and GLIS2ex6-R1). DNA was size separated by electrophoresis on 1% agarose gels in 0.5x Tris/borate/EDTA buffer (uncropped gels in Supplementary Fig. 1 and Supplementary Data 7). PCR products were 1655 bp for the 5′, 1555 bp for the 3′ flanking region of the recombined and 1935 bp for the floxed knock-in allele, and 1319 bp for the WT allele. Products were purified using the Monarch kit (New England Biolabs), Sanger-sequenced with PCR primers (Microsynth) and aligned to the knock-in or WT alleles. Half of 96 screened clones (48) were positive in the 3’ delta KI PCR (Supplementary Fig. 1a). However, 11 of these clones were still positive for the 3’ floxed KI PCR (Supplementary Fig. 1b) suggestive of contamination with non-recombined CG cells, and another 10 clones could not be expanded from the replicate plate. A heterozygous single-nucleotide polymorphism (SNP; rs1239485675) upstream of the 5′ homology arm (5’HA) allowing for purity assessment of the knock-in and WT haplotypes, respectively, revealed that 24 further clones were contaminated with WT cells. The 3 remaining clones turned out to be pure and heterozygous (A2, A9, E5; Supplementary Fig. 2).

Reverse transcription quantitative PCR

Reverse transcription quantitative PCR was performed as described before27. Total RNA of hiPSCs and differentiated derivative cells was extracted using TRIzol reagent (Thermo Fisher Scientific) following the manufacturer’s protocol with glycogen as co-precipitant. Complementary DNA (cDNA) was synthesized from 2 µg total RNA using 500 ng each of random and oligo-dT18 primers, and M-MLV reverse transcriptase (Promega). RT-qPCRs for POU5F1, MYC, SOX2, NANOG, CBAF2T3::GLIS2, CBFA2T3, GLIS2, GUSB, and ABL1 were conducted in triplicates on a 7500-Fast cycler (Applied Biosystems) with cDNA corresponding to 40 ng total RNA per 20 µL reaction using 200 nM forward and reverse primers and 1x iTaq Universal SYBR-green Supermix (Bio-Rad). PCR efficiency of >90% was verified by standard dilution series and specificity by melt curve analyses. Relative quantification was performed by normalization to ROX reference dye, GUSB and ABL1 housekeeping gene expression, and either parental hiPSCs or M-07e cell line using the 2−∆∆Ct method within 7500 Software 2.3 and DataAssist 3.01 (Applied Biosystems). Oligonucleotide sequences are listed in Table 1.

Western blotting

Cells were washed with Dulbecco’s phosphate buffer saline (DPBS) and lyzed in high salt buffer (20 mM Tris-HCl pH 7.5, 400 mM NaCl, 0.5% NP-40, 0.3% Triton X-100, 0.2 mM phenylmethylsulfonyl fluoride, 1 µg/mL each of Aprotinin, Leupeptin and Pepstatin A). Lysates were cleared, denatured for 5 min at 95 °C after addition of 5x loading buffer (0.25 M Tris-HCl pH 6.8, 10% sodium dodecyl sulfate, 0.5 M dithiothreitol, 50% glycerol, bromophenol blue) and along with prestained PageRuler protein ladder (Thermo Scientific) subjected to SDS-PAGE using either self-cast 8% polyacrylamide or 5–15% gradient TGX gels (Bio-Rad) in Tris-Glycine buffer. Tank-blotted membranes (GE Amersham Protran 0.45 µm NC) were stained with Ponceau S (Sigma-Aldrich) to check for equal loading, scanned, incubated with 1x blocking reagent (Roche), primary and secondary antibodies labeled with DyLight 800 or 650, and scanned on a LI-COR Odyssey. Local background subtracted band signal intensities were quantified using Image Studio Lite 5.2.5 (Licor). Uncropped blots and source data are shown in Supplementary Data 7, antibodies are listed in Table 2.

Table 2 Antibodies used in this study

Immunofluorescence (IF) staining

For IF, hiPSCs were seeded into Matrigel-coated 24-well µ-plates (Ibidi). Pluripotent or differentiated cells were fixed with 1% methanol-free formaldehyde in DPBS for 10 min at room temperature (RT), permeabilized with 0.2% Triton X-100 in DPBS and sequentially incubated with primary antibody [HA, NANOG, OCT4, or SOX2; 4 µg/ml in 2% bovine serum albumin (BSA; Sigma-Aldrich) and 0.2% Triton X-100 in DPBS], secondary goat anti-mouse-IgG-AlexaFluor-488 antibody (1:2000) and 2 µg/mL 4′,6-Diamidino-2-phenylindole (DAPI; Sigma-Aldrich). Wells were covered with one drop of mounting solution containing 10% Mowiol 4-88, 25% glycerol, and 2.5% 1,4-Diazabicyclo(2.2.2)octane (all from Sigma-Aldrich) and 0.17 mm thick 15 mm diameter glass coverslips (Marienfeld). Direct IF of fixed cells was performed similarly with TRA-1-60-AF488 antibody diluted 1:10 in 2% BSA, 0.2% Triton X-100 in DPBS. Triton X-100 was omitted for CD34-APC IF to preserve mKO2 fluorescence. Images were acquired by sequential scan on a Leica TCS SP8X confocal microscope equipped with a 405 nm diode for DAPI and a white light laser (490 nm excitation for AF488 or FITC, 550 nm for mKO2, and 650 nm for APC), an HC PL APO CS2 40x/1.10 water immersion objective and LAS X software. For live cell imaging, microphotographs were acquired for mKO2 and transmitted light at 550 nm excitation and 35 °C. Antibodies are listed in Table 2.

Flow cytometry

For multicolor cell surface antigen staining, cells were incubated for 30 min with an antibody cocktail (CD117-BV421, CD56-BV650, CD19-BV570, CD33-BV711, CD42b-BV750, CD45-FITC, CD3-PerCP, CD41-PerCP-Cy5.5, CD34-APC) in 1x Brilliant Stain Buffer, washed with DPBS and analyzed on a FACSymphony A5 cytometer with FACSDiva 9.1, and processed with FlowJo 10.10.0 (all from Becton Dickinson Biosciences), or FCS Express 7 (DeNovo) software. Automatic compensation was performed using single and unstained controls. CG long-term lines (mKO2+) were analyzed along with M-07e and MOLT-4, which served as CG-AMKL and CG-negative T-ALL controls, respectively (Supplementary Fig. 5h–j). For marker expression analysis upon dTAG-13 treatment, CG long-term cultured cells were incubated with dead cell stain Zombie UV (Biolegend; gating strategy in Supplementary Fig. 15a) followed by a cocktail of CD117-BV421, CD41-PerCP-Cy5.5, CD56-BV650 and CD42b-BV750 antibodies. Antibodies are listed in Table 2.

Gene expression analysis (RNA-seq)

Library preparation and sequencing

For bulk RNA-seq, total RNA was isolated from live cells using TRIzol reagent (Thermo Scientific). A total of 500 ng RNA were used for strand-specific library prep using the NEBnext Ultra II kit (New England Biolabs, E7760L) and sequenced on an Illumina NovaSeq (2 × 150 cycles paired-end setup) by the Next Generation Sequencing Facility at Vienna BioCenter Core Facilities (VBCF), member of the Vienna BioCenter (VBC). An overview of all datasets is provided in Supplementary Data 1.

Data preprocessing and normalization

Raw sequence data were processed using the nf-core/rnaseq pipeline v3.5 (https://doi.org/10.5281/zenodo.5789421)83 with default parameters, including alignment to the hg38 reference genome using STAR algorithm v2.6.1d84 and read aggregation with respect to gene annotations from gencode v3985. Sequencing quality metrics from FastQC were loaded into R using fastqcr v0.1.2. We then used Salmon v1.5.2 for transcriptome quantification86. Read counts were further generated using HTSeq version 0.11.0 in union mode87, ignoring secondary and supplementary alignments using genecode annotation v3188. The htseq counts were only used at the reference mapping step for consistency with patient data. All subsequent analyses were performed in R v.4.2.3. In both cases, we only included genes detected (≥3 reads) in at least three samples and of biotypes “protein_coding”, “lncRNA”, and “miRNA” based on Ensembl release-107 GRCh38 GTF file and excluded genes flagged in “blacklist_classification_20210408” (part of St Jude RNA-seq pipeline). Raw counts were then normalized using varianceStabilizingTransformation() function from DESeq2 v1.38.3.

AML/AMKL reference mapping

For reference mapping we proceeded as follows: first, we explored the quality and metadata table of each dataset separately to exclude outliers and low-quality samples (four outlier samples from the reference [SJAMLM7015543_D1, SJAMLM7015539_D1, SJAMLM7015534_D1, SJAMLM7057_D] and non-sorted samples)23. Then, principal component analysis (PCA) was applied using prcomp() function from stats v4.2.3 separately for each of the query models and patient reference datasets. Next, we selected the top-50 components and excluded the third component from downstream steps because of its strong association with the strandedness of samples in the reference dataset. After that, PCA components of query datasets were multiplied by the reference rotation, after applying the same rescaling (center and scale) of the reference dataset. Finally, the results of both the reference and projected query data were visualized using umap(ret_nn = TRUE, ret_model = TRUE) from from uwot v0.1.16.9000. Distances of query samples to reference samples were computed using dist(diag = FALSE, upper = FALSE) from stats v4.2.3 followed by scaling using scale(center = FALSE) from base v4.2.3.

pLSC6 stemness scoring

Stemness scoring was computed based on the gene coefficients from the pLSC6 model37. Out of the six genes, two had to been renamed with updated aliases available in our data (KIAA0125 = FAM30A, GPR56 = ADGRG1). For each sample, the stemness score was computed as the weighted sum of pLSC6 genes’ expressions.

Differential expression analysis and gene set enrichment analysis

Differential expression analysis was carried out using DESeq2 v1.38.3 per dataset using as a reference the control samples at earliest timepoint40. Differentially expressed genes (DEGs) were defined as genes with padj < 0.05 and absolute log2FoldChange ≥1. Lists of up- and down-regulated genes in AMKL patients were collected from two publications5,36. For the latter we expanded the regulated genesets (CG-AMKL compared to other AMLs) with a list kindly provided by Soheil Meshinchi (combination of Children’s Oncology Group cohorts AAML0531 and AAML1031)36. Then, gene set enrichment analysis (GSEA) was computed using Over Representation Analysis (ORA) approach implemented in hypeR(background = set of all genes available in our dataset) from hypeR v1.14.0 comparing DEGs from each group in each dataset to the AMKL patient genesets89. hypeR output was then visualized using scatterpie v0.1.7.

Transcription factor binding analysis (ChIP-seq)

Library preparation and sequencing

For ChIP-seq, approximately 107 CG-expressing long-term cultured or M-07e cells were fixed for 10 min at RT with 1% methanol-free formaldehyde in DPBS, and chromatin was prepared with the truChIP chromatin shearing kit according to NEXSON protocol90 by ultrasonication for 20 min at 75 W peak incidence power, 10% duty factor, 200 cycles per burst and 7 °C (Covaris M220). Immunoprecipitation was performed with chromatin of 3–4 × 106 cells, 3 µg HA, CBFA2T3, or control normal mouse immunoglobulin G (antibody list in Table 2), and Protein G dynabeads (Invitrogen). Libraries were prepared from input and immunoprecipitated DNA using the NEBnext Ultra II kit (New England Biolabs, E7645L) and sequenced on an Illumina NovaSeq (either 2 × 150 cycles in paired-end [CG_22_HA_IP_R1.1, CG_22_INPUT_R1.1, sub_M07E_CBFA2T3_IP, sub_M07E_INPUT] or all other samples in single-end 100 cycles setup) by the Vienna BioCenter Core Facilities (VBCF). An overview of all datasets is available in Supplementary Data 1.

Data preprocessing and counting

Raw sequence data were processed using the nf-core/chipseq pipeline v1.2.2 (https://doi.org/10.5281/zenodo.3966161)83 with default parameters. Reads were aligned to the hg38 reference genome using BWA-MEM v0.7.17-r1188 peaks enriched over input were called using the MACS2 v2.2.7.1 narrow peaks algorithm39,91. Next, we loaded the narrow peaks files into R v4.2.3 using read_narrowpeaks() from plyranges v1.18.0 and the associated statistics using read.delim(comment.char = “#”)92. We excluded peaks with MACS2 log10.qvalue < 2, that overlapped with the ENCODE blacklist problematic regions of the genome, or peaks that mapped to non-standard chromosomes. The set of consensus peaks were defined by merging overlapping peaks detected in at least two of four HA antibody replicates. We then used featureCounts(countMultiMappingReads = FALSE, nthreads = 5) from Rsubread v.2.12.3 in single-end mode except for one paired-end sample (CG_22_HA_IP_R1.1) to generate the ChIP-seq count matrix93. All ChIP-seq peaks are reported in Supplementary Data 3.

Variant identification

Variant calling was performed using the nf-core/sarek pipeline v3.3.2 (https://doi.org/10.5281/zenodo.8405436)83,94,95 in a tumor-normal matched mode for whole genome sequencing (WGS) data, whereas ChIP input samples were compared to published data from the parental hiPSC line96. Raw reads were aligned to the hg38 reference genome using BWA-MEM91 v0.7.17-r1188 and variants were called with GATK Mutect297 and Strelka298 using the pipeline’s default settings. Resulting VCF files were annotated using Ensembl Variant Effect Predictor (VEP) v11099 including dbNSFP v4.7a100,101 annotation with CADD v1.7102. Mutect2 results were further normalized using bcftools norm v1.3.1103. Annotated VCF files were imported and filtered using R v.4.2.3. The identified variants were filtered based on the default quality control measures implemented in each tool (FILTER column in the VCF contains “PASS”). In addition, the following filter criteria were applied: (1) keep variants with a population allele frequency of less than 0.1% (defined by the frequency of existing variants in gnomAD exomes or genomes combined population; variants with undefined frequency were not filtered as well), (2) keep variants with a variant allele frequency of at least 5%, (3) keep variants were the annotated consequence containing “coding_sequence_variant”, “frameshift_variant”, “incomplete_terminal_codon”, “inframe_deletion”, “inframe_insertion”, “missense_variant”, “protein_altering_variant”, “start_lost”, “stop_gained” or “stop_lost”. The filtered list of variants is reported in Supplementary Data 2.

Copy number calling

We used the ACE R package v1.20.0104, which is intended for low-coverage WGS data, and the ChIP-input samples to estimate absolute copy number in the edited long-term cell lines. BWA-aligned BAM files from the nf-core/chipseq results were used as input. runACE() was used with a bin-size of 100 kb and QDNAseq.hg38 annotation v1.1.0 (https://doi.org/10.5281/zenodo.4274556). A square model analysis was performed using the squaremodel() function including a penalty parameter of 0.5 penalizing fits at lower cellularities. We manually examined the solutions with smallest minimum absolute deviation score and selected one which best fits the expected ploidy = 2 and cellularity = 1. Adjusted segments were calculated and log2-transformed copy number ratio plots colored by copy number state were generated using ggplot2.

Chromatin accessibility analysis (ATAC-seq)

Library preparation and sequencing

For bulk ATAC-seq, nuclei were prepared from 100–250 k frozen cells by gentle lysis in 100 µl isolation buffer (10 µl of 1 M Tris-HCl pH 7.5, 2 µl of 5 M NaCl, 3 µl of 1 M MgCl2, 1% BSA, 0.1% Tween-20, 0.05% NP40, 0.005% Digitonin, 1 µl of 1 M DTT, 25 µl of 40 U/µl RNase inhibitor in 843 µl of water) for 5 min on ice. Nuclei were washed with 400 µl of wash buffer (250 µl of 1 M Tris-HCl pH 7.5, 50 µl of 5 M NaCl, 75 µl of 1 M MgCl2, 1% BSA, 0.1% Tween-20, 25 µl of 1 M DTT, 625 µl of 40 U/µl RNase inhibitor in 20.925 ml of water), strained (40 µm), and centrifuged for 5 min (500 × g, 4 °C). Nuclei were resuspended in nuclei buffer (50 µl of 20x nuclei buffer, 1 µl of 1 M DTT, 1% BSA, 824 µl water). Nuclei isolation quality was assessed using concentration, viability, and aggregates measured with nucleocounter. Tagmentation was performed for 50,000 targeted nuclei using the Illumina Tagment DNA TDE1 Enzyme and Buffer Kits #15027865. Post-tagmentation DNA fragment recovery was performed using a Guanidium-based method.

Briefly, 25 µl of 8 M guanidinium hydrochloride solution pH 8.5 (MBS) was added for 5 min, followed by 120 µl of binding solution (BS: 20% PEG 8000, 2.5 M NaCl, 10 mM Tris-HCl pH 8, 1 mM EDTA, 0.05% Tween-20) and 30 µl of MBSpure beads. After 10 min of incubation, fragments were purified on a magnetic stand and washed twice with 150 µl of 80% EtOH. After 2 min of air dry, beads were resuspended in 13 µl of EB buffer and magnetized. The eluted solution was transferred to a fresh tube for library amplification [12.5 µl of the purified DNA was mixed with 10 µl of 2.5 µM i5/i7 Nextera index primer mix, 2.5 µl of Evagreen and 25 µl of 2x PCR Mastermix Q5 (Non-Hot Start, NEB # M0492L)]. After library amplification, fragments corresponding to mono- and di-nucleosomes were enriched using a double-sided DNA clean up with MBSpure beads. DNA was quantified on a Qubit and the library size distribution of the DNA fragments was finally assessed using a Fragment Analyzer. Libraries were sequenced on an Illumina NovaSeq (2 × 150 cycles paired-end setup) by the Next Generation Sequencing Facility at Vienna BioCenter Core Facilities (VBCF). An overview of all datasets is available in Supplementary Data 1.

Data analysis

Raw sequencing data were processed using the nf-core/atacseq pipeline v1.2.1 (https://doi.org/10.5281/zenodo.3965985)83 using default settings. Reads were aligned to the hg38 reference genome using BWA-MEM91 v0.7.17-r1188 and peak calling was done using MACS2 v2.2.7.139.

Reading peaks and generating counts

We loaded the broad peaks files into R v.4.2.3 using read.delim(comment.char = “#”) from utilis v4.2.3 and applied the same exclusion criteria as for ChIP-seq peaks (excluding peaks with log10.qvalue < 2, overlap with the ENCODE blacklist regions, or mapped to non-standard chromosomes). We defined the pairwise similarity among ATAC-seq samples as the ratio of the number of overlapped peaks relative to the number of peaks in the in the smaller sample (i.e., fewer peaks). This resulted in a symmetric similarity matrix that was then passed to hclust(method = “average”) from stats v4.2.3 for hierarchical clustering and visualized using ComplexHeatmap v2.15.4. The set of consensuses ATAC-seq regions was defined by merging overlapping peaks detected in at least three samples. We then used featureCounts(isPairedEnd = TRUE, countMultiMappingReads = FALSE) from Rsubread v.2.12.3 to generate the ATAC-seq count matrix per region followed by variance-stabilizing transformation using DEseq2 v1.38.340. We used the normalized matrix to run PCA using prcomp(retx = TRUE, center = TRUE, scale. = FALSE). We used the same reference mapping approach explained in the RNA-seq section to integrate external M-07e and AMKL7 samples as a query with our data as a reference15. The ATAC-seq peaks are reported in Supplementary Data 3.

Differential accessibility analysis and chromatin modules

Then, we used DEseq2 v1.38.3 for differential accessibility analysis between CG and control samples at each time point separately. In case of lack of controls at a timepoint we used controls from the previous time point. We passed region counts, genomic coordinates, and sample metadata to DESeqDataSetFromMatrix() followed by DESeq2() to run the differential count analysis on the Negative Binomial distribution. Finally, we retrieved the results in the form of a Granges object using results(format = “GRanges”) to get analysis statistics in the form of a GRanges object for each comparison. Regions with both padj < 0.001 and absolute |log2FoldChange | ≥ 1 were defined as differentially accessible regions (DARs). The set of DARs defined at each timepoint were further split based on the accessibility state with respect to CG samples indicated by the log2FoldChange (open = log2FoldChange ≥ 1, closed = log2FoldChange ≤ −1). This resulted in six sets of DARs, two (open/closed) at each time point (Day 13, Day 25, Long-term). To visualize the size and overlap between these sets we used Upset plot functionalities from ComplexHeatmap105,106. To do that, we used make_comb_mat() and UpSet() to generate a combination of all sets of at least 500 regions and visualization, respectively. DARs overlapping ChIP-seq peaks were selected and used in downstream analyses. The results of the differential accessibility analysis are reported in Supplementary Data 4.

Region annotation by genomic feature

We used ChIPseeker107,108 v.1.34.1 to annotate regions with respect to their distance to known genes and biomaRt109 v.2.54.1 to access Ensembl Human Regulatory Features using useEnsembl(biomart = regulation”, dataset = “hsapiens_regulatory_feature”) and retrieve identified enhancers in K562 cell line using getBM(filters = c(“chromosomal_region”, “epigenome_name”), values = list(epigenome_name = “K562”), mart = ensembl). We adapted the following annotation priority: promoters (transcription start site ±1000 bp), enhancers (Encode K562), intragenic (exon, intron, UTR), intergenic (≥300 bp downstream gene end).

Motif enrichment analysis

The motif enrichment workflow was implemented following the function FindMotifs() from Signac v1.11.0110. We started by downloading a list of position frequency matrices (PFM) from JASPAR database41,111 v2024 (CORE_vertebrates_non-redundant). Motifs with multiple entries were made unique by appending a number. Next, we obtained motif matching results across the whole set of regions using matchMotifs(genome = BSgenome.Hsapiens.UCSC.hg38.masked, out = ‘matches’) and motifMatches(), both from motifmatchr v1.20.0 Then, for each set of query regions, we used RegionStats() and MatchRegionStats() to generate a matching set of background regions (n = 40,000) based on genomic statistics (GC content, region lengths, and dinucleotide base frequencies). Finally, we used phyper(lower.tail = FALSE) from stats v4.2.3 to run right-sided hypergeometric test. We adjusted p-value for multiple testing using p.adjust(method = “BH”) and calculated fold-enrichment as the quotient of observed matchings in the query divided by matchings in background. Motif enrichment results are available in Supplementary Data 4.

Chromatin coverage signal

Coverage signals at DARs were analyzed profileplyr v1.14.1 starting from the BAM files using BamBigwig_to_chipProfile (format = “bam”, style = “percentOfRegion”, nOfWindows = 50, distanceAround = 100). For each region, we averaged, scaled, and smoothed the signal per group and highlighted the regions overlapping ChIP-seq peaks. Finally, the regions were arranged in descending order based on the average ATAC-seq signal and visualized using EnrichedHeatmap v1.28.1112. ATAC-seq and ChIP-seq genome tracks were visualized using Gviz v1.46.1113.

Single-cell analysis (scRNA-seq and scRNA/ATAC-seq [multi-ome])

Library preparation and sequencing

Single-cell suspensions were processed using the 10x Genomics Single Cell 3’ v3.1 workflow following the manufacturer’s instructions. For the 2020 batch, fresh cells were processed directly. For the 2022 batch, cells were cryopreserved in CryoStor CS10. Vials were thawed in a water bath at 37 °C and warm DMEM was added dropwise. Cells were washed once with DMEM and resuspended in PBS + 10% FBS. Cells were counted and biological replicates were barcoded with distinct CELLPLEX labels and the same number of cells from each sample was pooled. The same frozen samples were processed for multi-omic scRNA/ATAC-seq analysis using the 10x Genomics Single Cell Multiome ATAC + Gene Expression v1 workflow. After quality control using Qubit (Invitrogen) and TapeStation (Agilent), libraries were sequenced on the Illumina NovaSeq platform in 2 × 50 bp (scRNA-seq) and 2 × 150 bp (scRNA/ATAC-seq) paired-end mode. Supplementary Data 1 provides an overview of sequencing data and performance metrics.

Raw data preprocessing

We used CellRanger v7.0.0 (10x Genomics, parameters: CHEM = SC3Pv3 –nosecondary) for analysis of the scRNA-seq and CellRanger-ARC v2.0.2 (10x Genomics, default parameters) for the analysis of scRNA/ATAC-seq data, including alignment to GRCh38-2020-A human reference transcriptome and cell demultiplexing.

Count data processing and annotation (scRNA-seq)

The R statistics software v4.2.3 was used to carry out the entire analysis workflow. We loaded the counts, metadata, and demultiplexing data into a Seurat object using Read10X() and CreateSeuratObject() from Seurat v4.2.051,114,115. Cell cycle phase was inferred based on expression of listed G2M and S phase markers and CellCycleScoring() both available in Seurat. Quality control was applied both on levels of genes and cells. For the former, like bulk RNA-seq data, we only included genes with biotype labeled as “protein_coding”, “lncRNA” or “miRNA” and detected in at least 20 cells. For the latter, we included cells that passed all the following criteria: percent.mt ≤ 20, nFeature_RNA ≥ 500, nFeature_RNA ≤ = 4000, percent.rp ≥ 5, percent.rp ≤ 25, neither “Doublet” nor “Negative”. Next, we normalized the counts using SCTransform(do.correct.umi = FALSE)116, generated a PCA dimensionality reduction using RunPCA() and integrated the batches using RunHarmony() from Harmony v1.2.053. To define clusters we started by generating sNN graph based on the top-15 components (the 8th component was excluded due to association with batch effect) of the SCT-normalized and Harmony-integrated embeddings using FindNeighbors(reduction = “harmony.SCT”, k.param = 100, compute.SNN = TRUE), followed by FindClusters(resolution = 0.3, group.singletons = FALSE). We further subclustered the central branching cluster using FindSubCluster() from Seurat. To save the Neighbor object in the Seurat object, we used FindNeighbors() again as shown above, but this time setting return.neighbor = TRUE. After that, we generated a UMAP dimensionality reduction based on the Neighbor object using RunUMAP(n.neighbors = 30, nn.name = name of Neighbor object, reduction.name = “umap.nn”, return.model = TRUE) from Seurat. Trajectory inference was carried out using slingshot52 v2.6.0 on the SCT normalized and Harmony-integrated embeddings. For the control subset, we used the top 5 components of the SCT normalized and Harmony-integrated embeddings to model the multi-lineage differentiation, used “scaled.diag” as a distance method, sat extend parameter to “n”, and did not allow for breaks between lineages to get a single connected minimum spanning tree. The earliest common cluster (“HSPC”) was defined as the origin cluster, and the latest stage of each lineage, namely, “MK_late”, “Ery_late”, “Mac”, “Baso/Eosino”, and “Mast” as end clusters. The underlying minimum spanning tree was visualized using ggraph v2.0.5.

Count data processing and annotation (scRNA/ATAC-seq)

Data from each modality was loaded separately and merged into a single Seurat object as two different assays and we used the Seurat51,91,114,115 v.4.2.0 and Signac110 v.1.11.0 workflow for the downstream analysis of each modality, respectively. For each sample, scRNA-seq count data were loaded into R using Read10X(), while the called scATAC-seq peaks were merged by overlaps with peaks from bulk ATAC-seq into a set of non-overlapping consensus peaks (width between 143 bp and 7618 bp). Then, we loaded the scATAC-seq fragment files and generated the scATAC-seq count matrix using CreateFragmentObject(validate.fragments = TRUE) and FeatureMatrix(), respectively. Finally, the Seurat objects from different samples were merged into a single object that was used for downstream analysis. We started by applying QC filtration criteria (scRNA-seq: nCount_RNA ≤ 10,000 & nCount_RNA ≥ 1500 & percent.mt <= 25 & percent.rp <= 20; scATAC-seq: nCount_peaks >= 2000 & nCount_peaks <= 10,000 & pct_reads_in_peaks >= 70 & blacklist_ratio <= 0.025 & nucleosome_signal <= 2 & TSS.enrichment >= 3 & TSS.enrichment <= 6 & pct_mitochondrial <= 0.1).

Then, for data normalization and dimensionality reduction of the scRNA-seq modality we used SCTransform116 and PCA followed by UMAP. For scATAC-seq we applied TF-IDF data normalization using RunTFIDF() and selected the top features using FindTopFeatures() to apply partial Singular Value Decomposition (SVD) using RunSVD(), followed by generating UMAP on singular values 2-15 using RunUMAP(). To focus more on the MK-like subset of the data, we adapted the same cell projection approach used to map the CG scRNA-seq subset onto the control scRNA-seq reference (see “Cell projection and label transfer”) with the scRNA/ATAC-seq data by leveraging the scRNA-seq modality as a bridge and selected cells annotated as “MK/Ery_pre”, “MK/Ery_early”, or “MK_late”. Next, we transferred the cluster labels (s0, ctr1, cg1-4) defined on the Mk-like scRNA-seq trajectory so that we could refine peak calling at cluster level instead of the whole sample. To do that, we used Herper v.1.8.1 to install MACS2 to a conda environment and then used CallPeaks(combine.peaks = FALSE) grouped by the transferred MK-like cluster labels. Identified peaks from each cluster were QC-filtered (neg_log10qvalue_summit >= 3) and then merged with the peaks from the ATAC-seq data.

Cell projection and label transfer

We used Seurat51 v.4.2.0 reference mapping approach to project query cells onto reference data. The mapping workflow starts with finding the set of anchors between reference and query datasets using FindTransferAnchors(normalization.method = “SCT”, reference.assay = “SCT”, query.assay = “SCT”, reduction = “pcaproject”, reference.reduction = “pca.SCT”, dims = 1:15, mapping.score.k = 100) and added evaluated the representation of query cells in reference using MappingScore(ndim = 15). Next, the pre-computed anchors were utilized for datasets integration using IntegrateEmbeddings(reductions = “pcaproject”). Finally, query cells are projected onto the reference UMAP model using ProjectUMAP(reference.reduction = “pca.SCT”). We leveraged the annotated control subset, the projected CG subset, the M-07e sample, and external datasets47,48,49. We then transferred both the discrete cluster annotations and continuous pseudotime scores from the control reference to the query datasets.

scRNA-seq velocity

We defined the MK subpopulation in our data as controls and CG cells annotated as “MK/Ery_pre”, “MK/Ery_early”, or “MK_late”. This subset was reprocessed following the same workflow stated above: normalization, dimensionality reduction, integration, and clustering. We then carried out scRNA-seq velocity analysis as follows: first, we installed velocyto117 v.0.17.17 to a conda environment with Python 3.9 and used it to generate the spliced and unspliced matrices. Next, we read the output loom files in R using velocyto.R v.0.6 read.loom.matrices() and merged the splicing information with expression, clustering, and dimensionality reduction information into an Anndata object using anndata v.0.9.2. Python packages were imported to R via reticulate v.1.24. Next, we used neighbors() from scanpy v.1.9.8 to define the kNN graph based on the top 30 Harmony-integrated components53 and passed it to moments() from scVelo54 v.0.3.2 to compute first/second-order moments from each cell based on its nearest neighbors. Finally, we used velocity(mode = “stochastic”) to use a stochastic model to infer velocities and velocity_embedding_stream(add_margin = 0.1, min_mass = 0, cutoff_perc = 0) for overlaying a stream plot of velocities on top of Harmony53 embeddings.

Gene regulatory networks inference

For GRN inference, we used SCENIC60 v1.3.1 to identify cluster specific regulons. As a reference we used human cisTarget v10 database collections linking gene to motifs up to 10 kb from the TSS in both directions. As input, we filtered the raw read counts from our data using geneFiltering(minCountsPerGene = 3*.01*#cells, minSamples = #cells*.01) followed by log2(counts+1) normalization. Then, we computed the spearman correlation using runCorrelation(), passed it to grnboost2() from arboreto118 v.0.1.6 to compute co-expression matrix, and used runSCENIC_1_coexNetwork2modules(weightCol = “importance”) to obtain the co-expression modules. Next, the co-expression modules were further refined using runSCENIC_2_createRegulons() based on TF-motif enrichment analysis and then used to compute regulon activity AUCell scores using runSCENIC_3_scoreCells(). Finally, cluster-specific regulons were identified using calcRSS()119. All regulon scores are available in Supplementary Data 6.

Differentiation potential inference

To quantify the differentiation potential of different clusters we used SCENT55 v.1.0.3 to compute signaling entropy of gene expression based on human protein-protein interaction (PPI) network. To start with, we utilized the interaction network (net17Jan16.m) provided with SCENT and converted the entrez IDs to gene symbols using mapIds(column = “SYMBOL”, keytype = “ENTREZID”, multiVals = “first”) from AnnotationDbi v.1.60.2. Then, we computed the correlation between the SCT normalized expression and PPI using CompCCAT() and scaled the output by maximum value.

Mk-like modules detection

DEGs were defined as the set of clusters’ markers identified by DElegate56 v.1.1.0 FindAllMarkers2(min_fc = -Inf, min_rate = 0) with padj < 0.001 and absolute log_fc > =1 at a detection rate of at least 0.1. To capture the main expression patterns of DEGs and preserve the cell variability, we first generated a balanced representation of each cluster as the averages of pseudo-replicates (n = 100) SCTransform-normalized expression. To reduce the effect of extreme values, we capped the maximum value to the 99th quantile and scaled the data by the maximum value. Then, we computed the distance matrix based on cosine correlation using corDist(method = “cosine”) from MKmisc v.1.8 and used hclust(method = “ward.D2”) to define hierarchical clustering tree that was subsequently cut into 10 DEGs modules (G1-10). We then looked at the enrichment of CG-binding sites around the TSS of DEGs from each module to define candidate direct targets. To do that, we utilized great(tss_source = “txdb:hg38”) from rGREAT120 v.2.0.2 using as background all peaks from scATAC-seq, DEGs modules as genesets source of TSS, and ATAC-seq peak modules overlapping CG-binding sites as regions of interest. All DEGs are listed in Supplementary Data 6.

Module scores of genes and peaks

We used AddModuleScore(search = TRUE) from Seurat to compute the activity of the CG+ patients genesets defined in Gruber 2012 across the cell of the scRNA-seq megakaryocyte subset. The results were visualized as a scatter plot with 2D density contours where the color and transparency were scaled by the density of points and presented separately for each cluster.

To compute the accessibility scores for the set of peaks defined in the ChIP-seq analysis in the cells from the scATAC-seq assay, we start by defining a new subset of scATAC-seq regions that overlaps with the ChIP-seq peaks. Then, we used AddChromatinModule(genome= BSgenome.Hsapiens.UCSC.hg38.masked) from Signac to calculate the chromVAR deviation for these regions. The resulting deviation values were restricted between the 5th and 95th quantiles and visualized using ggplot and scale_color_scico(palette = “vik”,midpoint = 0) from scico (https://github.com/thomasp85/scico) to fix the midpoint.

Defining the point of fate divergence

We observed that unlike terminal differentiation states where cells are similar and part of the same cluster, the point of divergence is characterized by a heterogeneous neighborhood located at the interface of different clusters. We leveraged this observation to define the point of divergence based on neighborhood heterogeneity. First, we used MixingMetric(dims = 1:2, k = 5, max.k = 1000) from Seurat to quantify the cluster purity of the neighborhoods surrounding day 13 megakaryocytes in the top two coordinates of the Harmony space. Then, we smoothed the purity scores across neighboring cells in the Harmony space using SmoothKNN(k = 500) from UCell121 and selected the top set of cells based on the distribution of smoothed values (=<990). This subset of cells captures the state of divergence and early state of commitment to one of the fates that we characterized in this study.

To unravel the early regulatory perturbations and their influence on the destined cell fate, we compared cells committing to aMK and aMKP fates to their control counterparts. To do that, we first assigned the selected early-stage cells to one of the fates based on the cluster that they are part of: ctrl1 to MK, cg1 to aMK, and s0 to aMKP, and labeled these newly defined cell sets as ctrl1D13, cg1D13, and s0D13, respectively. Next, we used the regulon activity score inferred by SCENIC to calculate the regulon specificity (RSS) for each fate (see “Gene regulatory networks inference” section above). Finally, to quantify the regulatory perturbations in s0D13 and cg1D13 with respect to ctrl1D13, we divided the RSS scores by the corresponding score in ctrl1D13 and used the log2 of the absolute values to quantify the deviation of regulatory activity of CG fates from control.

scRNA/ATAC-seq loci plots

We used Signac to plot genomic coverage per cell cluster at selected loci of interest using CoveragePlot(peaks = FALSE, annotation = FALSE, extend.downstream = 0, extend.upstream = 0, downsample.rate = 1, links = FALSE, window = 100). Gene annotation and expression values of selected genes were generated using AnnotationPlot() and DotPlot(assay = “SCT”, slot = “data”), respectively. Then, we used LinkPeaks(peak.assay = “peaks”, expression.assay = “SCT”, verbose = FALSE, method = “pearson”) to find peaks whose accessibility is correlated with the expression pattern of differentially expressed genes and plotted the peak-gene connections using LinkPlot(). Finally, we used the ChIP-Atlas122,123,124 database to find TF binding sites at the loci of interest in blood- and leukemia-related cell types (“K-562”, “Acute_myeloid_leukemia”, “Leukemia_cell_line”, and “LeukemiaKOMMA_Myeloid”). We used all binding sites at threshold of 50 (as defined by the web portal; data accessed on July 11, 2024). Additionally, we highlighted the CG-binding sites by overlapping the ChIP-seq peaks with “open” and “closed” ATAC-seq modules (see “Differential accessibility analysis and chromatin modules” section and Fig. 5c). The results were plotted using PeakPlot(), where TF binding sites are shown as horizontal black bands at the bottom and CG-binding sites as vertical transparent red bands.

Motif activity and footprinting

We used the Position Weight Matrix (PWM) of the same JASPAR motifs used in the bulk ATAC-seq analysis (see “Motif enrichment analysis” section above) as input to AddMotifs(genome = BSgenome.Hsapiens.UCSC.hg38.masked) from Signac to construct a matrix of motif matching sites and add it as a Motif object to the Seurat object. Then, we defined a set of differentially accessible peaks using DElegate56 v.1.1.0 and selected peaks with padj<0.001 and absolute log_fc >= 1 to compute chromVAR motif activities with RunChromVAR(). We visualized chromVAR deviation scores of motifs over the projected trajectory as a proxy of the TF activity.

We performed motif footprinting grouped by the different clusters using Footprint(verbose = TRUE, genome = BSgenome.Hsapiens.UCSC.hg38.masked) from Signac. We visualized the results using a modified version of PlotFootprint() to allow cluster-wise smoothing of the footprinting signal. We used the moving-average smoothing implemented in centerRollMean(smooth.window = 5) fom ArchR v.1.0.2125.

DepMap analysis and visualization

We downloaded the DepMap 22Q2 Public dataset of Gene Effects (i.e., “efficacy scores”; https://doi.org/10.6084/m9.figshare.19700056.v2) derived from CRISPR knockout screens by Broad’s Achilles and Sanger’s SCORE126,127,128,129 and inferred by CHRONOS128 from the DepMap portal. To obtain cell-line-specific “selectivity scores”, the efficacy scores for each gene were scaled by subtracting the mean value across all cell lines. We then focused on the M-07e cell line (DepMap ID: ACH-000602), generated a scatter plot of efficacy vs. selectivity, and highlighted the top M-07e-selective genes.

Statistics and reproducibility

Statistical analyses of cell culture-related assays were performed using GraphPad Prism 10.0.2 software. Statistical significance was calculated by one-way ANOVA with Dunnett’s post hoc test with p ≤ 0.05 considered statistically significant as indicated in the figure legends. All significant p values are reported in the figures or corresponding figure legends. No statistical methods were used to predetermine sample sizes. We chose sample sizes and assumed Gaussian distribution and homogeneity of variance as commonly reported for similar experiments in previous publications. Independent experiments were analyzed together whenever possible, and the numbers of independent replicates are indicated in the figures or legends. For MethoCult and MegaCult clonogenicity assay experiments, no outliers were excluded from the data. For MegaCult assays, CFU numbers and types of each sample were determined by two different individuals and averaged. Following samples were excluded because of insufficient data quality: for RNA-seq, 4 outlier samples from the St. Jude reference (SJAMLM7015543_D1, SJAMLM7015539_D1, SJAMLM7015534_D1, SJAMLM7057_D)35 and not GFP+ sorted samples23; for ChIP-seq, 1 of 5 replicates for CG-dTAG-HA in CG long-term cells; for ATAC-seq, 1 replicate of 3 for CG +dTAG-13 on D25; and for scRNA/ATAC-seq, the RNA expression part for WT ±dTAG-13 on D25 and CG +dTAG-13 on D25 samples.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Continue Reading