Deciphering the sequence basis and application of transcriptional initiation regulation in plant genomes through deep learning

Plant growth conditions and material collection

Wild soybean (Glycine soja PI468916) and cultivated soybean (G. max cv. Williams 82) were grown in a greenhouse under a 16 h light/8 h dark photoperiod at 25 ± 2 °C. Plants were maintained on a vermiculite:soil mix (3:1) and watered as needed. For nodulation assays, seedlings were inoculated with Bradyrhizobium diazoefficiens USDA 110 and grown for 20 days post-inoculation (dpi) alongside uninoculated controls, following the protocol of of Ren et al. [23]. Eight tissues were harvested at the development stages of trefoil, flowering, and seed development. Cotton (Gossypium hirsutum TM 1), rapeseed (Brassica napus ZS11), wheat (Triticum durum ‘Svevo’), tomato (Solanum lycopersicum LA1589), rice (Oryza sativa ZS97), and maize (Zea mays B73) were grown under the same greenhouse conditions. Leaf tissues were harvested at young developmental stages. For each species and tissue, samples were collected from at least eight individual plants, pooled, flash-frozen in liquid nitrogen, and stored at − 80 °C until RNA isolation.

RNA isolation and rRNA depletion

Total RNA was extracted from each tissue using TRIzol reagent (Invitrogen/Life Technologies, CA) following the manufacturer’s protocol. To eliminate genomic DNA contamination, RNA samples were treated with the TURBO DNA-free Kit (Invitrogen). For nodule tissue of PI468916, rRNA was removed in two steps: first with the NEBNext rRNA Depletion Kit (Bacteria) (New England Biolabs), then with the RiboMinus Plant Kit for RNA-Seq (Invitrogen). RNA from the remaining seven tissues underwent rRNA depletion using only the RiboMinus Plant Kit for RNA-Seq.

Modified STRIPE-seq library construction and sequencing

To enrich for capped transcripts, total RNA was subjected to mRNA capture using VAHTS mRNA Capture Beads 2.0 (N403, Vazyme). The captured mRNA was then incubated with XRN-1 (R7040M, Beyotime) for 1 h to digest non-capped RNAs. Template-switching reverse transcription (TSRT) (2,964,768, invitrogen) was then performed using a unique barcoded reverse transcription oligonucleotide (RTO) for each sample. A 5′-biotin modified, unique molecular identifier-containing template-switching oligonucleotide (TSO) with three 3′ riboguanosines was added to capture the 5′ end. VAHTS DNA Clean Beads (N411, Vazyme) cleanup is then performed to remove excess TSO and RTO. The resulting products were amplified by PCR using TruSeq adapter-compatible primers. Depending on the input RNA amount, 16 to 20 cycles of PCR were carried out with the forward library oligo (FLO) and reverse library oligo (RLO). Solid-phase reversible immobilization bead (N411, Vazyme) cleanup was used to size-select fragments between 250 and 750 base pairs. Final libraries (25–100 ng) were sequenced on an Illumina NovaSeq PE150 platform to produce 150-base-pair paired-end reads at Novogene Corporation Inc. (Tianjin, China). Primer sequences for the RTO, TSO, and library PCR primers are listed in Additional file 2: Table S13.

Processing of STRIPE-seq data

Raw sequencing read quality was assessed with FastQC version 0.11.9 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Only Read 1, which corresponds to the sense strand of RNA, was used for downstream analysis. Adapter and poly-G tails were removed with cutadapt version 2.5 [24]. Reads shorter than 50 nucleotides or lacking the TATAGGG motif were discarded. PCR duplicates were collapsed using the UMI information via the fastx_toolkit function fasx_collapser version 0.0.14 (http://hannonlab.cshl.edu/fastx_toolkit). The TATAGGG motif was then trimmed from the remaining reads with cutadapt. Cleaned reads were first aligned to rRNA sequences using HISAT2 with default settings [25]. Unmapped reads from this step were retained and then aligned to the reference genome sequences of each species with HISAT2 (Additional file 2: Table S1). Only alignments with mapping quality greater than 30 were kept by filtering with SAMtools version 1.8 [26]. For nodule samples, additional filtering removed any reads that mapped to the Bradyrhizobium diazoefficiens USDA 110 genome. Transcription start site regions (TSRs) were then identified, quantified, and annotated following the pipeline described in Wang et al. [10], and all custom scripts are available on our GitHub repository (https://github.com/Wanjie-Feng/STRIPE_seq).

Data preprocessing of GenoRetriever

We selected a 4-kb window (− 3500 bp upstream, + 500 bp downstream of the dominant TSS) to focus on cis-regulatory determinants while avoiding the confounding influence of distal trans-acting regions. Empirical surveys of plant promoters show that > 95% of core and proximal elements, including TATA boxes, initiators, and most enhancer-promoter contacts, reside within this span, whereas sequences further upstream frequently belong to neighbouring genes or distal enhancers. Expanding the window much beyond 4 kb would therefore add noise and greatly increase the computational cost of training and interpreting the large-kernel convolutional network; contracting it would omit bona-fide regulatory motifs. Because Our largest convolutional kernel is 601 bp, we padded each end of the 4 kb analytical window with an additional 325 bp that is ignored during loss computation. This masked “buffer” provides complete upstream and downstream context for every base inside the training region and eliminates edge artefacts that arise when convolutional filters overlap sequence boundaries, analogous to preventing tokenisation errors in natural-language models.

For the single-tissue GenoRetriever model, transcription start sites (TSS) with counts per million (CPM) greater than one were selected as training examples. In the multi-tissue combined model, each TSS abundance was defined as the sum of CPM values across all tissues. To minimize false positives, only those TSS that exceeded a CPM of one in at least two tissues were retained for training. Before input into the model, all CPM values were transformed using a log base two scale according to the formula

$$begin{array}{c}target= {log}_{2}left(x+1right)end{array}$$

(1)

where x is the raw CPM.

After log transformation, the signals were smoothed using a one-dimensional convolution layer with a window size of five. This smoothing step reduces differences in apparent peak width between TSS and removes positional noise arising from experimental variation. For validation in other crop species, TSS were ranked by relative expression and the top 60% were used as training data. This selection excludes low-abundance sites that may represent experimental noise. To minimize sequence redundancy that can arise from genome-wide random sampling, particularly the presence of highly similar upstream and downstream regions across nearby TSSs, we adopted a chromosome-based masking strategy, in which all TSSs on two randomly selected chromosomes (Chr8 and Chr10) were reserved exclusively for testing. This approach helps to prevent potential data leakage and artificially inflated test performance that might result from overlapping genomic contexts between training and test sets. Furthermore, this design demonstrates that transcription start site regulation lacks chromosome-scale specificity, thereby reinforcing the generalizability of our model across the genome.

Motif consensus network

The motif consensus network consists of two convolution layers. Considering the average motif length and common motif-mining window sizes, the first convolution layer uses a kernel size of 51 to sample sequence features (Supplemtnary Fig. 2, Consensus network Stage 1). To preserve the raw features extracted by this layer, we apply only a Sigmoid activation function and omit batch normalization. The second convolution layer uses a kernel size of 601 to enhance feature extraction depth. After this layer, we apply batch normalization followed by a Softplus activation function to accelerate model convergence. The supplementary effect consensus network employs a two-branch, two-layer architecture (Supplemtnary Fig. 2, Consensus network Stage 2). The first branch incorporates the distilled motif consensus patterns and their corresponding feature curves. During training, this branch’s convolution weights remain fixed; only the bias terms are updated. The second branch mirrors the motif consensus network structure with two convolution layers, but uses smaller kernels: size 3 in the feature extraction layer and size 15 in the feature encoding layer. All weights in this branch are trainable. We use Kullback–Leibler divergence as the primary loss function and include a PseudoPoissonKL regularization term with a scaling factor of 1 × 10⁻³ to balance trend extraction and abundance fidelity [27]. To improve the visual clarity of the learned feature curves, we apply L2 regularization to the weights of the second convolution layer in each consensus network with a scale factor of 2 × 10⁻³. To prevent attention bias in downstream prediction, we apply L1 regularization to all trainable convolution weights: a scale factor of 4 × 10⁻⁵ for the first encoding layer and 5 × 10⁻⁵ for the second convolution layer.

$$begin{array}{c}KL Loss= sum_{i}{target}_{i}^{*}*left[text{ln}left({target}_{i}^{*}+ varepsilon right)-text{ln}left({pred}_{i}^{*}+varepsilon right)right] where {x}_{i}^{*}=frac{{x}_{i}}{sum_{i}{x}_{i}}end{array}$$

(2)

$$begin{array}{c}PseudoPoissonKL=sum_{i}{target}_{i}*text{ln}[frac{{target}_{i}+varepsilon }{{pred}_{i}+varepsilon }]+{target}_{i}-{pred}_{i}end{array}$$

(3)

Prediction network

The prediction network builds on the consensus networks to focus on accurate forecasting of transcription initiation signals. Its first layer contains two convolution modules whose weights are initialized from the motif consensus network and the supplementary effect consensus network, respectively, via knowledge distillation. During model training, the weight parameters of these two convolution modules remain fixed; only their bias terms are updated to allow fine adjustments. Following these fixed convolution layers, the network employs a U-net-style encoder-decoder architecture in which pooling layers are replaced by strided convolutions and nearest-neighbor upsampling is replaced by linear interpolation. Skip connections link corresponding encoder and decoder layers, facilitating the fusion of features at different scales. Within each encoder block, a feature extraction module first applies a multi-head efficient channel attention mechanism to reweight channel features, and then uses a bottleneck-style feedforward structure. The bottleneck feedforward structure expands and compresses feature dimensions through linear layers and integrates a convolutional hidden layer. Outputs from the attention mechanism and the feedforward structure are merged via residual connections to ensure stable gradient flow and capture both local and global sequence features. Two such U-net modules are cascaded, and a SoftPlus activation function at the end produces the predicted transcription initiation signal. During training, the model uses an L1-PseudoPoissonKL loss function to balance sparsity and count distribution realism. A cosine annealing learning rate schedule accelerates model fitting, prevents convergence to local minima, and enhances generalization performance.

$$begin{array}{c}L1 PseudoPoissonKL=sum_{i}{target}_{i}*text{ln}[frac{{target}_{i}+varepsilon }{{pred}_{i}+varepsilon }]+left|{target}_{i}-{pred}_{i}right|end{array}$$

(4)

Multi-head efficient channel attention (MECA)

The prediction network relies on motifs and supplementary sequence patterns distilled by the consensus networks, so its performance critically depends on accurately weighting these distinct feature channels. Traditional multi-head attention mechanisms emphasize positional relationships and are well suited to natural language, where tokens carry semantic roles. However, DNA sequences lack such clear “word” boundaries, and positional attention can introduce noise rather than clarity. To address this, we designed a channel-focused attention mechanism called Multi-head Efficient Channel Attention (MECA). MECA operates entirely along the channel dimension, ignoring sequence position to avoid misassigned attention. First, an adaptive average-pooling layer condenses each channel’s feature map into a single descriptor, capturing its global importance with minimal added parameters. The pooled descriptors are then split into multiple heads, each representing a distinct group of channels. Within each head, a lightweight linear projection network computes attention scores for its channels, enabling the model to learn diverse patterns of channel importance. Finally, the original feature maps are rescaled by these per-channel attention weights, enhancing the representation of critical sequence patterns while suppressing irrelevant noise. This approach delivers efficient and accurate channel weighting, boosting overall transcription initiation prediction performance.

Artificial knowledge distillation

To identify robust sequence patterns across tissues and varieties, we applied an artificial knowledge distillation procedure based on convolutional kernel correlations. In the motif extraction stage, fifteen independent motif consensus models were trained separately on STRIPE-seq data from eight tissues of both Williams 82 and PI468916. For each tissue and variety, we computed pairwise Pearson correlations among all convolutional kernels in the feature-extraction layer. These correlations formed the edges of a connectivity graph, in which we retained only edges with correlation coefficients above 0.9. We then identified connected components whose size exceeded 30% of the number of trained models. Within each such component, the kernel (node) exhibiting the highest degree of connectivity was selected as its representative sequence pattern. After processing all tissues independently, we merged the resulting sequence patterns from both varieties and applied the same connectivity-based clustering procedure to the combined set. Following visualization of these merged patterns, we manually filtered Out redundant or overly similar motifs. This workflow yielded a final, non-redundant collection of 27 distinct sequence features for downstream analysis.

Ablation study

Most plant core-promoter motifs reside upstream of their cognate TSS, whereas downstream elements are rarer and typically lie within a few hundred base pairs. We therefore kept the downstream window short (0 to + 500 bp) so that any motif appearing in this region can be confidently assigned to the focal TSS. Upstream, the situation is more complex: when neighbouring TSSs lie close together, a motif located far upstream may ambiguously “belong” to more than one site. Our initial 3500 bp upstream window maximised coverage of possible cis-regulatory elements, but it also increased the risk of such ambiguity. To ensure that each motif we score is very likely to regulate the target TSS, and not an adjacent one, we have now restricted the effective cis window to − 2000 bp/+ 500 bp for all quantitative motif-effect calculations. The extra flanking sequence up to − 3500 bp (together with the 325 bp buffers at both ends) is retained only to provide consistent sequence context for the convolutional filters and is excluded from gradient updates, thereby preventing mis-assignment while preserving model stability.

Ablation experiments were performed to dissect the functional contributions of individual sequence patterns encoded in GenoRetriever’s convolutional layers. In this model, each convolutional kernel in the encoding network represents a distinct sequence motif or initiator element and uses a non-negative Sigmoid activation. By setting the post-activation output of a given kernel to zero, we effectively silence that sequence pattern. The effect of this perturbation is quantified by comparing the predicted transcription initiation signal before and after silencing. We carried out three scales of ablation analysis: (i) We simultaneously silenced all kernels corresponding to either motif patterns or Inr elements. By measuring the overall change in predicted TSS signal, we assessed the combined influence of each pattern class on transcription initiation. (ii) Each encoding kernel was silenced one at a time. For each kernel, we calculated the change in predicted TSS abundance and its positional effect across a defined window around the TSS. This yielded individual abundance-effect and position-effect profiles for every motif and Inr pattern. (iii) We simulated point mutations at each nucleotide position in the input sequence to identify critical variant sites. For every possible single-base substitution, we compared the change in predicted TSS signal. Variants causing the largest prediction shifts were flagged as key drivers of motif reconfiguration and transcriptional initiation changes, providing a robust approach for evolutionary analysis and regulatory target discovery.

Motif effect scoring

To quantify the impact of each motif on transcription initiation, we computed two complementary scores based on the ablation experiment outputs. For each transcription start site (TSS), we obtained two signal tensors: one predicted before motif silencing and one predicted after silencing. We extracted a window of one thousand base pairs centered on the dominant position of the TSS (five hundred base pairs upstream and downstream). By subtracting the post-silencing tensor from the pre-silencing tensor at each position, we generated an effect curve whose values represent the magnitude of abundance change induced by the motif at single-base resolution. To capture both positional shifts and overall abundance changes in a single metric, we converted the same pair of 1-kb signal windows into the frequency domain using a discrete Fourier transform. In frequency space, we computed the pointwise squared difference between the two spectra and summed across all frequency components. The resulting scalar score reflects the combined positional and abundance effects of motif removal on transcription initiation.

$$begin{array}{c}Score={Vert {F}_{1}left(kright)-{F}_{2}left(kright)Vert }_{2} where Fleft(omega right)={int }_{-infty }^{infty }fleft(tright){e}^{iomega t}domega #left(4right)end{array}$$

where f(t) is the time-domain signal (the effect curve), F(ω) is its frequency-domain representation, ω is the angular frequency, and iii is the imaginary unit.

In silico degradation experiment and critical variation identification

To investigate how sequence variants affect transcription initiation signals between soybean varieties, we performed an in silico degradation experiment. For each promoter, we collected all variants including single-nucleotide polymorphisms and small indels which located within 2-kb upstream and 500 bp downstream of the transcription start site. We then ordered these variants in reverse evolutionary sequence and, one by one, reverted each derived allele back to the ancestral allele in silico. After each reversion, we recalculated the contribution score of every motif to the predicted TSS signal, using the metric defined in formula (4). A variant was designated as a critical variation if its reversion caused the promoter’s dominant motif (the motif with the highest contribution score) to switch from its original major type to a different motif type. This procedure yielded a set of critical variations that drive shifts in motif hierarchy and thus in transcription initiation regulation.

T2T genome assembly and annotation

High-molecular-weight DNA was extracted using a CTAB-based protocol. For long-read sequencing, PacBio HiFi libraries were prepared and run on the Sequel IIe platform, while ultra-long read libraries for Oxford Nanopore Technologies were generated with the Ultra-Long DNA Sequencing kit and sequenced on PromethION. Hi-C libraries were constructed following Belton et al. [28] and sequenced on a BGI DNBSEQ-T7 system. Short-read libraries were made with the NEBNext Ultra II DNA Library Prep Kit and sequenced on an Illumina platform. The telomere-to-telomere assembly was produced by integrating PacBio HiFi and ONT reads using hifiasm (v0.19.8-r603) and NextDenovo (v2.5.2) [29, 30]. Contigs were screened against the NT database to remove plastid and ribosomal DNA sequences. Chromosome-level scaffolding used Hi-C data processed with Juicer (v1.6) [31], refined through the 3D-DNA pipeline (v180922) [32], and manually corrected in Juicebox (v1.9.8) (https://github.com/aidenlab/Juicebox). Assembly accuracy was confirmed by realigning HiFi and ONT reads and assessing completeness with BUSCO (v4.1.4) and Merqury (v1.3) [33, 34]. Transposable elements were annotated using EDTA (v2.0.0) [35], which integrates LTR_FINDER, TIR-Learner, and MITE-Hunter. Gene models were predicted through a combination of transcript evidence, homology, and ab initio methods. RNA-seq data from six tissues were aligned to the assembly with HISAT2 (v2.2.1) and assembled into transcripts using StringTie (v2.2.1) [25, 36]. These models were refined by PASA pipeline (v2.5.3) and Braker3 (3.0.6) [37, 38], yielding a high-confidence gene annotation.

Site-directed mutagenesis

Point mutations were introduced into recombinant plasmids by overlap extension PCR. Primers were designed with 15–20-bp overlaps flanking the mutation site. Each 10 µL PCR contained 50 ng template DNA, 0.5 µM of each primer, and 2 × Phanta Flash Master Mix. Cycling conditions were 95 °C for 2 min; 30 cycles of 95 °C for 15 s, annealing from 52 °C (increasing by 0.4 °C per cycle) for 30 s, and 72 °C for 2 min; followed by a final extension at 72 °C for 5 min. PCR products were treated with DpnI at 37 °C for 1 h to remove template plasmid and then transformed into E. coli DH5α. Successful mutants were confirmed by Sanger sequencing. Primer sequences are listed in Additional file 2: Table S11.

Promoter cloning and dual luciferase reporter assay

For dual luciferase assays, the regions upstream of the transcription start sites of Glyma.01G042700 (970 bp), Glyma.13G202000 (665 bp), and Glyma.14G008300 (992 bp) were amplified from Williams 82 genomic DNA. Each fragment was cloned into the pGreenII 0800 LUC vector upstream of the firefly luciferase gene using KpnI (R3142, NEB) and BamHI (R3136M, NEB) restriction sites. Point mutations were introduced by overlap extension PCR. Primers were designed with 15 to 20 bp overlaps flanking each mutation site. Each 10 µL PCR contained 50 ng of recombinant plasmid DNA, 0.5 µM of each primer, and 2 × Phanta Flash Master Mix (P510, Vazyme). Cycling conditions were 95 °C for 2 min; 30 cycles of 95 °C for 15 s, annealing from 52 °C increasing by 0.4 °C per cycle for 30 s, and 72 °C for 2 min; final extension at 72 °C for 5 min. PCR products were treated with 1 µL DpnI (R0176V, NEB) at 37 °C for 1 h to digest parental DNA, then transformed into Escherichia coliDH5α (CC96102, TOLOBIO) competent cells. Mutations were confirmed by Sanger sequencing. Wild-type and mutant constructs were transformed into Agrobacterium tumefaciens strain GV3101. Agrobacterium cultures were grown to OD₆₀₀ = 0.6, harvested by centrifugation, and resuspended in infiltration buffer (10 mM MES pH 5.6, 10 mM MgCl₂, 100 µM acetosyringone). After incubating at room temperature for 2–3 h, suspensions were infiltrated into the abaxial surface of Nicotiana benthamiana leaves. Plants were incubated for 48–72 h under standard growth conditions. Firefly and Renilla luciferase activities were measured using a Tecan Spark microplate reader. Primer sequences are listed in Additional file 2: Table S11.

Kowck down plasmid construction and soybean hairy-root transformation

For RNA interference of the GmYY1 (Glyma.08G126300 and Glyma.05G167900), GmABF1 (Glyma.02G131700 and Glyma.07G213100), GmDREB (Glyma.09G147200, Glyma.10G239400, Glyma.16G199000 and Glyma.20G155100), GmHY5 (Glyma.18G117100 and Glyma.08G302500), and GmTCP (Glyma.16G053900 and Glyma.19G095300) gene families, conserved regions of 200 to 300 bp were amplified from Williams 82 genomic DNA and cloned into the pZHY930 vector using KpnI (R3142, NEB) and SmaI (R0141, NEB) restriction sites. Recombinant plasmids were validated by Sanger sequencing to confirm insert integrity and orientation. Final constructs were transformed into Agrobacterium rhizogenesstrain K599 for soybean hairy‑root transformation. Transgenic roots were identified under a handheld lamp (3415RG, LUYOR), and knockdown efficiency was validated by RT‑PCR with gene‑specific primers. K599 carrying the desired constructs was used to generate soybean hairy roots following the protocol of Wang al. (2014). Composite seedlings were transferred to pots (13 × 10 × 8.5 cm) filled with vermiculite and grown under a 16 h light/8 h dark cycle at 25 °C and 50% relative humidity. Ten days after transfer, transformed roots were identified using a handheld lamp (3415RG, LUYOR). Transgenic roots were excised, immediately frozen in liquid nitrogen and stored at − 80 °C until further analysis.

CRISPR/Cas9 knockout vector construction

The CRISPR/Cas9 multiplex gene knockout vector pGES704 was employed to target the upstream open reading frames (uORFs) in the promoter regions of four soybean genes: GmbZIP123 (Glyma.06G010200), GmbZIP124a (Glyma.12G040600), GmbZIP124b (Glyma.11G114800), and GmbZIP125 (Glyma.04G010300). The pGES704 vector was derived from the base plasmid pGES401 [39], in which the selection marker gene bar was replaced with CP4 EPSPS. Six single-guide RNAs (sgRNAs) were designed in tandem to target these four genes (specific sequences and detailed information provided in Additional file 2: Table S11). Notably, some sgRNAs were engineered to simultaneously target 2–4 genes. The sgRNA-containing fragments were assembled with the vector using the NEBridge® Golden Gate Assembly Kit (BsaI-HF® v2), and successful plasmid construction was verified by Sanger sequencing.

Transgenic plant generation and genotyping analysis

The validated knockout vectors were transformed into Agrobacterium tumefaciens strain GV3101 for stable soybean transformation via the cotyledonarynode transformation method as described [40]. Genomic DNA was extracted from T₂ generation plants using the cetyltrimethylammonium bromide (CTAB) method, followed by PCR amplification with Taq Plus Master Mix II (P213, Vazyme). Target-specific primers for genotyping are listed in Additional file 2: Table S11. Mutation patterns of T₂ generation plants were identified through Sanger sequencing.

Quantitative RT-PCR analysis

Total RNA was isolated from leaves of T₂ plants at the same developmental stage using the FastPure Universal Plant Total RNA Isolation Kit (RC411-01, Vazyme). Reverse transcription was performed with HiScript III RT SuperMix for qPCR (R323, Vazyme). Quantitative PCR was conducted using ChamQ Universal SYBR qPCR Master Mix (Q711, Vazyme) on a real-time PCR system (qTOWER3G, analytikjena, Germany). Three biological replicates, each with three technical replicates, were conducted for each sample. Gene expression levels were calculated using the 2^−ΔΔCt method [41], with qPCR primers detailed in Additional file 2: Table S11.

Deciphering the sequence basis and application of transcriptional initiation regulation in plant genomes through deep learning | Genome Biology