Telomere-to-telomere genome assembly uncovers Wolbachia-driven recurrent male bottleneck effect and selection in a sawfly

Assembly and annotation of the Analcellicampa danfengensis genome

Initially, we generated approximately 14.69 GB of Pacific Biosciences (PacBio) HiFi reads ( ~ 68.89×), 54.80 GB of Oxford Nanopore (ONT) long reads ( ~ 256.99×), 47.21 GB of Hi-C reads ( ~ 221.39×) and 34.54 GB illumina paired-end reads ( ~ 161.98×) from pools of five adult females (Supplementary Data 1). Heterozygous k-mer pair coverage distributions from Smudgeplot18 revealed signals of diploidy (Supplementary Fig. 1a), consistent with previous studies19. The genome size of AD was estimated to be 213.24 Mb using GenomeScope20 (Supplementary Fig. 1b), close to the previously reported genome size of another sawfly species, Orussidae abietinus, at 201 Mb19.

We utilized various cutting-edge assembly software tools to construct the genome assembly. By almost every metric, Flye21, using ONT reads, produced the best draft genome assembly, showing the most contiguous (N50) and high per-base quality among all assemblies, as indicated by its superior Benchmarking Universal Single-Copy Orthologs (BUSCO)22 score (Supplementary Data 2). In contrast, LJA23 produced the lowest-quality assembly across all metrics, while Hifiasm24 in HiFi + ONT mode generated an intermediate-quality assembly. Therefore, we selected the Flye assembly for further scaffolding with Hi-C data and gap closure.

We used Hi-C data for scaffolding and then extended both ends of the scaffolds using HiFi reads to further elongate and link associated scaffolds (see Materials and Methods). This approach ultimately scaffolded the genome into 10 pseudo-chromosomes (Fig. 1a, b), during which we also manually closed 77 gaps using the sequence-extension method. The final genome assembly was 211.14 Mb, featuring a contig N50 size of 22.39 Mb and a GC content of 41.99% (Supplementary Data 2). The Hi-C contact map showed stronger interaction intensity along the diagonal compared to non-diagonal regions, with no significant noise outside the diagonal, suggesting a high-quality chromosome assembly (Fig. 1a). The largest scaffold reached 57.04 Mb (Chr1), while the smallest was 13.31 Mb (Chr6). Seven scaffolds achieved telomere-to-telomere (T2T) assembly, though two scaffolds each still contained one unresolved gap each (Chr4:10537477-10537560 and Chr9:7931267-7931340) (Fig. 1a). Assembly completeness was evaluated using the insecta-specific and hymenoptera-specific BUSCO database, showing 99.85% and 97.91% completeness, respectively, indicating robust genome integrity (Fig. 1c and Supplementary Data 2). Quality-control mapping of Illumina reads to the final assembly yielded an exceptionally high mapping rate of 99.83%. HiFi reads achieved a mapping rate of 99.93%. ONT reads achieved 98.83%, and RNA-seq data showed a mapping rate of 91.76%. These results further indicate the high quality of our genome assembly. Comparative assessments with AD and Orussidae abietinus19 genomes showed significant enhancements in the AD assembly, with a contig N50 approximately 17 times that of Orussidae abietinus and almost all gaps from previous versions filled (Supplementary Fig. 2). In summary, the assembly resulted in seven gap-free chromosomes, establishing AD as a high-quality chromosomal-level reference genome.

Fig. 1: Assembly and key characterization of the A.danfengensis T2T genome, including centromeric characterization, symbiotic Wolbachia and mitochondrial DNA.

a Genome-wide Hi-C contact matrix of A.danfengensis. Red color intensity in the heatmap shows frequency of interaction between two loci at 25 kb resolution. b Distribution of genomic features of the A.danfengensis genome. Tracks are aggregated in 100-kb bins as follows: I, gene density; II, TE density; III, TR density; IV, GC content; V, SNPs density; VI, the 9 chromosomes (chr1–chr9). c BUSCO completeness scores for the host genome using insecta and hymenoptera databases, and for Wolbachia genome using rickettsiales database. d Features of the three largest centromeres: Chr1 (~4 Mb); Chr3 (~4 Mb); Chr8 (~3.5 Mb). e Circular map of the wAnd genome. The outermost tracks 1 and 2 represent the positions of the CDS, tRNA, rRNA, and tmRNA genes on the positive and negative strands. Tracks 3 and 4 tracks show the GC content and GC skew, respectively. f Features of the mitochondrial genome.

Moreover, we examined whether telomeres and centromeres were present in our assembled genome. The results showed that telomeric regions could be detected on both ends of nine chromosomes. The analysis predicted nine centromeric regions and identified 18 telomeres by recognizing the five-base telomeric repeat (CCTAA/TTAGG)25. Centromeric and telomeric regions are detailed in Supplementary Data 3 and 4, respectively. The approximate locations of the nine centromeric regions was estimated based on repeat density and the Hi-C interaction heatmap. We observed that most centromeres were positioned near to the middle parts of the chromosomes, with a few located at the chromosomes’ terminals, including Chr5 and Chr6 (Fig. 1a, b and d, and Supplementary Data 4). The centromeric regions ranged in size from 124.98 kb to 4.54 Mb, with an average length of 1.96 Mb (Supplementary Data 4). The shortest centromeres are the telocentric, found on Chr5 and Chr6. Notably, “blank regions” observed in the Hi-C interaction map (Fig. 1a) corresponded with a dense accumulation of tandem repeats, especially in the centromeric regions (Fig. 1b, d), which is consistent with other T2T genome assemblies26,27. Centromeres vary greatly in their sequence organization among species28. Additionally, centromeric regions exhibit several notable common features. In AD, the centromeric regions are primarily composed of tandem repeats, accompanied by a reduction in GC content, a decrease in transposable element (TE) content, and a significantly lower gene density (Fig. 1b). These distinct characteristics, compared to other genomic regions, further support the identification of centromeres.

A total of 9569 protein-coding genes were predicted from the chromosomal-level genome of Analcellicampa. While this number is lower than that found in some model insects like Drosophila, it is comparable to gene counts reported for other sawfly species (e.g., Orussus abietinus19 and falls within the range observed across basal Hymenoptera. Of these, 4716 genes were assigned Gene Ontology (GO) terms, 4282 genes were assigned Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and 6787 genes contained Pfam domains. Additionally, 6478 genes had BLASTN matches in the Swiss-Prot protein database. Repeat sequences constituted 32.42% of the AD genome, of which retrotransposons accounted for 12.29%, DNA transposons for 5.73%, and unclassified elements for 12.59%.

Assembly and annotation of the endosymbiotic Wolbachia in sawfly

We generated a circular contig in our assembly (Supplementary Fig. 3), and sequence homology analysis confirmed it was Wolbachia. The circular structure indicates the completeness of the assembled sequence. The Wolbachia strain was detected in AD, thus we named it wAnd. The wAnd genome is 1,236,102 bp in size and has a GC content of 35.14% (Fig. 1e and Supplementary Data 5). Using Prokka software29, we annotated the genome, identifying a total of 1166 genes were identified, 3 rRNA genes, 34 tRNA genes, and 1 tmRNA (Fig. 1e). The BUSCO completeness scores of the wAnd genome indicated that it contained 361 complete and single-copy BUSCO groups, 1 complete and duplicated BUSCO groups, 0 fragmented BUSCO groups, and 2 missing BUSCO groups (Fig. 1c). The BUSCO score of 99.5% for the wAnd genome was comparable to that of other circular, chromosome-level Wolbachia genomes in supergroup A30 (Supplementary Data 6). The genomic features of wAnd, including coding sequences (CDS), tRNA, rRNA, and others, were visualized using Proksee31 (Fig. 1e). The genome exhibits the GC skew pattern typical of Wolbachia genomes32,33. The presence of an irregular GC skew in wAnd masks the identification of a specific origin of replication, indicating potential frequent genomic rearrangements34. In summary, we discovered Wolbachia parasitism in the sawfly and successfully assembled a high-quality Wolbachia genome.

We assembled a complete mitochondrial genome of 199,511 bp, encompassing all 13 protein-coding genes, 22 tRNA genes, 2 rRNA genes, and a control region (Fig. 1f and Supplementary Data 7). The control region is situated between trnQ and trnI and measures 4064 bp in length.

Resequencing data collection and high-quality SNPs construction

In this study, we collected 89 high-quality whole-genome resequencing samples (including males and females where available, see Supplementary Data 8) from across China (Fig. 2a). Specifically, we obtained 30 AD, 26 A.xanthosoma (AX), 7 A.acutiserrula (AA), 13 A.maculidorsatus (AM), 4 A.wui (AW) and 9 A.emei (AE) for whole-genome sequencing and analysis (Supplementary Data 8). We used two species Monocellicampa pruni and Monocellicampa yangae in closely related genus, as an outgroup. All 91 genomes were aligned against the AD reference genome. The lowest alignment rate was observed in the outgroup at 69.67%, while the average mapping rate for the Analcellicampa species was 89.07% (see Supplementary Data 8 for alignment rate per species). Among the Analcellicampa species, the lowest alignment rate was observed in AE at 78.87%, while the highest alignment rates were observed in AD at 99.90%, followed by AX, with rates of 99.90% and 93.65%, respectively (Supplementary Data 8). A total of 20,292,172 putative bi-allelic single-nucleotide polymorphisms (SNPs) were identified and passed the filtering criteria across the 91 genomes for subsequent analyses, of which 11.83 million (39.62%) were intergenic, 6.83 million (22.86%) were intronic and 1.53 million (5.13%) were exonic (Supplementary Data 9). We observed the highest number of SNPs per individual in the AE, with 3.39 million SNPs per individual, corresponding to ~24.51% more SNPs than the average of 2.73 million (Supplementary Data 8). To further assess the quality of the genetic variants, we calculated the transition-to-transversion (Ts/Tv) ratio, an indicator of potential sequencing error35. In our study, the Ts/Tv ratio for global populations SNPs was found to be 1.99, close to 2, indicating that the genetic variants identified are of high quality and their distribution is relatively balanced.

Fig. 2: Population structure, phylogeny, and demographic history of Analcellicampa.
figure 2

a Sampling sites. The bar charts indicate the species and number of samples collected at each sampling location. Colors correspond to those in (d). b Maximum-likelihood tree depicting the evolutionary relationships among genus Analcellicampa and outgroup. c ADMIXTURE analysis with K = 3, 7 and 8. Colors in each column represent ancestry proportion. d Nucleotide diversity (π), nucleotide differences (dxy) across the six species. The value in each circle represents a measure of nucleotide diversity for each species; values in red on each line indicate pairwise population nucleotide differences between species. e Patterns of LD (linkage disequilibrium) decay across the genome in different geographic populations. r2, Pearson’s correlation coefficient. f Genomic similarity of six species of Analcellicampa to the AD reference genome. Chromosomes are indicated by different colors along the left y axis. Identical score (IS) values are shown for SNPs within each 50-kb window across the genome. g Estimated split times between each species after 50 bootstraps. The widths indicate probability densities. h Dynamic of Ne inferred by PSMC. The vertical lines indicate divergence times corresponding to those shown in (g). The marked pentagrams indicate species fully infected with Wolbachia.

Population characterization and evolutionary history of sawfly

To decipher genetic relationships among Analcellicampa sawfly populations, we constructed a Maximum Likelihood (ML) tree and performed principal component analysis (PCA) on the above 91 individuals using host nuclear genome SNPs (hereafter referred to as autosomal SNPs, in contrast to symbiont or mitochondrial SNPs). Our phylogenetic and genetic clustering analyses resolved six genetic groups (Fig. 2b and Supplementary Fig. 4), which corresponded to species classifications based on morphological characteristics (Supplementary Fig. 5 and 6). Using Monocellicampa species as outgroup, the phylogenetic analysis revealed that individuals from the same population clustered together, with AE and AM forming a clade distinct from the other genetic lineages in the ML tree (Fig. 2b). Higher π values observed in AE and AM (with the exception of AX) reflected their rich genetic diversity (Fig. 2d), potentially indicating that species from South and East China have more ancient origins compared to other populations (Fig. 2a). Species from the northern and western regions clustered together, with each species forming stable genetic clusters (Fig. 2a, b). Population structure inferred by ADMIXTURE36 also aligned with the six genetic groups when the optimal K = 7, with AD further subdivided into two subgroups corresponding to their geographical distribution (Fig. 2c and Supplementary Fig. 7). Additionally, when K values ranged from 2 to 6, the ancestry component of AE and AM accounted for the largest proportion in the outgroup (Supplementary Fig. 7), supporting the previous finding suggested by π values that populations from South and East China have more ancient origins compared to other populations.

Notably, AX and AD formed a sister clade (Fig. 2b), with their dxy value being the lowest among all populations (0.017, Fig. 2d). The geographic ranges of the two species overlapped, though AX had a broader distribution, with a few individuals still found in Hunan and eastern regions (Fig. 2a and Supplementary Data 7). AX exhibited the second-highest nucleotide diversity (π  =  7.61 × 10−3), following AM (π  =  8.97 × 10−3), while AD and AW showed the lowest nucleotide diversity (Fig. 2d). Linkage disequilibrium (LD) decay analysis showed slower decline in LD for AW and AD (Fig. 2e), suggesting smaller population sizes and further supporting their lower nucleotide diversity. Identity Score (IS) analysis demonstrated a high level of genomic similarity between AX and AD (Fig. 2f). The dxy between AD and other species was approximately 0.03, with genomic IS reaching at least 0.6, indicating a close genetic similarity among populations and providing a solid foundation for our subsequent population genetic analyses using AD as the reference genome.

We also examined possible gene flow between species using TreeMix (Supplementary Figs. 8 and 9), which calculates a phylogeny of populations based on shared drift and tests whether migration edges (i.e., introgression) improve the model fit. The overall topology of the tree was consistent with that of the ML tree (Fig. 2b), confirming the reliability of this tree’s topological structure. Likelihood improvements declined after four migration edges (Supplementary Fig. 8), so we presented results using this value (Supplementary Fig. 9), with 100 bootstraps performed. At four migration edges, TreeMix identified evidence for low levels of gene flow (migration weight < 0.05) from AA into AX, AE into AD, and a higher level of gene flow ( ~ 0.2) from AD into outgroup (Supplementary Fig. 9).

Furthermore, we explored the demographic history of the sawflies by using site-frequency spectrum (SFS) via momi237 and inferred changes in effective population size (Ne) over time for each population with the pairwise sequential Markovian coalescent (PSMC) model38. Since low levels of gene flow have negligible impact on timing inference39,40 and our study primarily focused on changes in Ne across species, we tested two models—one assuming constant population size and another allowing for variable population size—to seek the best one with lowest AIC value (see Materials and Methods and Supplementary Fig. 10). Based on this best-fit model (Supplementary Fig. 10), we inferred that the most recent common ancestor of the genus Analcellicampa diverged approximately 857,445 years ago (95% CI, 844.17 to 870.72 Kya). The basal clade within sawflies, containing AE and AM, separated from other lineages 368,609 years ago (95% CI, 313.15 to 384.07 Kya). Within Analcellicampa, the most recent divergence occurred between AE and AM at 79,678 years ago (95% CI, 74.39 to 84.97 Kya), followed by AD and AX at 185,866 years ago (95% CI, 174.66 to 197.07 Kya) (Fig. 2g and Supplementary Data 10). In the optimal model, we allowed population size to vary from the Last Glacial Maximum to the present. All populations showed a decline in Ne, though AW exhibited the fastest decline, followed by AD. This result aligns with the previous LD decay analysis, where these two populations demonstrated the slowest LD decay (Fig. 2e).

The demographic history of the six species in our study was first inferred by analyzing the whole-genome sequence using Pairwise Sequentially Markovian Coalescent (PSMC) model38 (Fig. 2h). The inferred demographic histories of these six species spanned from approximately 10 million years ago (Mya) to 10,000 years ago (Kya). Since all species analyzed here are younger than 10 million years, the inferred demographic dynamics likely reflect an ancestral form with a potentially different geographic distribution (Fig. 2h). All sawfly populations experienced two population bottlenecks, one around 0.4 Mya (Bottleneck 1, B1) and another near 20 Kya (B2) (Fig. 2h). Correlating species divergence times with Ne dynamics, we observed that the formation of the Analcellicampa genus occurred approximately 857 Kya, coinciding with the early phase of B1 (Fig. 2g, h). The sawfly Analcellicampa, being monophagous with their larvae that exclusively parasitize cherry fruits (Cerasus spp.), likely experienced a founder effect at this bottleneck stage. Following B1, Analcellicampa underwent rapid radiation, accompanied by an increase in Ne. Over a span of approximately 300,000 years, six distinct species emerged, among which AD experienced a notable Ne expansion post-divergence. During the Last Glacial Maximum (LGM, ~ 20 Kya), all populations experienced a decline in Ne, leading to the second bottleneck (B2).

To sum up, our results provided a comprehensive overview of the evolutionary history of Analcellicampa. The earliest emergence of the genus was around 857 Kya, originating in the southwestern region and subsequently spreading northeastward, resulting in six distinct Analcellicampa species in China. The most widely distributed species was AX, which diverged from its sibling species AD approximately 185 Kya. We did not observe any introgression events from AD, the species for which we constructed the reference genome, into other sawfly species. Overall, despite ~857 thousand years of divergence, these populations maintain relatively high similarity, with IS values above 0.6.

Discordance between phylogenetic relationships of the sawfly and its symbiotic Wolbachia

The transmission modes of Wolbachia play a crucial role in its spread and coevolution with hosts41,42. Our comprehensive genomic dataset allowed us to explore Wolbachia evolutionary dynamics within host populations, particularly to assess the extent of intraspecific horizontal transmission. Through our Wolbachia detection pipeline (See “Materials and Methods”), we observed infections in four sawfly species: AW, AM, AX and AD (Supplementary Data 8). AW and AD were fully infected, while AM and AX showed partial infection, with AM having an infection rate of 3/13 among sampled individuals and AX exhibiting an infection rate of 1/26. Maximum likelihood phylogenetic trees of these Wolbachia strains classified them into three genetic groups: the first group parasitized only AW; the second included two individuals from AM; and the third encompassed all AD, one individual from AM, and the single Wolbachia strain found in AX (Fig. 3). The Wolbachia strains found in AW form a distinct early-diverging lineage in the phylogenetic tree, suggesting a long independent evolutionary history of this host-symbiont partnership (Fig. 3). The Wolbachia strains from AM were resolved into two distinct clades, while those from AD formed several closely related terminal clades with a lower genetic diversity (Type 1: 4.95 × 10-3 vs. Type 3: 2.05 × 10-4) and relatively fast LD decay (Supplementary Fig. 11). Two Wolbachia strains from AM formed a sister group relationship with the AD-associated strains, suggesting a shared evolutionary history between these infections. The single Wolbachia strain detected in AX was nested within the AD-associated clade, with both species sharing overlapping habitats. This relationship suggests a potential recent horizontal transfer from AD to AX, likely facilitated by ecological factors such as shared habitats and host density, aligning with the broader understanding that environmental context plays a critical role in host-symbiont evolution43.

Fig. 3: Phylogenetic relationships of Wolbachia-infected host and their Wolbachia.
figure 3

a Phylogeny of host. b Phylogeny of Wolbachia. The numbers at the major branches are bootstrap values.

When comparing host phylogeny with that of symbiotic Wolbachia, we found that the overall topologies of the host and Wolbachia trees were not fully congruent. However, some clusters maintained congruent phylogenetic relationships. For example, the phylogenetic patterns between AW and its associated Wolbachia showed strong congruence, suggesting stable host-symbiont associations likely due to long-term coevolution. The distribution of Wolbachia in AD showed partial congruence with host phylogeny, with both host and symbiont phylogenies revealing two main clusters that correspond to geographical distribution. However, within these clusters, the detailed phylogenetic relationships between hosts and their Wolbachia were discordant; For instance, while A.danfengensis_MC0301 and A.danfengensis_MC0303 formed sister clades in the host phylogeny (Fig. 3), their associated Wolbachia strains showed greater phylogenetic divergence–this pattern was commonly observed within AD. These phylogenetic patterns are consistent with multiple modes of Wolbachia acquisition described in previous studies11, including vertical inheritance during host speciation, hybridization and gene flow between closely related hosts, and horizontal transmission events. In our study, these phenomena also reflect the existence of significant hybridization and horizontal transfer events.

Divergent demographic signals in male and female genomes reflect a Wolbachia-driven male bottleneck

PSMC38 was used to infer historical changes in effective population size (Ne). The most dramatic demographic changes occurred between 20 and 200 Kya (Fig. 2h), although the timing of population expansions and contractions varied among species. Interestingly, inferred Ne trajectories differed dramatically between male and female AD during this period (Fig. 4). To confirm this result, we inferred the demographic history of AD from three geographically distinct populations and performed 100 bootstrap replicates. The results remained robust, consistently showing significant differences in Ne between males and females during the 20–200 Kya period. Specifically, female Ne increased sharply around 200 Kya, peaking around 70–80 Kya, where Ne inferred from female genomes was 13 times larger than the apparent Ne inferred from male genomes (13 × 105 vs. 1 × 105), before declining during the LGM to approximately 1 × 105 (Fig. 4a). In contrast, the male Ne remained relatively stable throughout this period, with a slight decline during the LGM.

Fig. 4: Contrasting demographic signals inferred from male and female genomes over time, estimated using PSMC and SMC++.
figure 4

a Ne dynamics of females and males in three geographically distinct AD. b, c Ne dynamic of females and males in AX and AM, with Wolbachia-uninfected individuals used for the analysis.

PSMC analyses based on re-sequenced data (Fig. 4) enabled us to compare the demographic histories across the different populations. We included AX and AM as comparison groups (Fig. 4b, c), as these populations contained predominantly uninfected individuals with only a small proportion showing recent Wolbachia infection. Using data from uninfected individuals, our analyses of these uninfected populations revealed nearly identical Ne trajectories inferred from both male and female genomes (Fig. 4b, c). This demonstrates that our analytical approach does not inherently create a sex-biased signal, and the stark differences observed in infected populations (Fig. 4a) are indeed a consequence of Wolbachia infection.

PSMC results become less reliable for recent timeframes (within the last 10,000 years) due to reduced inference power38. To address this limitation, we employed the SMC + + 44, which enable more precise inference of population history within the last 10,000 years. The demographic trajectories inferred by SMC + + prior to 10,000 years ago were largely consistent with the PSMC results, confirming an increased Ne of AD females around 70–80 Kya. However, SMC + + revealed that within the past 10,000 years, Ne inferred from female genomes remained stable, while the apparent Ne inferred from male genomes experienced a dramatic decline, decreasing from approximately 10,000 to fewer than 1000 individuals, a signal consistent with an intensifying male bottleneck. In contrast, analyses of uninfected populations (AX and AM) exhibited consistent demographic trajectories from both male and female genomes throughout their entire evolutionary history, despite differences in the range of Ne changes between the two species.

The momi2 results, which model the demography of the population as a whole, further supported the overall population declines observed in AD and AW. We tested two models: one assuming constant Ne and another allowing for variable Ne. The variable Ne model, yielding the lowest AIC, demonstrated that Wolbachia-infected AD and AW experienced the fastest declines in overall population Ne, and their current population sizes remain at low levels (Supplementary Fig. 10 and Supplementary Data 10).

In summary, our analyses reveal complex temporal changes in host population dynamics associated with Wolbachia infection. While previous studies have largely focused on Wolbachia’s male-killing effects and their role in causing sex ratio imbalances, our findings revealed a more complex picture of how these effects are reflected in the genome. In AD, Ne inferred from female genomes increased sharply around 70-200 Kya, followed by a decline during the LGM, whereas the apparent Ne inferred from male genomes showed a prolonged and significant decline (Fig. 4). This divergence in signals indicates that the observed bias is not solely a prolonged intensification of the male bottleneck but reflects complex host-symbiont interactions over time.

Genomic signatures of Wolbachia-induced changes in AD

Leveraging the unique system of sympatric sibling species—AD (fully infected) and AX (nearly uninfected)—with overlapping distributions and shared ancestry, we had a rare opportunity to explore the evolutionary consequences for the host genome resulting from Wolbachia’s modification of host reproduction. By focusing on these populations, we minimized confounding variables, such as environmental effects, which are known to influence Wolbachia’s impact on sex determination45,46. This system, with its genomic similarity and robust sample sizes, allowed us to investigate Wolbachia-induced genetic modifications in an ecologically controlled context.

Specifically, we analyzed whole-genome sequences of AD (n = 30) and AX (n = 26) to investigate Wolbachia-induced modifications in the host genome. To identify the genomic region that presents signatures of natural selection associated with Wolbachia-coevolution, we performed whole-genome genetic differentiation analysis by estimating the cross-population composite likelihood ratio (XP-CLR) and differences in nucleotide diversity (π log2-ratio Wolbachia_positive/Wolbachia_negative) between AD and AX in 10-kb windows with a 5-kb sliding window along the genome. Applying a 5% threshold for maximum haplotype frequency difference (XP-CLR) and nucleotide diversity (log2-π ratio), we identified a total of 3.94 Mb genomic regions that covered all outliers (Fig. 5a and Supplementary Fig. 12). These regions contained 287 annotated positively selected gene (PSG) candidates (Supplementary Data 11).

Fig. 5: Selective sweep and enrichment analysis of the Wolbachia-infected AD.
figure 5

a Distribution of the log2ADAX) and XP-CLR values were calculated in 10 kb sliding windows with 5 kb steps. The horizontal and vertical lines represent threshold lines of the top 5% of the XP-CLR and π ratio values, respectively. Points (red) located in the top right sector represent selective signatures in AD. Yellow and black bins in the histograms of XP-CLR (right) and π ratio (top) represent levels respectively higher and lower than the threshold line. b Significant GO enrichment of PSGs associated with Wolbachia-induced reproductive manipulations.

Gene Ontology (GO) analysis of the PSGs revealed significant enrichment in categories associated with reproductive processes, sensory and neural development, among others (Supplementary Data 13). Notably, the most highly enriched Gene Ontology (GO) terms were those associated with sensory system development (P = 3.0 × 10-4, Benjamini-Hochberg corrected) and compound eye development (P = 4.0 × 10-4) (Fig. 5b). Enrichment was also observed in reproductive development categories, such as germ-line stem cell population maintenance (P = 1.3 × 10-3), germ-line cyst formation (P = 1.4 × 10-2), male sex differentiation (P = 1.4 × 10-2) and sex differentiation (P = 3.8 × 10-2) (Fig. 5b). Additionally, neural development terms like synaptic signaling (2.7 × 10-3), synapse assembly (P = 9.9 × 10-4), synapse organization (P = 1.3 × 10-2), regulation of synapse organization (P = 1.2 × 10-2), regulation of synapse structure or activity (P = 6.9 × 10-3), synapse organization (P = 1.3 × 10-2), neuron fate commitment (P = 2.7 × 10-3) and learning or memory (P = 1.2 × 10-2) were significantly enriched (Fig. 5b). Notably, we also observed significant enrichment in the category of determination of adult lifespan (P = 4.4 × 10⁻2) (Fig. 5b), suggesting potential selective pressures related to longevity and its influence on population dynamics under Wolbachia infection.

We identified several reproduction-related genes potentially linked to Wolbachia’s sex-manipulation mechanisms, corresponding to the significant demographic differences observed between males and females in our study. Among the reproductive GO terms we identified, the following genes were included: mys, otk2, puc, Six4, alien, Cbl, Inx2, dally, Rap1, and Rtel1. Among these, Inx2 exhibited the strongest selection signal with XP-CLR = 1297.74 and log2(π ratio)=4.85, while puc showed the lowest signal with XP-CLR = 700.75 and log2(π ratio)=3.69 (Supplementary Data 11).

Notably, Six4, which encodes a transcription factor essential for embryonic gonad formation, has been implicated in sex determination pathways. Its mammalian homolog regulates Sry expression, a gene essential for male development, and the loss of Six4 function can result in the reversal of sex from male to female in mice47. This gene similarly contributes to Wolbachia-induced feminization of genetic males, as observed in some isopod (Crustacean) species and lepidopteran species5, where Wolbachia cause male embryos to develop as functional females. This process may underlie the increased female population numbers observed in our study.

Several of the above genes are associated with sperm formation, highlighting Wolbachia’s potential influence on male reproductive capabilities. For instance, Rtel1, a regulator of telomere elongation helicase 1, plays a role in male germline stem cells regulation48. asun is essential for spermatogenesis in Drosophila, with mutations leading to male sterility49. Dcst2 is required for sperm-egg fusion, with knockout resulting in impaired fertilization in zebrafish50. Additionally, genes such as Otk2, linked to male genitalia development51, and Inx2, which encodes an innexin protein critical for gap junction communication during oogenesis, may influence oocyte development and fecundity52, underscoring their importance in the development of the male reproductive system.

Beyond the genes enriched in the aforementioned GO terms, we identified three additional reproduction-related genes, kl-2 (XP-CLR  =  978.52, log2(π ratio)=4.03), 5-HT7 (XP-CLR  =  868.21, log2(π ratio)=4.88) and gb (XP-CLR  =  810.20, log2(π ratio)=4.71). kl-2 is a male fertility factor, is involved in flagella movements53. The remaining two genes are linked to mating behavior. 5-HT7, a serotonin receptor, regulates sexual receptivity in virgin female Drosophila melanogaster, and its absence significantly reduces the mating rate of female flies54. Meanwhile, gb (genderblind), encoding an amino acid transporter involved in glutamate secretion and synaptic transmission, was identified as a target of selection and is known to affect male courtship behavior in Drosophila55. Notably, a majority of these reproduction-related genes (8 out of 13) are located on Chr1 (Supplementary Data 11 and Supplementary Fig. 12), suggesting potential chromosomal clustering of selection signals.

To gain deeper insights into the specific genes potentially driving this adaptation, we further annotated variants within the key reproduction-related candidate genes (e.g., mys, otk2, puc, Six4, alien, Cbl, Inx2, dally, Rap1, Rtel1, kl-2, 5-HT7 and gb) using snpEff56. This analysis revealed striking allelic differentiation between the Wolbachia-infected (AD and AW) and uninfected (AX) populations for six of these 13 genes (Supplementary Data 12), driven largely by missense mutations suggesting potential alterations in protein function. To assess the likely functional impact of these amino acid changes, we employed the SIFT algorithm57. This predictive analysis identified one missense SNP within the Cbl gene as “deleterious”. As a known negative regulator of EGFR signaling critical for Drosophila oogenesis58, and given that EGFR signaling is also implicated in male germline stem cell regulation and spermatogenesis59, this deleterious Cbl variant likely impacts male reproduction by dysregulating this pathway. Its signature of positive selection suggests Cbl is a target of host adaptation in response to the male-specific pressures imposed by Wolbachia. A potentially deleterious mutation fixed or at high frequency in the infected population suggests strong selection acting on Cbl function, possibly as a direct response to Wolbachia infection or as an adaptation to the altered reproductive environment caused by male scarcity.

This finding pinpoints Cbl as a specific gene, carrying a potentially deleterious mutation under strong selection, that warrants future investigation into host-symbiont coevolution in this system. More broadly, the presence of distinct, functionally relevant allele states (including deleterious variants like the one in Cbl) within key reproductive genes distinguishing infected and uninfected populations provides tangible evidence for the profound genomic reshaping driven by Wolbachia. These findings underscore the potential for Wolbachia infection to drive significant genomic changes in host genes related to reproduction, development, and behavior, highlighting the multifaceted impact of Wolbachia on host biology. By exerting selective pressures on critical biological processes, Wolbachia likely enhances its own transmission success60, while simultaneously altering host sex ratios and driving male extinction through complex, symbiont-mediated genetic changes.

Continue Reading