In this study, we established a methodology for single-molecule sequencing of genomic RNA within cultured influenza virus populations and demonstrated the potential for experimental mutation prediction through sequence distribution analysis. To validate this approach, we sequenced UMI-tagged linearized plasmids. Restricting the analysis to sequences supported by three or more reads sharing the same UMI improved sequence accuracy more than tenfold, yielding an error rate on the order of 10⁻⁵ per bp per read. Assuming that PCR and sequencing introduce errors at a rate of approximately 10⁻³, the probability of two or more errors occurring at the same base position among three reads theoretically decreases to the order of 10⁻⁶. The discrepancy between this theoretical value and our measured data is likely attributable to error-prone regions such as homopolymers and to amplification bias, in which error-containing sequences were preferentially amplified and detected as the majority.
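The theoretical figure quoted above can be reproduced with a short binomial calculation. The sketch below is illustrative only: it assumes that errors in the three reads are independent and occur at a uniform per-base rate of 10⁻³, and the function name is hypothetical rather than part of the analysis pipeline.

```python
from math import comb


def prob_at_least_k_errors(p: float, n_reads: int = 3, k_min: int = 2) -> float:
    """Probability that at least k_min of n_reads reads carry an error at one position.

    Assumes independent errors with the same per-base rate p in every read.
    """
    return sum(
        comb(n_reads, k) * p**k * (1 - p) ** (n_reads - k)
        for k in range(k_min, n_reads + 1)
    )


# Assumed combined PCR + sequencing error rate per base (from the text: ~1e-3).
per_base_error = 1e-3
majority_error = prob_at_least_k_errors(per_base_error)
print(f"P(>=2 of 3 reads wrong at a position) = {majority_error:.2e}")
```

Under these assumptions the result is close to 3p², which is on the order of 10⁻⁶. Requiring the erroneous reads to agree on the same wrong base would lower it further, so this simple model does not account for the measured rate of about 10⁻⁵.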
It is widely acknowledged that PacBio sequencers frequently generate errors in homopolymer regions, and we observed a similar trend. Specifically, when the UMI redundancy threshold was set to 3, error rates in homopolymer regions were higher than those in non-homopolymer regions. However, the use of UMIs significantly reduced this error gap, supporting the efficacy of UMI technology in correcting erroneous reads. This finding indicates that UMI-based methods can enhance the accuracy of PacBio sequencing, especially for genes and sequences that contain homopolymer regions. The approach may also extend to other single-molecule sequencing platforms such as nanopore.
A subsequent analysis of RNA from virus populations derived from single virus particles exhibited elevated error rates when compared to in vitro transcribed RNA, a phenomenon that is presumably attributable to mutations introduced during viral replication. The observed mutation distribution comprised both mutations consistent with a neutral, Poisson-like accumulation and mutations that deviated substantially from a Poisson distribution. This pattern indicates the coexistence of neutral and non-neutral mutations within the viral population, forming a quasi-species structure. The deviation from Poisson expectations suggests that certain mutations were subject to selective pressures, likely influencing replication efficiency or protein function under the specific culture conditions. For instance, the mutation detection rate near the HA antigenic site (amino acids 180–200) was 1.62×10⁻⁴, approximately 1.5 times higher than the genome-wide mutation rate, highlighting a potential hotspot under positive selection. This finding corroborates prior reports of higher sequence variability in antigenic regions (Thyagarajan and Bloom, 2014; Wu et al., 2020). On the other hand, the observed mutation rates among genes do not align with the findings of previous phylogenetic research (Eisfeld et al., 2014) on ‘highly conserved’ and ‘highly divergent’ genes, suggesting a lack of correlation between the distribution size and the evolvability of each gene. Such discrepancies may reflect differences in observation timing. Traditional phylogenetic analyses capture fixed mutations shaped by long-term selection, while our study detects earlier-stage mutations that have yet to undergo full selective filtering. Thus, the weak correlation with phylogenetic conservation likely arises because many observed mutations are still under selection.
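One simple way to screen for sites that deviate from a Poisson expectation is to compare each site's mutation count with the upper tail of a Poisson distribution. The following is a minimal sketch, not the procedure used in this study: the per-site counts are invented for illustration, the Poisson mean is naively estimated as the average count across sites, and the function names and the threshold `alpha` are hypothetical.

```python
from math import exp


def poisson_upper_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def flag_non_poisson_sites(counts: list[int], alpha: float = 1e-3) -> list[int]:
    """Return indices of sites whose count is improbably high under Poisson(mean)."""
    lam = sum(counts) / len(counts)  # assumption: neutral rate = mean count per site
    return [i for i, c in enumerate(counts) if poisson_upper_tail(c, lam) < alpha]


# Hypothetical mutation counts at ten sites; site 6 is a deliberate outlier.
example_counts = [1, 0, 2, 1, 0, 1, 25, 0, 1, 1]
print(flag_non_poisson_sites(example_counts))
```

Because the outlier itself inflates the estimated mean, this screen is conservative. A more careful analysis would estimate the neutral rate robustly and correct for multiple testing across sites.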
A comparison between RNA extracted from viral populations and in vitro transcribed RNA revealed greater protein sequence diversity in the former, as quantified by Shannon entropy. This greater diversity reflects the accumulation of mutations during replication and the latent evolutionary potential of viral populations.
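For readers unfamiliar with the metric, Shannon entropy at an alignment position can be computed as follows. This is a hedged sketch: it assumes aligned protein sequences of equal length, uses base-2 logarithms (bits), and averages over positions. The study's exact choices (logarithm base, per-site versus whole-sequence entropy) are not stated here, and the function names and example sequences are hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(residues: list[str]) -> float:
    """Shannon entropy (in bits) of the residues observed at one alignment column."""
    counts = Counter(residues)
    total = sum(counts.values())
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def mean_site_entropy(sequences: list[str]) -> float:
    """Average per-column entropy across equal-length aligned sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    columns = ([s[i] for s in sequences] for i in range(length))
    return sum(shannon_entropy(col) for col in columns) / length


# Toy example: identical sequences versus a population with one variable site.
uniform_population = ["MKAI", "MKAI", "MKAI", "MKAI"]
diverse_population = ["MKAI", "MKAI", "MRAI", "MRAI"]
print(mean_site_entropy(uniform_population), mean_site_entropy(diverse_population))
```

A population with no variation gives an entropy of zero, and entropy grows as minor variants become more numerous and more evenly distributed.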
The reference sequences employed for mapping viral genomes in this study were derived from single particles that contributed to the formation of each virus population. Nevertheless, subtle differences were observed among the consensus sequences from four virus populations, suggesting that even within the same PR8 strain, various mutations had accumulated during laboratory passaging, resulting in genetically diverse populations at the outset. This finding suggests that the experimental strain already possessed a mutation pool, and the observed mutation distribution reflects this background diversity. Understanding how long-term passaging shapes viral population structure and the origins of mutations will therefore be important for interpreting such distributions.
As this study did not impose specific selective pressures, we did not observe a significant increase in particular mutations previously linked to drug resistance or host adaptation within the populations. However, resistance mutations such as I38M, which have been demonstrated to confer resistance to the endonuclease inhibitor baloxavir (Jones et al., 2021; Taniguchi et al., 2024), were detected (see Figure 3—source data 1 and 2 for a list of all mutations detected). Conversely, mutations fixed in PR8-related strains were already present in populations derived from single particles. These findings imply that the viral quasi-species may serve as a latent genetic reservoir, from which advantageous variants can be selected in response to environmental pressures. While genetic variation was also detected in HA and NA, we did not impose drug or immune selection pressure in this study. Therefore, we did not expect to observe mutations that are already known to confer major antigenic changes in these proteins, and we consider it difficult to speculate on their functional implications in this context. Nevertheless, the detection of resistance-associated mutations indicates that the quasi-species pool may indeed harbor functionally relevant variation, even in the absence of explicit selective pressures. Thus, the real-time observation of mutation proliferation under diverse culture conditions will yield pivotal insights into the mechanisms underlying existing mutation expansion, thereby facilitating the prediction of novel mutations.
The predominant paradigm in evolutionary biology is the neutral evolution hypothesis, which posits that most evolutionary processes can be explained by random genetic drift. Consequently, elucidating the origins of these evolutionary processes is paramount for making accurate evolutionary predictions. A comprehensive analysis of neutral mutations necessitates the quantification of minor variants. The sUMI method was employed to detect mutations present at 0.1% frequency in populations by sequencing 10,000 molecules. Furthermore, the sequencing error rate was reduced to the order of 10⁻⁵, comparable to reverse transcriptase error rates. This enabled theoretical detection of mutations at a frequency of 0.05% by analyzing over 100,000 molecules with high accuracy. It is anticipated that this approach will yield comprehensive insights into mutation occurrence rates and distributions of mutations in neutral evolution. Furthermore, we have demonstrated the applicability of this method for mutation forecasting by using logistic modeling based on mutation fitness and initial frequencies. Subsequent applications will encompass the comprehensive detection and quantitative estimation of adaptive mutations under diverse environmental conditions, including the presence of drugs and different host species.
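The two quantitative ideas in this paragraph, detection sensitivity as a function of sequencing depth and logistic forecasting from fitness and initial frequency, can be illustrated with a minimal sketch. It rests on assumptions not taken from the study: molecules are sampled independently, sequencing errors are ignored, a single observation counts as detection, and the logistic model is the standard deterministic form with a per-generation selection coefficient `s`. The parameterization actually used may differ, and all names are hypothetical.

```python
from math import exp


def detection_probability(freq: float, n_molecules: int) -> float:
    """Probability of sampling a variant at frequency freq at least once.

    Assumes independent sampling of molecules and ignores sequencing errors.
    """
    return 1.0 - (1.0 - freq) ** n_molecules


def logistic_frequency(p0: float, s: float, t: float) -> float:
    """Variant frequency after t generations under deterministic logistic growth.

    p0 is the initial frequency (0 < p0 < 1); s is an assumed selection
    coefficient per generation relative to the rest of the population.
    """
    odds = p0 / (1.0 - p0) * exp(s * t)
    return odds / (1.0 + odds)


# A variant at 0.1% frequency sampled with 10,000 molecules (values from the text).
print(f"detection probability: {detection_probability(0.001, 10_000):.5f}")

# Hypothetical forecast: the same variant with a 10% advantage per generation.
for generation in (0, 25, 50, 100):
    print(generation, f"{logistic_frequency(0.001, 0.1, generation):.4f}")
```

With `s = 0` the forecast frequency stays at its initial value, matching the neutral expectation in the absence of drift. Drift itself is not modelled in this deterministic sketch.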
In summary, experimental evidence has demonstrated the efficacy of UMI technology in reducing sequencing errors and accurately measuring mutation distributions within viral populations. Furthermore, evidence was presented demonstrating that sequence distributions within individual populations manifest non-random directional biases. With the continued development of methods to quantify mutation bias and latent evolutionary potential, it is anticipated that laboratory-scale prediction of drug-induced mutations and pandemic-capable strains will become a reality. The broad distribution of mutations indicates that viral populations possess diverse mutation pools, where selective pressures enhance robustness through adaptive mutation selection. This mechanism signifies the ability of viruses to adapt to environmental changes with flexibility, thereby providing critical insights for predicting long-term viral evolution and pandemic emergence. A more thorough examination of the roles of mutations in the context of adaptive viral evolution in response to drug treatment is merited.
