Many medically relevant genes reside in “dark regions” of the genome that have long been elusive. To address this, we developed Paraphase – a computational tool that accurately resolves and analyzes paralogous genes. By unlocking the difficult-to-analyze regions of the genome where paralogous genes reside, Paraphase provides deeper insights into genetic variation, disease mechanisms and population diversity. This knowledge helps lay the groundwork for improved diagnostics, more inclusive reference genomes and future discoveries in genomic medicine. This multi-institutional study, led by PacBio, was published in Nature Communications.
Shedding light on the genome’s dark regions
We wanted to overcome a longstanding challenge in genomics: highly similar paralogous genes. These genes often reside within segmental duplications (SDs), which are large, repeated regions of DNA with nearly identical sequences. The repetitiveness of SDs complicates variant calling and copy number analysis, meaning traditional short-read sequencing technologies struggle to resolve these regions, leaving many genomic regions understudied and conditions undiagnosed.
To further our mission of accurately analyzing previously “dark” regions of the genome, we set out to design a tool that phases and analyzes SDs with high accuracy and throughput. We also wanted to examine how copy number variations (CNVs) in certain paralogous genes differ across ancestries, and how those differences affect disease risk in different populations. Finally, we wanted to show that understanding genetic diversity, including copy number differences, is key to building inclusive reference genomes and advancing equitable genomic medicine.
Paraphase uncovers genetic variations in segmental duplications in global populations
We developed a computational tool, Paraphase, to resolve SDs so that we could accurately assess paralogs and their copy numbers.
Before applying Paraphase to new data, we validated it on samples carrying known pathogenic variants and confirmed its accuracy. We then extended our analysis to 160 SD regions, spanning 316 genes. Samples came from 259 individuals across five ancestral groups: South Asian, European, African, Latin American and East Asian. The goal was to identify patterns of population-specific diversity and potential reference genome errors. Additionally, we examined 36 parent–offspring trios to detect de novo variants and gene conversion events.
The key findings of the study were:
- Paraphase enabled the analysis of medically important genes and associated diseases, such as those implicated in spinal muscular atrophy (SMN1/SMN2) and congenital adrenal hyperplasia (CYP21A2).
- We observed high copy number variability in many gene families within segmental duplications across people of different ancestries.
- We discovered a new approach for identifying false duplications in the reference genome.
- We identified 23 paralog groups with exceptionally low genetic diversity between a gene and its paralogs, indicating that frequent gene conversion and unequal crossing-over may keep these gene copies highly similar.
Diverse genomic insights improve disease research and diagnosis
Our study demonstrates that using long-read HiFi sequencing in conjunction with our computational tool, Paraphase, provides a much richer and more detailed picture of genetic variation, specifically in complex SDs. By improving our ability to call disease-linked variants that are often missed by other technologies, Paraphase opens up new avenues for disease research.
For example, using Paraphase, we disentangled in a single test medically important gene families that previously required specialized, multi-step assays. In the CYP21A2/CYP21A1P region – where mutations cause congenital adrenal hyperplasia – we characterized a previously overlooked duplication allele carrying both a functional CYP21A2 copy and a nonfunctional CYP21A2(Q319X) copy. With standard tests, this duplication allele could easily have been misclassified.
Our study further highlights the power of long-read sequencing in detecting de novo variations, particularly in previously inaccessible parts of the genome. We uncovered seven previously undetected de novo single nucleotide variants (SNVs) and four de novo gene conversion events, two of which were non-allelic – a level of detail not possible with traditional sequencing approaches.
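To make the trio logic concrete, here is a minimal sketch in Python of the basic idea behind flagging candidate de novo variants: keep what the child carries and neither parent does. It is not how Paraphase itself works. The function name `candidate_de_novo`, the tuple representation of a variant and the toy coordinates are all hypothetical, and a real analysis would also need phasing, read-level evidence and quality filtering.

```python
from typing import Set, Tuple

# Hypothetical representation of a variant: (chromosome, position, ref, alt).
Variant = Tuple[str, int, str, str]


def candidate_de_novo(
    child: Set[Variant], mother: Set[Variant], father: Set[Variant]
) -> Set[Variant]:
    """Return variants present in the child but absent from both parents."""
    return child - (mother | father)


# Toy data for illustration only; these are not results from the study.
child = {("chr5", 100, "A", "G"), ("chr5", 250, "C", "T"), ("chr6", 500, "G", "A")}
mother = {("chr5", 100, "A", "G")}
father = {("chr6", 500, "G", "A")}

candidates = candidate_de_novo(child, mother, father)
print(sorted(candidates))
```

In this toy trio, only the variant at position 250 on chr5 is flagged, because each of the other two is inherited from one parent. In highly similar paralogs, the difficult part is assigning each variant to the correct gene copy before a comparison like this can be made.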
Additionally, our approach revealed high variation in copy number distributions across paralog groups in different ancestries. This finding reinforces the need for more genetically diverse reference genomes, as current reference genomes are often biased toward European populations.
Paraphase provides a method for studying paralogous genes at scale, offering new opportunities for disease research, population-wide analysis and potentially even clinical testing. By broadening our understanding of genetic variation across ancestries, we can better understand how certain diseases impact specific populations, paving the way for more targeted diagnoses and treatment approaches.
By enabling more accurate identification of de novo variants and gene conversion events, our approach provides deeper insights into how genetic disorders arise and how traits are inherited. These discoveries offer a clearer view of genetic inheritance patterns and help reveal the underlying mechanisms of disease.
It should be noted that the current study focuses exclusively on gene families with fewer than 10 genes. Larger and more complex gene families were not included, meaning some medically important regions have yet to be studied. Additionally, the study is limited to assessing DNA-level variation in paralogs and does not explore transcriptomic or epigenetic factors, such as RNA expression or methylation differences between gene copies.
A broader lens: From genomics to multiomics
Looking ahead, we would like to extend Paraphase to study larger gene families, which were excluded from the current study. We’re also interested in applying Paraphase to investigate RNA-level differences and the transcriptional activity of paralogs that are very similar in sequence. It would be beneficial to explore epigenetic regulation with Paraphase, as it could provide further insights into how paralogous genes are controlled and expressed.
Reference: Chen X, Baker D, Dolzhenko E, et al. Genome-wide profiling of highly similar paralogous genes using HiFi sequencing. Nat Commun. 2025;16(1):2340. doi:10.1038/s41467-025-57505-2