Database construction and curation
The reference genome data for parasites were sourced from multiple publicly accessible genomic repositories, including the National Center for Biotechnology Information (NCBI) [16], WormBase [17], MalariaGEN [18], the European Nucleotide Archive (ENA) [19], and VEuPathDB [20]. In addition, comprehensive genomic resources were systematically curated through rigorous analysis of peer-reviewed publications that reported whole-genome sequencing assemblies and annotations for parasite species.
Following data retrieval, we implemented rigorous quality control procedures to verify data integrity and consistency, thereby eliminating low-quality or erroneous entries and ensuring overall reliability. For NCBI-annotated reference genomes, we systematically organized genomic metadata into a structured relational database and constructed search indices to enable computationally efficient queries. To construct a high-quality and nonredundant reference database, genome assemblies were screened on the basis of the following criteria: complete genome annotation for coding sequences, and accurate species-level taxonomic classification [21]. Redundant sequences were removed using CD-HIT (v4.8.1) [22] with a sequence identity threshold of 95%. Ambiguous or conflicting taxonomic labels were manually curated through literature review and cross-referencing with the NCBI taxonomy database.
To optimize query efficiency, the database was indexed using memory-mapped technology and structurally optimized to enable rapid large-scale data retrieval. Following its construction, the database was validated with sequencing data from reference samples to ensure accuracy and consistency. Given the dynamic nature of genomic data, the database is scheduled for quarterly updates. These updates will adhere to a standardized protocol involving automated data retrieval pipelines, multistage quality control measures, and peer-reviewed manual curation to preserve longitudinal data integrity and clinical relevance.
Data management
Efficient data management was essential to the development and deployment of PGIP for parasite genome identification. Upon submission, sequencing files (in FASTQ/FASTA format) were securely stored in a distributed file system and systematically organized by project identifier, sample metadata, and submission timestamp. This structure streamlined retrieval and preserved data integrity during analysis. To safeguard sensitive data, all transmissions employed protocols such as HTTPS and AES-256 encryption, while role-based access control (RBAC) enforced strict privacy compliance. The platform adhered to a data retention policy under which analysis results were securely stored for 180 days prior to archiving, with automated notifications sent to users before deletion and provisions for long-term export and preservation. This data management strategy ensures that PGIP operates with high efficiency, accuracy, and data integrity while providing reliable and reproducible results for parasite genome identification.
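The retention window can be illustrated with a short Python sketch. Only the 180-day period comes from the policy described above; the 7-day notice period, the function name, and the dictionary keys are hypothetical choices made for illustration and do not describe the platform's actual implementation.

```python
from datetime import datetime, timedelta
from typing import Dict

RETENTION_DAYS = 180  # retention period stated in the policy
NOTICE_DAYS = 7       # hypothetical: how long before the deadline users are notified


def retention_schedule(completed_at: datetime) -> Dict[str, datetime]:
    """Return when to notify the user and when the results leave active storage."""
    archive_at = completed_at + timedelta(days=RETENTION_DAYS)
    notify_at = archive_at - timedelta(days=NOTICE_DAYS)
    return {"notify_at": notify_at, "archive_at": archive_at}


schedule = retention_schedule(datetime(2024, 1, 1))
print(schedule["notify_at"].date(), schedule["archive_at"].date())
```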
Design and workflow of PGIP
Data preprocessing
PGIP supports the input of raw paired-end sequencing data in FASTQ format and preprocessed FASTA-formatted sequences from NGS platforms, including compressed files (such as .gz and .tar). The maximum data size for each sample is 20 Gb. To ensure analytical accuracy, raw FASTQ inputs are subjected to stringent quality control (QC) prior to downstream analysis, including artifact removal and filtration of nontarget sequences. The standardized QC workflow comprises three steps:
1. Adapter removal: sequencing adapters introduced during library preparation are trimmed using Trimmomatic [23] to minimize platform-specific bias.
2. Quality filtering: low-quality reads (Phred score < 20) and short fragments (< 50 bp) are filtered using Trimmomatic [23]. Quality metrics (e.g., per-base sequence quality, GC content) are visualized using FastQC [24] before and after processing to validate improvements.
3. Host DNA depletion: reads are aligned to the host reference genome (e.g., GRCh38 for human samples) using Bowtie2 v2.4.5 [25] with the very sensitive local preset (--very-sensitive-local). Nonhost (unmapped) reads are retained for downstream pathogen analysis.
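The thresholds in step 2 can be illustrated with a short Python sketch. This is not how Trimmomatic works internally (Trimmomatic offers sliding-window and end-trimming steps). The sketch assumes, for illustration only, that a read is discarded when its mean Phred score is below 20 or its length is below 50 bp, and that qualities are Phred+33 encoded. All function names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (identifier, sequence, quality string)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_qc(seq: str, quality: str,
              min_phred: float = 20.0, min_len: int = 50) -> bool:
    """Keep a read only if it is long enough and of sufficient mean quality."""
    return len(seq) >= min_len and mean_phred(quality) >= min_phred


def filter_reads(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield the reads that pass both thresholds."""
    for name, seq, qual in reads:
        if passes_qc(seq, qual):
            yield (name, seq, qual)


reads = [
    ("r1", "A" * 60, "I" * 60),  # Q40, 60 bp: kept
    ("r2", "A" * 60, "#" * 60),  # Q2, 60 bp: dropped for low quality
    ("r3", "A" * 30, "I" * 30),  # Q40, 30 bp: dropped for length
]
kept = [name for name, _, _ in filter_reads(reads)]
print(kept)
```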
Parasite identification
Following QC, the cleaned data are automatically analyzed by the identification modules within PGIP, which perform taxonomic classification and generate diagnostic reports. These modules provide two identification methods: read-mapping-based identification of parasite genomes and analysis of assembled data.
Reads mapping-based identification of parasite genomes
Kraken2 was used to construct a comprehensive reference genome database for the studied parasites [26]. Genome sequences were indexed using the kraken2-build command to enable rapid sequence retrieval and classification. The database comprised reference genomes for human and zoonotic parasites, including helminths and protozoa. This taxonomic diversity ensured broad coverage of clinically relevant species, thereby enhancing the accuracy and reliability of parasite identification.
Species identification was performed using the k-mer-based approach of Kraken2, which classifies sequencing reads against the reference database. Kraken2 segments each sequence into k-mers (contiguous nucleotide subsequences of length k) and matches these to precomputed, taxon-specific k-mers in the database, thereby enabling fast and precise taxonomic classification. The database and its index file were memory-mapped to enable rapid access. Query sequences were split into k-mers and matched against the reference database to assign taxonomic labels and calculate match counts. A hierarchical classification tree was constructed from the taxon-specific match counts, and the taxonomic lineage with the highest cumulative score was assigned as the definitive classification. This method also quantified the relative abundance of each parasite species within the sequencing dataset.
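The lineage-scoring idea described above can be sketched in a few lines of Python. This is a toy illustration, not Kraken2's implementation: the taxonomy, the k-mer table, and the value k = 4 are hypothetical, the real tool stores minimizers in a compact hash table, and ties are resolved here simply by first occurrence.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child -> parent (the root has no parent).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Plasmodium": "root",
    "Plasmodium falciparum": "Plasmodium",
    "Plasmodium vivax": "Plasmodium",
}


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def path_score(taxon: str, hits: Counter) -> int:
    """Sum the k-mer hits of a taxon and all of its ancestors."""
    score, node = 0, taxon
    while node is not None:
        score += hits.get(node, 0)
        node = PARENT[node]
    return score


def classify(read: str, kmer_to_taxon: Dict[str, str], k: int) -> Optional[str]:
    """Assign the taxon whose root-to-node path collects the most k-mer hits."""
    hits = Counter(kmer_to_taxon[km] for km in kmers(read, k) if km in kmer_to_taxon)
    if not hits:
        return None  # no k-mer matched: the read stays unclassified
    return max(hits, key=lambda taxon: path_score(taxon, hits))


# Hypothetical k-mer table mapping each k-mer to a taxon.
DB = {
    "ACGT": "Plasmodium",
    "CGTA": "Plasmodium falciparum",
    "GTAC": "Plasmodium falciparum",
    "TTTT": "Plasmodium vivax",
}
result = classify("ACGTAC", DB, k=4)
print(result)
```

In this example the read contributes one hit at the genus level and two at the species level, so the path ending at Plasmodium falciparum accumulates the highest score.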
The Kraken2 output comprised species identification results accompanied by detailed taxonomic information. Statistical analyses were performed to generate ecological indices, including species composition, paired read counts, and relative abundance.
Assembly-based identification of parasite genomes
The clean sequencing data were assembled using MEGAHIT [27], which constructs extended contig sequences through the iterative assembly of short reads. This assembler employs a multi-k-mer iterative strategy to construct succinct de Bruijn graphs (SdBG) through stepwise optimization cycles (k = 21–141 in increments of 12). During iterative assembly, smaller k-mers (k = 21–129) facilitated error correction and gap closure in low-coverage regions by filtering spurious connections and enhancing sequence continuity. Conversely, the largest k-mer (k = 141) improved resolution of homologous repetitive elements through extended sequence context [28]. Following each assembly iteration, systematic graph refinement procedures were applied, including (1) trimming terminal branches (tips) < 2 kbp, (2) collapsing parallel sequence variants (bubbles) with ≥ 95% similarity, and (3) eliminating graph edges with local coverage below 2×. These optimization strategies collectively generated high-fidelity contigs with enhanced structural integrity and sequence accuracy for downstream analyses.
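The k-mer schedule and the coverage-based edge filter can be made concrete with a small Python sketch. It is a simplified illustration rather than MEGAHIT's succinct graph implementation: k-mers are counted in a plain dictionary, each k-mer is treated as an edge between its (k-1)-mer prefix and suffix, and the function names and toy reads are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def k_schedule(k_min: int = 21, k_max: int = 141, k_step: int = 12) -> List[int]:
    """k values visited by the iterative assembly, from k_min to k_max."""
    return list(range(k_min, k_max + 1, k_step))


def edge_coverage(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer; in a de Bruijn graph each k-mer is one edge."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def prune_low_coverage(counts: Counter, min_cov: int = 2) -> Dict[Tuple[str, str], int]:
    """Keep edges (prefix -> suffix) observed at least min_cov times."""
    return {(kmer[:-1], kmer[1:]): c for kmer, c in counts.items() if c >= min_cov}


ks = k_schedule()
toy_reads = ["ACGTA", "ACGTA", "ACGTT"]  # toy reads; k = 4 only for readability
edges = prune_low_coverage(edge_coverage(toy_reads, k=4))
print(ks)
print(edges)
```

With the values stated above, the schedule contains 11 k values. In the toy graph, the edge that appears in only one read (CGTT) falls below the 2× threshold and is removed.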
Taxonomic binning was subsequently performed using MetaBAT [29], a probabilistic clustering tool that integrates contig abundance profiles and tetranucleotide frequency (TNF) patterns to reconstruct metagenome-assembled genomes (MAGs). Leveraging the taxonomically conserved nature of oligonucleotide composition in microbial genomes, MetaBAT first calculated the TNF-based probabilistic distances between contigs, which reflect sequence compositional similarity. Simultaneously, abundance profiles were derived from read alignment depths across samples, to capture genomic coverage variations indicative of population-specific replication rates. These two metrics were empirically weighted to construct a composite probabilistic distance matrix, which enables iterative hierarchical clustering of contigs through a graph-based algorithm. The resulting bins exhibited high phylogenetic resolution, with minimal cross-clade contamination, as validated by marker gene completeness and redundancy assessments.
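To make the compositional signal concrete, the following Python sketch computes a tetranucleotide frequency vector and compares two sequences. It is a simplified stand-in: it counts all 256 tetramers without merging reverse complements and uses a plain Euclidean distance, whereas MetaBAT relies on an empirically derived probabilistic distance combined with abundance. The function names and example sequences are hypothetical.

```python
from itertools import product
from math import sqrt
from typing import List

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 tetramers


def tnf(seq: str) -> List[float]:
    """Tetranucleotide frequency vector; windows with ambiguous bases are skipped."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    total = 0
    for i in range(len(seq) - 3):
        tetramer = seq[i:i + 4]
        if tetramer in counts:
            counts[tetramer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two frequency vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


contig_a = "ACGT" * 10
contig_b = "ACGT" * 5   # same composition as contig_a, shorter
contig_c = "AAAA" * 10  # very different composition
d_similar = euclidean(tnf(contig_a), tnf(contig_b))
d_different = euclidean(tnf(contig_a), tnf(contig_c))
print(d_similar < d_different)
```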
Finally, taxonomic classification of MAGs was performed using the Contig Annotation Tool (CAT, v5.2) [30]. CAT classified long DNA sequences and MAGs by predicting genes, aligning open reading frames (ORFs) to the NR protein database, and applying a majority voting mechanism to the individual ORF assignments. The resulting classification scores were analyzed to identify parasite species within the bins.
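The voting step can be illustrated with a minimal Python sketch. It assumes a plain majority rule over per-ORF taxon labels, with unannotated ORFs ignored; the scoring actually used by CAT may differ, and the function name and the 0.5 threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_taxon(orf_taxa: Iterable[Optional[str]],
                   min_fraction: float = 0.5) -> Optional[str]:
    """Return the taxon backed by more than min_fraction of the annotated ORFs."""
    votes = Counter(taxon for taxon in orf_taxa if taxon is not None)
    total = sum(votes.values())
    if total == 0:
        return None  # no ORF received an annotation
    taxon, count = votes.most_common(1)[0]
    return taxon if count / total > min_fraction else None


orf_labels = ["Schistosoma japonicum"] * 3 + ["Schistosoma haematobium", None]
assigned = majority_taxon(orf_labels)
print(assigned)
```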
Integration of workflow and report output
Integrated analytical workflows were developed using Nextflow [31] to systematically execute multiple bioinformatics processes. To generate an identification report, a Python program was used to extract the 10 most frequently identified parasites from the results, including the Latin names of the detected parasites, the number of detected sequences, and their relative abundance (relative abundance = species-specific reads × 100 / total reads identified at the species level). The identification report also includes the data quality control results.
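The report calculation can be sketched as follows. This is an illustrative reimplementation of the formula given above, not the program used in PGIP; the function name, the input dictionary, and the read counts are hypothetical.

```python
from typing import Dict, List, Tuple


def top_parasites(species_reads: Dict[str, int], n: int = 10) -> List[Tuple[str, int, float]]:
    """Return (species, reads, relative abundance in %) for the n most abundant species."""
    total = sum(species_reads.values())  # total reads identified at the species level
    if total == 0:
        return []
    # Sort by read count (descending), then by name for a stable order.
    ranked = sorted(species_reads.items(), key=lambda item: (-item[1], item[0]))
    return [(name, reads, reads * 100 / total) for name, reads in ranked[:n]]


# Hypothetical species-level read counts.
counts = {
    "Plasmodium vivax": 200,
    "Plasmodium falciparum": 750,
    "Schistosoma japonicum": 50,
}
report = top_parasites(counts)
for name, reads, abundance in report:
    print(f"{name}\t{reads}\t{abundance:.2f}%")
```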
Evaluation of parasite identification
To evaluate the performance of PGIP, we selected a panel of public and in-house sequencing datasets representing clinically relevant human parasites. Parasite species were selected to ensure taxonomic diversity and included soil-transmitted helminths (e.g., Ascaris lumbricoides), food-borne parasites (e.g., Clonorchis sinensis), vector-borne parasites (e.g., Plasmodium spp.), and morphologically similar species from the same genus (e.g., Schistosoma japonicum and Schistosoma haematobium). We used sequencing data from diverse specimen types to assess platform performance under varying levels of host-derived contamination. These included stool samples (characterized by substantial background interference from host and microbial sources), blood samples (containing abundant host background), cerebrospinal fluid samples (with limited host content), parasitic samples (exhibiting minimal host interference), and amplicon sequencing samples (PCR-amplified parasitic gene fragments). Furthermore, a negative sample was included in the evaluation.
Public datasets were obtained from the European Nucleotide Archive (ENA). The remaining sequencing datasets were generated in-house as part of the parasitic disease surveillance project conducted at the Jiangsu Institute of Parasitic Diseases (for details, see Supplementary File S2: Datasets for evaluation of parasite identification).
The sequencing data were uploaded to the platform, and the default analysis workflow (read-mapping-based identification module) was executed. This approach directly maps high-quality sequencing reads to the curated parasite genome database, and is optimized for clinical and metagenomic samples without requiring genome assembly. The assembly-based mode is also available within the platform for users who wish to analyze preassembled contigs or scaffolds.