Genomic analysis indicates open pangenome of Brucella
To examine the genomic features of Brucella, we used 11 different Brucella species of publicly available genomes to characterize and define its pangenome. Core genes were defined as a set of genes shared with all 991 strains, accessory genes presented in at least two strain genomes or up to n-1 strain genomes (n: 991, total number of genomes), and the genes presenting only in one strain genome were defined as unique genes. The pangenome was constructed with the dataset consisting of 582 core genes, 4,121 accessory genes, and 2,462 unique genes from 991 Brucella spp. strains. Functional annotation of genes in the pangenome was conducted utilizing the Clusters of Orthologous Genes database (COGs), with the results revealing a distinctive distribution of functional categories across the three different pangenome subsets (Fig. 1A). Unique genes in the pangenome exhibited a higher ratio of genes annotated to the L category (Replication, recombination, and repair) than core and accessory genes (13.3% vs. 2.3% and 5.6%), with the majority of unique genes that belonged to the L category originating from B. inopinata (73.6%) (Supplementary Fig. 1). Furthermore, we found that the partial L category genes of B. inopinata were related to DNA modification, such as DNA adenine methylation, ISP type restriction/modification enzyme and DNA (cytosine-5-)-methyltransferase, indicating that B. inopinata may possess substantial epigenetic plasticity to aid in niche shift.
A The Brucella spp. pangenome can be divided into three gene subsets: (i) core gene (the gene is present in n genomes, where n represents the total number of genomes), (ii) accessory gene (the gene is present in a number of genomes ranging between 2 and n-1), and (iii) unique gene (the gene is exclusively present in a single genome). The annotation and categorization of genes were performed by the COGs. COG categories encompass the following: Information storage and processing, which encompasses J (Translation, ribosomal structure and biogenesis), A (RNA processing and modification), K (Transcription), L (Replication, recombination and repair), and B (Chromatin structure and dynamics); Cellular processes and signaling, including D (Cell cycle control, cell division, chromosome partitioning), V (Defense mechanisms), T (Signal transduction mechanisms), M (Cell wall/membrane/envelope biogenesis), N (Cell motility), Z (Cytoskeleton), W (Extracellular structures), U (Intracellular trafficking, secretion, and vesicular transport), and O (Posttranslational modification, protein turnover, chaperones); Metabolism, including C (Energy production and conversion), G (Carbohydrate transport and metabolism), E (Amino acid transport and metabolism), F (Nucleotide transport and metabolism), H (Coenzyme transport and metabolism), I (Lipid transport and metabolism), P (Inorganic ion transport and metabolism), and Q (Secondary metabolites biosynthesis, transport and catabolism); Poorly characterized, including S (Function unknown). B The pangenome accumulation plot showed the change in pan genes (purple) and core genes (cyan) with the increasing size of the number of genomes. C The scatter plot displayed the number of accessory genes (X-axis) and unique genes (Y-axis) possessed by 991 Brucella strains, with the color of the points representing the species of each strain.
Pangenome can be classified into two different types based on openness: open pangenome and closed pangenome. A highly open pangenome indicates that species inhabiting diverse environments possess the ability to exchange genetic material through various mechanisms, while species occupying the strained ecological niches find it challenging to acquire exogenous genes, resulting in a closed pangenome28,29. Heap’s law was used to characterize the pangenome openness of Brucella, and γ > 0 indicated an open pangenome for Brucella. For Brucella, we calculated that the relationship between pangenome size (P) and the number of genomes (N) was P = 1260.58*991^0.25, γ equals 0.25, and the pangenome size increased with the addition of new genomes (Fig. 1B), indicating that Brucella had an open pangenome and the pangenome repertoire of these 11 Brucella species was unable to be completely characterized using only the 991 genomes. Subsequently, we counted the number of accessory and unique genes within each strain of Brucella, and the result revealed that the majority of Brucella strains had a low number of unique genes, ranging from 0 to 100. Only five strains, including both two B. inopinata strains, had more than 100 unique genes. The number of accessory genes in most Brucella strains was around 2300. It is worth noting that both B. neotomae strains possessed a relatively low number of accessory genes, and the B. neotomae NCTC10070 strain had 1891 accessory genes, which is significantly lower than in other Brucella strains (Fig. 1C). This lower count is probably ascribed to the comparatively small size of the genome of this particular strain, which has a size of 2.9 Mbp.
Quantifying the zoonotic potential of Brucella spp. via phylogenetic analysis presents substantial complexities
For analyzing the genetic structure among the 991 Brucella strains, we used the core gene alignment result to construct an ML tree of all Brucella strains. In particular, we focused on whether phylogenetic analysis could separate classical zoonotic Brucella species from B. ovis, the only non-zoonotic species identified, and predict the zoonotic potential of other Brucella species accordingly. While Brucella species exhibited a close genetic relationship, they display notable variations in virulence, host preferences, and zoonotic capacity17. B. melitensis, B. abortus, and B. suis biovar 1 and 3 are the primary causative agents of brucellosis in both domesticated animals and humans30,31. Nevertheless, B. ovis infects animals exclusively and is the only known non-zoonotic Brucella species32,33,34,35. As shown in Fig. 2, B. abortus and B. melitensis each formed a highly differentiated clade, while the other clade consisted of a mixture of various Brucella species. The two atypical Brucella species, B. inopinata and B. vulpis, were distant from classical Brucella strains, and the phylogenetic relationship between the two species was also found to be quite distant. Zoonotic Brucella strains were clearly divided into three distinct groups, while the non-zoonotic species B. ovis were located in a separate clade. However, with the exception of these three zoonotic groups and the isolated clade of non-zoonotic species, it was difficult to determine the zoonotic potential of other species such as B. ceti and B. neotomae, based solely on phylogenetic analysis (Fig. 2).

This was a phylogenetic tree of 991 Brucella strains, reconstructed using a maximum-likelihood algorithm (raxml-ng) based on core gene alignment. Different species strains were labeled according to different colored dots. The red (zoonotic) and blue (non-zoonotic) background colors designated the zoonotic potential of Brucella strains.
Unsupervised ML algorithms are incapable of distinguishing between zoonotic and non-zoonotic Brucella strains
To acquire suitable input data for ML, we classified B. ovis, B. suis biovar 5, B. abortus 104 M strain, and B. suis S2 strains as non-zoonotic Brucella strains. Conversely, the zoonotic Brucella dataset encompassed all strains of B. melitensis, B. abortus, and part of the B. suis strains (B. suis biovar 1 and B. suis biovar 3), while excluding the aforementioned non-zoonotic strains. In total, 883 Brucella strains were manually annotated with zoonotic labels, among which 861 were classified as zoonotic pathogens and 22 as non-zoonotic pathogens. 268 genes with statistically significant associations to zoonotic or non-zoonotic phenotypes (P.adjust<0.05, Benjamini-Hochberg method) were identified as features through pan-GWAS analysis (Supplementary Data 1).
We initially employed the PCA algorithm to analyze data from all 991 strains, and PCA plots revealed the presence of non-zoonotic strains that were intermingled with zoonotic strains (Fig. 3A). To ascertain the optimal parameter for the K-Means model, the Elbow method and silhouette coefficient were performed to determine the optimal value for the number of clusters in the unsupervised ML algorithm K-Means, and the results revealed that the optimal number of clusters was three (Supplementary Fig. 2). The K-Means algorithm classified all strains into three clusters, although most non-zoonotic strains were split into cluster 3, while clusters 1 and 2 contained a mixture of both zoonotic and non-zoonotic strains (Fig. 3B). The performance of the DBSCAN clustering algorithm was inadequate in distinguishing between zoonotic strains and non-zoonotic strains (Supplementary Fig. 3). These findings implied that unsupervised ML algorithms encountered challenges in discerning the distinction between zoonotic and non-zoonotic strains with precision.

The red point and blue point represented zoonotic Brucella strains and non-zoonotic Brucella strains, respectively. A The PCA plot of the Brucella strains. PC1 and PC2 represented principal component one and principal component two, respectively. The green dashed line signified the admixture of zoonotic Brucella strains with non-zoonotic Brucella strains. B The K-Means plot of the Brucella strains. The border lines display the ranges of the three clusters.
The supervised ML models based on SVM algorithm can accurately distinguish between zoonotic and non-zoonotic strains
Due to the limited performance of unsupervised ML algorithms in distinguishing between zoonotic and non-zoonotic strains, five different supervised ML algorithms (RF, DTC, SVM, KNN, and MLP) were used to develop models for predicting the zoonotic potential of Brucella strains. Briefly, we divided 883 Brucella strains manually annotated with zoonotic labels into training and test datasets using random stratified sampling to ensure that both datasets maintained a proportional representation of zoonotic and non-zoonotic strains. The training dataset was then utilized to train each ML model and assessed the effectiveness of the models by stratified 3-fold cross validation. This entire process was repeated 100 times, and the results were averaged to obtain a more factual quality score for the models. The results showed that the models constructed with supervised ML algorithms achieved exceptional performance, with accuracy, recall, F1, and AUPRC mean scores all over 0.90, indicating that the supervised ML models were able to distinguish between zoonotic strains and non-zoonotic strains. Further analysis showed that the SVM ML model obtained the highest mean scores for all metrics except for the AUC and AUPRC score. Matthews correlation coefficient (MCC) score is a reliable metric for evaluating the quality of ML model in an imbalanced dataset. Among the five different ML models, the SVM model obtained the highest mean MCC score (0.78), and its MCC score distribution was more concentrated than that of the other ML models. These results suggested that SVM was the most appropriate algorithm for supervised ML model construction for Brucella zoonotic and non-zoonotic strain classification (Fig. 4A-F).

The performance of five algorithms was demonstrated by accuracy score (A), recall score (B), F1 score (C), AUC score (D), AUPRC score (E), and MCC score (F). The box plots marked the median, upper and lower quartiles, and 1.5× inter-quartile range (whiskers); outliers were shown as points (n = 100 for each violin plot).
The SVM prediction models accurately quantify zoonotic capacity of Brucella strains in the test dataset
As the most suitable ML algorithm, we retrained the models using the hyperparameters of SVM models with the top 10% MCC score previously. To predict the zoonotic potential of Brucella strain, the output results of these retrained SVM models were averaged to obtain decision value for quantifying the zoonotic potential of the strain.
The SVM prediction models were designed to generate decision values representing the zoonotic potential of Brucella strains on a continuous scale, with zero established as the threshold. This predictive framework allowed for the classification of zoonotic potential, where positive values indicated zoonotic potential and negative values denoted non-zoonotic potential. Moreover, the magnitude of the decision value reflected the strength of the corresponding classification, providing a quantitative measure of zoonotic potential. As shown in Fig. 5A, most of the Brucella strains from the zoonotic subset of the test dataset obtained decision values above +0.3, while certain B. suis strains demonstrated substantially lower decision values than others within the zoonotic subset. One strain of B. suis bv. 1 from the zoonotic test subset was erroneously assigned a decision value of -0.37 (below the threshold of zero), resulting in its misclassification as a non-zoonotic pathogen. (Fig. 5A).

The red and blue points represented zoonotic Brucella strains (A) and non-zoonotic Brucella strains (B), respectively. The position of the point on the X-axis represented the zoonotic potential of the strain, and the points to the right of zero were predicted to be zoonotic Brucella strains, while those positioned to the left were predicted to be non-zoonotic Brucella strains. The red and blue background colors indicated the zoonotic probability density and the non-zoonotic probability density, respectively. The intensity of the background color correlated with the magnitude of zoonotic potential.
In addition, the SVM prediction models accurately predicted the zoonotic potential of the non-zoonotic subset in the test dataset. The decision values for all non-zoonotic Brucella strains were below the threshold of zero. Notably, the SVM predictive models accurately predicted that the B. ovis had minimal zoonotic potential. B. ovis are generally recognized as non-zoonotic bacteria, exhibited decision values lower than -0.70, which further supported their classification as non-zoonotic bacteria. Two strains of B. suis bv. 5 were assigned decision values of -0.26 and -0.25, respectively, which were higher than those of B. ovis strains, indicating that their non-zoonotic potential was comparatively weaker than that of B. ovis strains (Fig. 5B).
The SVM prediction models exhibit strong generalization ability
Lipopolysaccharide (LPS) serves as a crucial virulence factor in Brucella, playing a pivotal role in the intracellular survival of Brucella within host cells. Brucella spp. are classified into smooth or rough phenotypes depending on whether O-polysaccharide (O-PS) is present on the LPS, and the rough phenotype Brucella strains frequently exhibit significantly attenuated virulence compared to the smooth phenotype Brucella strains, such as B. melitensis and B. abortus36. The natural rough phenotype of Brucella includes B. ovis and B. canis37. Although B. canis is currently the only species within Brucella spp. sharing the same LPS as B. ovis, there have been several reports of human infected by B. canis in recent years, indicating its zoonotic potential15,38,39.
While the SVM prediction models demonstrated excellent predictive performance in the test dataset, it is essential to further evaluate their generalization ability. B. canis, which is the only zoonotic species sharing the absence of O-PS with the non-zoonotic B. ovis species in Brucella spp., provided appropriate data for assessing the generalization ability of SVM prediction models. Therefore, we utilized the SVM prediction models to predict the zoonotic potential of 26 strains of B. canis. The mean decision value for B. canis was +0.49. Among the strains of B. canis, the highest positive decision value was +0.70. Notably, B. canis was not included in the training dataset of the SVM prediction models. The prediction results suggest that B. canis exhibits an intermediate zoonotic potential, which is in correspondence with the current understanding of this pathogen. The result indicated that the SVM prediction models demonstrated a strong generalization ability and can be utilized to estimate the zoonotic potential of other species strains within the genus Brucella (Fig. 6).

The color and position of the points indicated the species and predicted zoonotic potential of the strains, respectively. The red and blue background colors indicated the zoonotic probability density and the non-zoonotic probability density, respectively. The intensity of the background color correlated with the magnitude of the zoonotic potential. Animal silhouettes were obtained from PhyloPic (http://phylopic.org) under CC0 1.0 Universal Public Domain Dedication license.
Multiple various Brucella species display a wide range of zoonotic potential
The high similarity (~97%) at the genomic level within the genus Brucella allows for the utilization of our pan-GWAS-based SVM prediction models among different species of Brucella strains and various host-derived isolates of the same species. The marine mammal-derived Brucellae, including B. ceti and B. pinnipedialis, have been demonstrated to have potential zoonotic capacity40. Here, nine strains of B. ceti were evaluated for zoonotic potential using the SVM prediction models with an average decision value of +0.36. The highest decision value of +0.57 was obtained from a B. ceti strain isolated from a common bottlenose dolphin (Tursiops truncatus), while the lowest decision value of +0.10 was obtained from a B. ceti strain isolated from common dolphin (Delphinus delphis). Similarly, for B. pinnipedialis, the strains of this species exhibited an average decision value of +0.32. The highest decision value of +0.53 was observed in a B. pinnipedialis strain isolated from hooded seal (Cystophora cristata), while the lowest decision value of +0.11 was observed in a B. pinnipedialis strain isolated from harbor seal (Phoca vitulina). These findings highlight the necessity of surveilling zoonotic potential of marine mammal Brucellae. (Fig. 6).
B. neotomae, first isolated from desert woodrats (Neotoma lepida) in the United States in 1957, was previously considered a non-zoonotic bacterium. However, two cases of B. neotomae infected human were reported in 201714, indicating that humans may be part of its host range. The SVM prediction models predicted an average decision value of +0.17 for five B. neotomae strains. Among these strains, the highest decision value was at +0.26, and the lowest decision value was at +0.03 (Fig. 6).
B. microti is the only known species capable of surviving in soil within the Brucella spp.41. In this study, two strains of B. microti were predicted to have zoonotic potential, one strain was isolated from a vole (Microtus arvalis), while the other was found in a pool frog (Pelophylax lessonae). These strains had an average decision value of +0.37. Notably, the strain of B. microti derived from the vole exhibited a higher decision value of +0.54 compared to the strain isolated from the pool frog, which obtained a decision value of +0.21 (Fig. 6).
Apart from B. ceti, B. pinnipedialis, and B. microti, the atypical Brucella species include B. vulpis and B. inopinata10. Two strains of B. vulpis were assessed for their zoonotic potential by the SVM prediction models. The two highly similar strains of B. vulpis were isolated from red foxes (Vulpes vulpes) in 200842. As a result of their similarity, the decision values for both strains were identical, with a value of +0.15. This lack of variation in decision value made it impossible to construct a kernel density plot for B. vulpis.
A strain of B. inopinata BO1 was isolated from the breast implant wound fluid of a 71-year-old female patient in 200843, and its zoonotic potential was predicted to be +0.076 based on the SVM prediction models. Despite the B. inopinata 141012304 strain being isolated from a bluespotted ribbontail ray (Taeniura lymma), its predicted zoonotic potential was slightly higher at +0.080 compared to B. inopinata BO1 ( + 0.076) isolated from the human patient (Fig. 6).
B. suis is divided into 5 various biovars, with B. suis bv 1, 3, and 5 being used to construct the SVM prediction models. The zoonotic potential of B. suis bv. 2 was subsequently predicted using the SVM prediction models, with an average decision value of +0.26. Additionally, B. suis bv. 4 strains obtained an average decision value of +0.42. Two distinct strains of B. suis bv. 4 strains isolated from a reindeer (Rangifer tarandus) and a wild boar were assessed for their zoonotic potential, yielding decision values of +0.29 and +0.54, respectively (Fig. 6).
The zoonotic potential of Brucella strain is affected by both species and isolated host
Statistical analysis further corroborated that the SVM prediction models provided species-specific and host-specific zoonotic potential for Brucella strains. The results revealed that the zoonotic potential of B. abortus and B. melitensis is significantly higher than that of most other Brucella species. Notably, the SVM prediction models provided species-specific decision values for B. melitensis strains in the test dataset based solely on the gene features of the training samples, even in the absence of the input species information for all the Brucella strains, which is the most virulent Brucella species for humans32. Conversely, the zoonotic potential of B. ovis was significantly lower compared to B. abortus, B. melitensis, B. suis, B. canis, B. neotomae, and marine mammal Brucella spp. (Fig. 7A, B).

A Scatter plot showed the zoonotic potential of Brucella strains in the test dataset and various Brucella species strains derived from different hosts, the color of each point indicating the decision value for each Brucella strain. B Statistical significance of the decision value across 8 Brucella species. The single asterisk (*, yellow), double asterisk (**, blue), triple asterisk (***, red), and “ns” indicated statistically significant, very significant, extremely significant, and nonsignificant differences, respectively, from the permutation test. C Box plot showed the predicted zoonotic potential of B. melitensis strains derived from various hosts. Statistical significance of the decision values was determined by the Kruskal-Wallis test (**, P.adjust<0.01; ***, P.adjust<0.001). D Box plot showed the predicted zoonotic potential of the B. suis bv. 2 strains derived from swine or wild boar. Statistical significance of the decision values was determined by the independent samples t-test (**, P < 0.01).
Additionally, B. melitensis strains isolated from Homo sapiens exhibited significantly higher zoonotic potential than those isolated from Bos taurus, Capra hircus, and Ovis aries (Fig. 7C). Among the B. suis bv. 2 strains, the highest decision value was +0.65, isolated from a swine (Sus scrofa domesticus), while the lowest decision value was -0.04, isolated from a wild boar (Sus scrofa) (Fig. 6). The considerable difference between the highest decision value and the lowest decision value indicated a need to compare the zoonotic potential of B. suis bv. 2 strains derived from different hosts. We compared the decision values of Brucella strains isolated from different hosts and found that those isolated from swine exhibited significantly higher zoonotic potential than those isolated from wild boars, despite both belonging to the Sus scrofa species (Fig. 7D).
Model explanations reveal the genes potentially used to estimate the zoonotic potential of Brucella strains
Model explanations were conducted using the SHAP method on a set of 268 genes, which served as the features for constructing the SVM prediction models. The SHAP method offers two distinct perspectives for explaining the impact of features on ML model outputs: a global explanation and a local explanation. As shown in Fig. 8A, all 268 features were ranked according to the mean absolute SHAP values in the global model explanation, identifying the ten most influential features on the predicted results (Fig. 8A). Furthermore, the SHAP dependence plot showed how variations in feature values (gene present or gene absent) affected the predictions for Brucella strains. The feature group_2874 was the most significant contributor to the output of SVM prediction models, resulting in predictions towards zoonotic direction when absent and towards non-zoonotic direction when present, with a greater distance of the shift than that observed in the absence of group_2874. It has been preliminarily identified as an uncharacterized protein, and currently, there is no definitive functional characterization available. This indicated that group_2874 was more likely to be associated with a diminished zoonotic potential of Brucella strains (Fig. 8B).

A SHAP summary bar plot showed the top 10 features in descending order of the mean SHAP absolute values. B SHAP dependence plot showed the impact of 10 features on the decision value output of the SVM prediction models. Each dot represented a single Brucella strain in the test dataset, with red or blue indicating that the feature was present or absent in the genome of the Brucella strain, respectively. C SHAP force plot showed the impact of all features on the predicted results for each Brucella strain in the test dataset. The x-axis represented each sample, and the y-axis represented the predicted zoonotic potential of Brucella strains, with the larger part in red indicating greater zoonotic potential of the Brucella strains. D, E SHAP waterfall plot showed the local explanations for the typical zoonotic pathogen B. melitensis 16 M strain and the non-zoonotic pathogen B. ovis ATCC 25840 strain.
The predicted results for each Brucella strain in the test dataset were explained in Fig. 8C. The upper x-axis represented each sample, while the y-axis represented the contributions of features to the predicted results. As the blue part of the sample increased, the likelihood of the Brucella strain being pathogenic to humans was decreased (Fig. 8C). In addition, the local explanation of two Brucella strains, including the typical zoonotic Brucella strain (B. melitensis 16 M strain, GCA_000740415.1) and non-zoonotic Brucella strain (B. ovis ATCC 25840 strain, GCA_000016845.1), provided a personalized analysis of the predicted results. For instance, the absence of group_2974 resulted in an increase in the predicted zoonotic potential of the B. melitensis 16 M strain. In contrast, the presence of group_2974 in the B. ovis ATCC 25840 strain decreased its predicted zoonotic potential. The local explanation clearly demonstrated the roles that various features play in the zoonotic potential prediction process for individual Brucella strains (Fig. 8D, E).