Diagnostic delay remains a critical barrier affecting the lives of people with PCD. We demonstrated the feasibility of integrating confirmed patient records from disease registries into large health insurance databases with national-level coverage, enabling the development of ML systems to identify individuals at high risk for PCD (as an example of a screening tool for rare disease). This work was made possible through a multidisciplinary effort led by patient advocates, researchers, and clinicians to develop a detailed, knowledge-engineered representation of the insurance claims profile of patients with PCD. Although the model has not yet been validated on an independent dataset, this work may serve as the basis for future ML efforts in rare disease detection.
We developed a screening cohort of cases in the claims database with diagnostic, drug, and procedural codes associated with PCD (Appendix 1). Analysis of this cohort identified clinical features that clinicians may consider when evaluating patients (Table 2). For example, clinicians who encounter patients with situs anomalies, including those with isolated congenital heart malformations, should screen for classic PCD-related symptoms, including year-round wet cough and year-round nasal congestion since infancy. When these symptoms are present, evaluation for cystic fibrosis (CF) may be necessary, but a diagnosis of PCD seems more likely to explain them, especially when CF newborn screening is negative.
Although we did not validate our model on an independent dataset or recall patients for diagnostic testing, we did evaluate its performance using several key metrics. We assessed positive predictive value and sensitivity within a fivefold cross-validation framework to measure how well the random forest model generalizes to unseen data. Notably, the inclusion of patients with Q34.8 and EM codes led to improved model performance (Fig. 3C), suggesting that expanding the positive case pool can help mitigate the challenges of imbalanced datasets. However, this expansion likely introduced false positives into the training set, and it is therefore not a reliable strategy for building robust training sets in future studies. This highlights the critical need for gold-standard confirmed cohorts, because readily available claims data, while abundant, introduce significant diagnostic noise that can undermine model accuracy. Larger positive cohorts can be assembled through collaborations with patient organizations or medical centers, or potentially through AI-driven approaches such as generating synthetic positive cases. Unexpectedly, ADASYN augmentation did not improve model performance in our tests, regardless of the threshold or training set composition, possibly due to the small size of the training data and the complexity of the feature space (Fig. 3B, D).
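The evaluation scheme described above can be illustrated with a minimal, standard-library sketch of stratified fivefold cross-validation that reports positive predictive value and sensitivity per fold. This is not the pipeline used in the study: the random forest is replaced here by a toy threshold rule (`fit_threshold`), and the data, function names, and fold assignment are all illustrative assumptions.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Assign each index to one of k folds, keeping the class ratio similar per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def ppv_and_sensitivity(y_true, y_pred):
    """PPV = TP / (TP + FP); sensitivity = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    return ppv, sens

def fit_threshold(scores, labels):
    """Toy stand-in for model training (assumption): midpoint between class means."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def cross_validate(scores, labels, k=5, seed=0):
    """Train on k-1 folds, evaluate on the held-out fold, repeat for every fold."""
    results = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        threshold = fit_threshold([scores[i] for i in train],
                                  [labels[i] for i in train])
        y_true = [labels[i] for i in held_out]
        y_pred = [1 if scores[i] >= threshold else 0 for i in held_out]
        results.append(ppv_and_sensitivity(y_true, y_pred))
    return results

# Synthetic, imbalanced toy data (not study data): 10 positives, 40 negatives.
labels = [1] * 10 + [0] * 40
scores = [0.9] * 8 + [0.2] * 2 + [0.1] * 35 + [0.8] * 5
for fold, (ppv, sens) in enumerate(cross_validate(scores, labels), start=1):
    print(f"fold {fold}: PPV={ppv:.2f}, sensitivity={sens:.2f}")
```

Stratifying the folds matters when positives are rare, because an unstratified split can leave a held-out fold with no positive cases, which makes sensitivity undefined for that fold.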
Another way to assess model performance is to compare the number of patients the model predicts to have PCD with the expected number of cases in the screening cohort. The model classified 7705 patients as positive, which is broadly consistent with the 8667 patients with PCD we anticipated based on our initial calculations (Fig. 2). This result is promising; however, because these patients remain deidentified, we cannot determine the true positive predictive value through PCD diagnostic testing.
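As a simple arithmetic check on this comparison, the two counts reported above can be expressed as a ratio. The variable names are illustrative; only the two counts come from the text.

```python
predicted_positive = 7705  # patients the model classified as positive
expected_cases = 8667      # PCD cases anticipated in the screening cohort (Fig. 2)

ratio = predicted_positive / expected_cases
shortfall = expected_cases - predicted_positive

print(f"predicted / expected = {ratio:.1%}")        # 88.9%
print(f"expected minus predicted = {shortfall}")    # 962
```

Agreement at the level of counts says nothing about whether the same individuals are involved, which is why it cannot substitute for a positive predictive value confirmed by diagnostic testing.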
We reviewed the relative importance of clinical features in the final model (Fig. 3D). Features were not assigned predefined weights; those selected more frequently and contributing more to impurity reduction received higher importance scores. Reducing impurity improves the model’s ability to correctly classify patients based on their PCD likelihood. The top 10 features suggest that patients investigated for suppurative respiratory disease, or receiving chronic therapies for it, may have unrecognized PCD and should undergo PCD diagnostic testing (Fig. 4) [8]. The top feature was situs inversus, which led to an 8.76% reduction in impurity, compared with an average reduction of 1.56% across all features. This result is unsurprising, as slightly less than 50% of patients with PCD have situs inversus totalis, and an additional 12% have more complex laterality defects with situs ambiguus [19]. The emergence of hypertonic saline prescription as a prominent feature, despite its limited demonstrated efficacy in confirmed PCD cohorts, suggests that it acts as a surrogate marker for generalized chronic airway disease management in patients before or during their diagnostic journey. This highlights how claims-based features can capture clinical suspicion or treatment patterns rather than solely definitive interventions for a specific diagnosis. Finally, several features that we might expect to contribute to positive predictions, such as airway clearance or inhaled hypertonic saline, were not among the top features, which we speculate may have resulted from manual grouping of codes to form the broad feature categories. For example, “inhaled hypertonic saline” may have been grouped with “DRUG-mucolytics” as a broad feature category. Modern deep learning models can learn feature representations automatically and may therefore avoid this issue [20].
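To make the notion of impurity reduction concrete, the following sketch computes the weighted decrease in Gini impurity for a single binary split. The Gini criterion is an assumption made here for illustration (the text specifies only "impurity"), and the split counts are invented. In a random forest, such decreases are accumulated over every node that splits on a feature and averaged across trees to give that feature's importance.

```python
def gini(pos, neg):
    """Gini impurity of a node holding `pos` positive and `neg` negative cases."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)

def impurity_decrease(parent, left, right):
    """Weighted Gini decrease when `parent` (pos, neg) is split into left/right."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(*left) + (sum(right) / n) * gini(*right)
    return gini(*parent) - weighted_children

# Hypothetical split on a binary feature: cases with the feature go left,
# cases without it go right.
parent = (50, 50)   # (PCD-positive, PCD-negative)
left = (40, 10)
right = (10, 40)
print(round(impurity_decrease(parent, left, right), 2))  # 0.18

# Ratio of the top feature's reported importance to the reported average.
top_vs_average = 8.76 / 1.56
print(round(top_vs_average, 1))  # 5.6
```

On the reported figures, situs inversus contributed roughly 5.6 times the average impurity reduction per feature.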
Limitations
There are several important limitations to this approach. First, our training set of positive cases linked from the PCDFR was small. Given the small size of our training dataset (82 pediatric patients) relative to the more than 55 genes associated with PCD and its phenotypic heterogeneity, it is possible that certain rare PCD subtypes, such as those with MCIDAS, CCNO, or FOXJ1 mutations and hydrocephalus as a predominant feature, are underrepresented or entirely absent, potentially limiting the generalizability of our findings. In addition, we used hashing technology to identify confirmed cases from the PCDFR in the claims database, which increases the overall accuracy of identifying patients but also introduces a risk of ‘collisions’, where separate patients are incorrectly conflated into a single patient record. We also assumed that the background cohort in the claims database was PCD-negative, an assumption that may not hold for patients with undiagnosed PCD. Further, in accordance with patient privacy protections, we were unable to reidentify patients and confirm the presence or absence of a PCD diagnosis in the Q34.8 + EM sub-cohort of the positive class. This lack of individual-level confirmation is a limitation, as it is possible that EM was conducted because of a suggestive history but ultimately yielded negative results, a scenario not captured in our population-level data.
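The linkage risk described above can be sketched as follows. This is a hypothetical illustration of hash-based record linkage, not the tokenization method used in the study: the identifier fields, the normalization rules, the salt, and the choice of SHA-256 are all assumptions, and the records are synthetic.

```python
import hashlib

def normalize(value):
    """Lowercase and strip whitespace so formatting differences do not block a match."""
    return "".join(value.lower().split())

def patient_token(first, last, dob, sex, salt="demo-salt"):
    """Derive a deidentified token from identifiers (hypothetical field choice)."""
    key = "|".join([normalize(first), normalize(last), dob, sex.upper()])
    return hashlib.sha256((salt + key).encode("utf-8")).hexdigest()

# Synthetic registry and claims records.
registry = [("Ana", "Lopez", "2012-03-04", "F")]
claims = [
    {"claim_id": "c1", "person": ("ana", "LOPEZ ", "2012-03-04", "f")},
    # A different individual who happens to share the same identifiers:
    {"claim_id": "c2", "person": ("Ana", "Lopez", "2012-03-04", "F")},
    {"claim_id": "c3", "person": ("Ben", "Ng", "2010-07-19", "M")},
]

registry_tokens = {patient_token(*person) for person in registry}
linked = [c["claim_id"] for c in claims
          if patient_token(*c["person"]) in registry_tokens]
print(linked)  # ['c1', 'c2']
```

In this sketch, normalization lets record `c1` match despite formatting differences, but it cannot distinguish `c2`, a separate person with identical identifiers. That is the kind of conflation referred to above as a collision; it arises from shared inputs rather than from the hash function itself.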
The claims database on which we developed the training and screening cohorts was a unified, national-scale insurance claims database that covered many relevant populations but did not capture neonates. Given that situs inversus totalis with neonatal respiratory distress is sensitive and specific for PCD, future machine-learning methods should aim to include neonates in the study population. Medicare patients are also not represented in the claims database. Including these populations may improve model performance in a general population [21]. There are also limitations inherent to the use of claims data. Notably, the presence of a procedure code does not indicate the outcome or result of that procedure, and a drug code does not confirm that the prescription was filled or that the patient adhered to it. Furthermore, claims data may lack granular clinical details, temporal information beyond service dates, and insights into patient behavior or lifestyle factors that could influence health outcomes.
This work serves as a foundational methodology, designed with a lightweight implementation to ensure it operates efficiently on a small-scale analysis platform. We used a tabulated approach that resulted in a reduced set of features that summarized patients’ clinical experiences in any given year and then used only the maximum feature value across all available years as the final feature used in the analysis.
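The aggregation step described above, in which yearly feature values are collapsed to a single maximum per patient and feature, can be sketched as follows. The row layout, patient identifiers, values, and the feature name `DX-situs_inversus` are hypothetical, and the default of 0 assumes non-negative feature values such as counts.

```python
from collections import defaultdict

# Synthetic yearly summaries: (patient_id, year, feature, value)
yearly_rows = [
    ("p1", 2018, "DX-situs_inversus", 1),
    ("p1", 2019, "DRUG-mucolytics", 2),
    ("p1", 2020, "DRUG-mucolytics", 5),
    ("p2", 2019, "DRUG-mucolytics", 1),
]

def max_over_years(rows):
    """Keep, for each patient and feature, the largest value seen in any year."""
    final = defaultdict(dict)
    for patient_id, _year, feature, value in rows:
        current = final[patient_id].get(feature, 0)  # assumes values are >= 0
        final[patient_id][feature] = max(current, value)
    return {pid: dict(features) for pid, features in final.items()}

final_features = max_over_years(yearly_rows)
print(final_features)
# {'p1': {'DX-situs_inversus': 1, 'DRUG-mucolytics': 5}, 'p2': {'DRUG-mucolytics': 1}}
```

This keeps the feature table small, one row per patient, at the cost of discarding the order and timing of events, which is one reason time-series inputs are a natural extension.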
Future directions
We demonstrated the feasibility of ML methods for patient screening for PCD based on national-level insurance claims data in the absence of a PCD-specific ICD code [22]. While our approach used closed claims data and manual feature categorization for a random forest model, future ML models could leverage more powerful algorithms, trained on hundreds of features, including time-series data, to further improve classification accuracy. Future efforts could also explore the use of national electronic medical record (EMR) data to train neural networks. These data more accurately reflect the clinical environments in which such screening tools will be applied, and such models would no longer require feature curation by a clinical audience, which inherently presents challenges in selecting features to include or exclude; they could instead consider the totality of the data [23, 24]. For example, we did not include pulmonary nontuberculous mycobacterial (PNTM) infections (ICD-10-CM: A31.0) in the list of features for machine learning (Table 1). This was an unintentional omission that could be corrected in future work, since isolated PNTM infections are associated with PCD. An EMR-based approach would allow for the inclusion and automatic weighting of a much broader range of clinical variables, including asthma diagnoses, allowing the model to discern patterns without a priori human exclusion based on potentially outdated or evolving clinical paradigms. We believe this will be crucial for developing more robust and adaptable screening tools.
The key challenge when developing ML-based tools for rare diseases is the relatively small number of available patients. Patient-led organizations are making rapid strides towards the development and utilization of research-ready, rare disease patient data for natural history and clinical studies. The PCDF is one such example: in 2020 it established a clinical registry to collect rigorous and detailed diagnostic and phenotypic data on individuals with genetically confirmed PCD through the PCDF Clinical and Research Centers Network. The PCDFR has since expanded from approximately 150 participants at the time of linkage and analysis in this study to more than 600 participants from 36 North American specialty centers accredited in the diagnosis and management of PCD. These efforts are providing crucial infrastructure to drive research partnerships and will be instrumental in the pursuit of improved screening, diagnosis, and care for the PCD community.
As patient organizations and their partners continue to develop comprehensive registries and datasets, there is a profound opportunity to scale ML-based approaches for screening many of the estimated 300 million people worldwide living with rare diseases. Once validated, these tools could be deployed in diverse clinical settings, including in international communities with significant disparities in access to diagnosis and care [25], enabling rapid identification of patients for referral and significantly reducing the time from first clinic visit to diagnostic testing. ML has the potential to transform the diagnostic landscape, bringing timely and accurate diagnoses to those who have long faced a complex diagnostic journey.