We identified 560 papers in MEDLINE and 1087 papers in EMBASE, and one paper was identified outside the formal search strategy. We removed 569 duplicates and excluded 994 papers by title and abstract screening. 84 papers were selected for full-text screening. 25 papers7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31, comprising 41 prognostic models, were eligible for inclusion (see Fig. 1).
PRISMA flowchart of included studies.
Study populations and designs
15 studies (60%) were published since 20157,9,11,14,15,17,18,19,20,23,25,27,28,29,30,31, and one before 20108 (Table 1, Fig. 2). Most studies included European (40%)8,12,13,15,17,19,20,24,25,28, North American (12%)10,23,29, or Australian (12%)11,16,22 populations, or a combination of these (16%)14,18,26,27. 20 studies (80%) were prospective observational cohort studies7,8,9,10,11,13,14,15,16,17,19,20,21,22,23,24,27,28,29,31 and 7 studies (28%) were inception cohort studies14,15,19,20,24,27,28. Models from 7 studies (28%)14,15,19,20,27,28,29 had a defined time-point at which they could be used (i.e. at diagnosis or in early PD) (Table 1). 18 studies (72%)7,8,9,10,11,12,13,16,17,18,21,22,23,24,25,26,30,31 recruited PwP at various disease stages or did not define which PwP were recruited, so we were unable to identify the time-points in the disease course at which the models were designed to be used. However, one model23 was developed in PwP with disease durations ranging from 0 to 30 years and included disease duration as a predictor variable, so it could potentially be used throughout the disease course if adequately validated.

Number of studies and models by years.
Outcomes of study
The most common prognostic outcome was falls/recurrent falls, which was predicted in 11 studies (44%)7,8,9,10,12,13,16,17,19,21,22. 7 studies (28%)12,18,19,23,27,28,31 predicted cognitive impairment/dementia, 4 studies (16%)12,15,25,26 predicted motor complications, 3 studies (12%)11,12,19 predicted freezing of gait, 3 studies (12%) predicted imbalance12,19,30, 2 studies (8%)18,20 predicted functional disability, 2 studies (8%)20,28 predicted a composite poor outcome, and single studies predicted depression14, mortality20, fracture risk24, difficulty doing hobbies19, and several other symptoms and signs12,29. The follow-up duration over which predictions were made ranged from 3 months8 to 12 years20; most models (60%) made predictions over <2 years, and 4 studies18,20,25,28 had 5 or more years' follow-up (Table 1).
Predictors in study
The number of predictors per model ranged from 3 to 998 (Table 1). 17 studies comprising 24 prognostic models (59%) used variables that are simple to collect in clinical practice, but 7 studies comprising 11 prognostic models (27%) included predictors that are not always routinely available in clinical practice, such as DAT imaging measurements, CSF biomarkers, or genetic polymorphism data (supplementary Table 1)13,14,18,23,25,27,31. In one study, 6 models (15%) were based on smartphone features, and the corresponding app/analysis pipelines are not available for routine use in clinical practice19. 8 studies dichotomised or categorised continuous/discrete predictors7,10,12,13,17,22,24,31. Across 24 studies with 35 final models that specified the predictors, the most common predictors were age/age at onset (n = 25), sex (n = 15), and the original or Movement Disorder Society Revision of the UPDRS (n = 12) (supplementary Table 2). In Fig. 3 we show the percentages of predictors included in the models for the two most common outcomes (falls/recurrent falls [13 models] and cognitive impairment/dementia [7 models]). We question the usefulness of previous falls as a predictor of future falls, as was the case in 11 models7,8,9,10,13,17,21,22, because once PwP have started to fall, the fracture risk is already present and physiotherapy interventions for falls and balance are already indicated.
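Dichotomising a continuous predictor discards the variation within each category, which is one reason the practice is generally discouraged in prognostic modelling. A minimal illustration with made-up ages (not taken from any reviewed study):

```python
# Illustrative only: dichotomising age at a 65-year cut-point.
ages = [40.0, 64.9, 65.1, 90.0]
dichotomised = [int(a >= 65) for a in ages]
print(dichotomised)  # [0, 0, 1, 1]
# 40.0 and 64.9 become indistinguishable, while 64.9 and 65.1,
# only 0.2 years apart, fall into different categories.
```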

Proportion of models including the most commonly used predictors (data shown for the two most frequent model outcomes; variables appearing in less than a third of the models are not shown).
Study sample sizes
5 studies (20%) had fewer than 100 participants8,9,13,15,30 (Table 1). Only 4 studies (16%) had an events per variable (EPV) ratio of at least 1010,17,18,20 (Table 1), the usual rule of thumb for the minimum EPV required for Cox or logistic regression modelling32, and many of the other studies had EPVs much lower than 107,8,9,11,13,14,16,19,25,27,28,30,31. 4 studies (16%) did not give information about the number of events18,24,26,29 (Table 1).
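The EPV referred to above is simply the number of outcome events divided by the number of candidate predictors. A minimal sketch with illustrative numbers (not drawn from any of the reviewed studies):

```python
def events_per_variable(n_events: int, n_candidate_predictors: int) -> float:
    """Events per variable: outcome events divided by candidate predictors."""
    return n_events / n_candidate_predictors

# Illustrative numbers only: 60 fallers in a cohort, 12 candidate predictors.
epv = events_per_variable(60, 12)
print(epv)  # 5.0 — below the conventional minimum of 10
```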
Model development
12 studies (48%) did not provide information on the number of participants lost to follow-up9,10,11,12,15,18,20,22,24,25,26,29,31 and 11 studies (44%) did not report the number of participants with missing data9,11,12,15,16,17,21,22,24,26,31 (supplementary Tables 3 and 4). 10 studies (40%) gave full information on missing data (number and imputation method)7,10,13,14,18,23,25,27,28,29. The most common method of handling missing data was complete case analysis (28%)7,10,13,15,18,25,29. 2 studies (8%) handled missing data with multiple imputation14,28 (Table 2). 8 studies (32%) transformed continuous predictors into dichotomous or categorical variables7,10,12,13,17,22,24,31 and 10 studies (40%) selected predictors by univariable analysis7,9,13,14,16,20,22,25,27,31 (supplementary Tables 1 and 5).
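To contrast the two missing-data strategies mentioned above, here is a toy sketch of complete-case analysis versus single mean imputation (a deliberate simplification: multiple imputation instead draws several plausible values per gap and pools estimates across the imputed datasets). All values are invented:

```python
import statistics

# Toy cohort: a continuous predictor with missing values (None).
scores = [22, 35, None, 41, 18, None, 30]

# Complete-case analysis: drop every participant with a missing value,
# shrinking the sample from 7 to 5.
complete = [v for v in scores if v is not None]
mean_cc = statistics.mean(complete)

# Single mean imputation: fill each gap with the observed mean
# (keeps the sample size but artificially shrinks the variance).
imputed = [v if v is not None else mean_cc for v in scores]
mean_imp = statistics.mean(imputed)

print(len(complete), round(mean_cc, 1), round(mean_imp, 1))
```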
12 studies (48%) used logistic regression8,9,10,11,13,14,16,17,21,22,27,28 and 3 studies (12%) used machine learning (decision trees, XGBoost, and random forests) to build the prognostic models12,14,19. None of the machine learning models reported key predictor importance (e.g., SHAP values) or provided sufficient details for independent validation. 8 studies (32%) did not account for censoring and simply excluded censored participants from the analysis8,13,14,16,17,21,27,28. 10 studies (40%) used time-to-event survival analysis to build the prognostic models: 6 studies used Cox regression7,15,24,25,26,31, and other studies used a frailty Cox model18,23, a Weibull parametric survival model20, and a dynamic prediction model29 (Table 2). Three studies reported checking the proportional hazards assumption in survival analysis7,18,20 (Table 2 and supplementary Table 5).
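Excluding censored participants, as 8 of the reviewed studies did, discards their event-free follow-up time, whereas time-to-event methods retain it. A toy sketch of the Kaplan-Meier product-limit estimate (illustrative data only, not from any reviewed study):

```python
# Toy time-to-event data: (years of follow-up, event observed?).
# Participants with event=False are censored: they left follow-up event-free,
# and excluding them entirely would overstate the event rate.
cohort = [(1, True), (2, False), (3, True), (4, False), (5, True), (6, False)]

# Kaplan-Meier product-limit estimate of event-free survival: at each event
# time, multiply by the fraction of those still at risk who do NOT have the
# event; censored participants simply leave the risk set without an event.
at_risk = len(cohort)
surv = 1.0
for t, event in sorted(cohort):
    if event:
        surv *= (at_risk - 1) / at_risk
    at_risk -= 1

print(round(surv, 4))  # estimated probability of remaining event-free at 6 years
```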
Model evaluation and performance
Two studies10,17 that aimed to externally validate previously published models did not use the original model equation to make predictions for PwP in their validation dataset3. Therefore, these 2 studies10,17 were not truly external validation studies. We classed these studies as model development in the PROBAST assessment (Tables 1 and 3).
Internal validation and model equation assessment only apply to model development studies (n = 24) (Table 1). 7 studies (28%) did not perform internal validation8,9,10,11,17,21,26, 7 studies (28%) did not clearly state whether internal validation had been applied in all model development procedures13,14,16,23,24,29,31, and 3 studies (12%) used split-data methods14,27,30 (supplementary Table 6). 15 studies (60%) used cross-validation or bootstrap resampling to assess optimism in model performance7,12,13,16,18,19,20,22,23,24,25,27,28,29,31 (supplementary Table 6). Only 3 studies (12%) performed both internal and external validation after model development18,20,28 (supplementary Table 6). One study18 did not give the number of events in the development and validation datasets (Table 1).
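Bootstrap assessment of optimism, as used by several of the studies above, refits the model on resampled data and measures how much apparent performance exceeds performance on the original data. A self-contained toy sketch (synthetic data, and a deliberately crude threshold "model" standing in for a real regression fit):

```python
import random

random.seed(42)

# Synthetic cohort: 50 fallers and 50 non-fallers, one continuous predictor
# shifted upwards in fallers (all numbers invented for illustration).
y = [i % 2 for i in range(100)]
x = [random.gauss(1.0 if yi else 0.0, 1.0) for yi in y]
data = list(zip(x, y))

def fit(sample):
    """Crude stand-in for a regression fit: the cut-point is the midpoint
    between the two group means of the predictor."""
    n1 = sum(yi for _, yi in sample)
    m1 = sum(xi for xi, yi in sample if yi) / n1
    m0 = sum(xi for xi, yi in sample if not yi) / (len(sample) - n1)
    return (m0 + m1) / 2

def accuracy(cut, sample):
    return sum((xi > cut) == bool(yi) for xi, yi in sample) / len(sample)

apparent = accuracy(fit(data), data)

# Optimism = average over bootstrap resamples of (performance on the resample
# the model was fitted to) minus (performance of that model on original data).
B = 200
optimism = 0.0
for _ in range(B):
    boot = [random.choice(data) for _ in data]
    cut = fit(boot)
    optimism += accuracy(cut, boot) - accuracy(cut, data)
optimism /= B

corrected = apparent - optimism
print(round(apparent, 3), round(corrected, 3))
```

The optimism-corrected estimate is the honest one to report; the apparent estimate rewards the model for fitting noise in its own development data.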
3 studies (12%) did not evaluate model performance8,12,21 (supplementary Table 7). 12 studies (48%) reported internal discrimination performance but did not report calibration performance7,13,16,18,19,23,24,25,26,29,30,31, and one external validation study15 reported discrimination performance without reporting calibration (Table 2). 6 studies (24%) used the Hosmer-Lemeshow goodness-of-fit test to assess internal calibration performance9,10,11,17,22,27 (supplementary Table 7). One study (4%) used both calibration plots and calibration slopes to present internal and external calibration performance28, one study (4%) used calibration plots to present internal and external calibration performance20, and one study (4%) used a calibration plot to present internal calibration performance only14 (supplementary Table 7).
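The Hosmer-Lemeshow test groups participants by predicted risk and compares observed with expected event counts in each group. A minimal sketch with made-up predictions and outcomes (illustrative only); the resulting statistic is compared against a chi-square distribution with g − 2 degrees of freedom:

```python
# Toy predicted risks and observed binary outcomes, illustrative only.
preds = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.65, 0.80, 0.90]
obs   = [0,    0,    0,    1,    0,    1,    0,    1,    1,    1]

# Sort by predicted risk, split into g equal-sized groups, then sum
# (O - E)^2 / (E * (1 - E/n)) over the groups, where E is the expected
# event count (sum of predicted risks) and O the observed event count.
g = 5
pairs = sorted(zip(preds, obs))
size = len(pairs) // g  # 2 participants per group here
hl = 0.0
for i in range(g):
    group = pairs[i * size:(i + 1) * size]
    n = len(group)
    e = sum(p for p, _ in group)  # expected events
    o = sum(y for _, y in group)  # observed events
    hl += (o - e) ** 2 / (e * (1 - e / n))

print(round(hl, 2))  # compare to chi-square with g - 2 degrees of freedom
```

A small statistic (large p-value) indicates no detectable miscalibration, though the test is known to be sensitive to the choice of g and to sample size, which is why calibration plots and slopes are usually preferred.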
Model reporting
9 studies (36%) including 13 models (32%) gave sufficient information for the models to be used in clinical practice11,14,18,20,26,27,28,29,30 (Table 2). 10 studies (40%) did not report the intercept or baseline hazard7,8,9,10,13,17,18,22,25,31. 5 studies (20%) did not provide the model equation or sufficient details to replicate the model12,19,21,23,24 and one study provided a plot of estimated coefficients instead of giving specific values16.
Risk of bias/applicability
We found 8 studies (32%) with inclusion and exclusion criteria that would be broadly generalisable to unselected populations with PD14,15,18,19,20,24,27,28 (supplementary Table 8); these had low concern of applicability (supplementary Table 9). 16 studies (64%) lacked details of important aspects of study design (e.g. recruitment methods/dates, diagnostic criteria)7,8,10,11,12,13,16,17,21,22,23,25,26,29,30,31, and 7 studies (28%) had selection criteria that could bias the studies towards healthier participants (e.g., excluding on the basis of comorbidities or older age), raising concerns about generalisability or risk of bias7,8,9,16,17,30,31 (supplementary Tables 8, 9 and 10).
Supplementary Table 11 contains the risk of bias results relating to the predictors studied. One study (4%) had risk of bias in the predictors as it used a retrospective cohort without stating how subjective predictors (e.g., depression, olfactory dysfunction) were measured25. 7 studies (28%) included predictors that may not be routinely available in clinical practice, such as CSF biomarkers or imaging data13,14,18,23,25,27,31, so these models may not be feasible in clinical practice, especially in resource-poor settings.
For the risk of bias relating to the outcomes in studies, one study (4%) had unclear risk of bias as it did not state the outcome definition12 (supplementary Tables 12 and 13). Outcome definitions in 2 studies (8%) may have been biased by determination with knowledge of predictor information, as the outcome definitions were subjective19,25 (supplementary Tables 12 and 13).