Machine learning and SHAP values explain the association between social determinants of health and post-stroke depression | BMC Public Health

This study aimed to explore the association between SDoH and PSD. By analyzing clinical data from 1,112 stroke patients, we found that SDoH was significantly associated with the prevalence of PSD, and this relationship was validated in multiple subgroup analyses. Additionally, we developed a ML model to further demonstrate the clinical value of SDoH as a significant predictor of PSD occurrence.

The results of this study revealed a significant association between higher SDoH scores and PSD. These findings align with current literature, underscoring the significant influence of social determinants, including socioeconomic position, educational attainment, and housing circumstances, on mental health. For instance, a systematic review has identified a significant association between reduced social networks and post-stroke depression [9], while meta-analyses have shown that social support can reduce the risk of PSD [23]. A cohort study also found that stroke patients with lower educational levels were more likely to develop PSD symptoms [10, 11]. These results suggest that adverse social environments may be associated with an increased occurrence of post-stroke depression through various mechanisms. Poor socioeconomic status, lower educational attainment, limited social support, and poor living conditions may all act as risk factors for mental health challenges in stroke patients during recovery.

Our study found that younger individuals exhibited a higher propensity for developing PSD. Younger patients may experience greater difficulties with social adaptation, including employment constraints, family commitments, and insufficient social support, which could heighten their vulnerability to depressive symptoms during stroke recovery. In contrast, while older patients may encounter more health problems, they may exhibit greater psychological resilience in coping with major illnesses. Additionally, our study revealed a relatively higher proportion of women with PSD; however, this difference was not statistically significant. A separate systematic evaluation also found no consistent relationship between depression and gender [24]. Furthermore, gender did not show a significant interaction in the analyses, as both male and female groups demonstrated a significant relationship between SDoH and PSD.

BMI is commonly used as an indicator of physical health in medical studies, and our analyses also revealed a strong association between BMI and depressive symptoms. This suggests a potential relationship between obesity and mental health. Obesity may not only cause physical discomfort but also exacerbate depressive symptoms by affecting an individual’s self-image and social interactions. A similar study identified a correlation between obesity and post-stroke anxiety [25].

In this study, feature selection with the Boruta algorithm resulted in the exclusion of several known risk factors for PSD, such as race, smoking, and marital status. Although these factors have been widely recognized as risk factors for PSD, they demonstrated weak correlations with other selected features in our model and did not notably enhance predictive accuracy during model training. Specifically, the Boruta algorithm assesses the contribution of each feature to predictive performance using a random forest approach, and features are excluded if their importance is lower than that of randomly generated features. This selection method favors a simpler and more generalizable model, but we understand that this exclusion decision may raise concerns regarding the validity of the model. Therefore, we plan to further investigate the potential role of these factors, especially in different subgroups, in future studies. Additionally, we will explore alternative feature selection methods and consider incorporating additional clinical variables to enhance the model’s predictive power.
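
The shadow-feature comparison described above can be sketched as follows. This is a minimal illustration, not the study's pipeline. It assumes absolute Pearson correlation with the outcome as a stand-in for random-forest importance, runs a single pass, and uses synthetic data; the names `abs_corr` and `shadow_select` are hypothetical.

```python
import random
from statistics import mean


def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no information
    return abs(cov) / (vx * vy) ** 0.5


def shadow_select(features, outcome, seed=0):
    """Keep features whose score exceeds the best shuffled 'shadow' copy."""
    rng = random.Random(seed)
    shadow_scores = []
    for values in features.values():
        shadow = values[:]   # copy the column ...
        rng.shuffle(shadow)  # ... and shuffle it to break any link with the outcome
        shadow_scores.append(abs_corr(shadow, outcome))
    threshold = max(shadow_scores)
    return [name for name, values in features.items()
            if abs_corr(values, outcome) > threshold]


# Synthetic data (assumption): one informative feature and one pure-noise feature.
rng = random.Random(42)
n = 500
signal = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]
outcome = [1 if s + rng.gauss(0, 0.5) > 0 else 0 for s in signal]

selected = shadow_select({"signal": signal, "noise": noise}, outcome)
```

In a single pass a noise feature can occasionally beat the shadows by chance. Boruta proper guards against this by repeating the shuffle over many random-forest runs and applying a statistical test to how often each feature beats the best shadow.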

Among the machine learning models for predicting PSD, the CatBoost model demonstrated the best overall predictive performance, with an AUC of 0.966, significantly outperforming the other models. Compared to traditional logistic regression, CatBoost is better suited to handling complex non-linear relationships, particularly in multivariate medical data [26]. Logistic regression, as a classical statistical model, provides good interpretability when feature relationships are simple. However, with multidimensional, non-linear data, machine learning methods can capture more potential patterns and relationships, thereby improving prediction accuracy [27].
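
For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen patient with PSD receives a higher predicted risk than a randomly chosen patient without it. A minimal sketch under that definition follows; the labels and scores are invented for illustration, and `auc` is a hypothetical helper, not the study's code.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 3 patients with PSD (1) and 4 without (0).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.05]

toy_auc = auc(labels, scores)  # 11 of the 12 pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why a cross-validated value as high as 0.966 invites the overfitting check discussed among the limitations.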

Furthermore, DCA and calibration curves support the potential clinical applicability of the CatBoost model. Although the ‘black-box’ nature of machine learning models is often criticized, this study improved model interpretability by incorporating SHAP values. This allows clinicians to visualize the role of each variable, thereby enhancing the model’s clinical applicability [28].
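
As a reminder of what a decision curve plots, the net benefit at a threshold probability p_t is TP/n − (FP/n) × p_t/(1 − p_t), compared against the "treat everyone" strategy. The sketch below computes both for a single threshold; the data are invented for illustration and the function names are hypothetical.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of intervening on everyone whose predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of intervening on every patient regardless of predicted risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data (assumption): 10 patients, 3 with PSD.
labels = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
probs = [0.9, 0.7, 0.6, 0.2, 0.1, 0.4, 0.3, 0.05, 0.15, 0.25]

nb_model = net_benefit(labels, probs, 0.5)      # 2/10 - (1/10) * 1 = 0.1
nb_all = net_benefit_treat_all(labels, 0.5)     # 0.3 - 0.7 * 1 = -0.4
```

A full decision curve repeats this calculation across a range of thresholds; a model is considered clinically useful where its curve lies above both the treat-all and treat-none (net benefit 0) lines.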

The use of SHAP values enabled a quantitative assessment of the role of individual characteristics in predicting PSD. In the CatBoost model, age, gender, SDoHQ, education, and BMI emerged as the most important predictors. The effects of age and gender are likely linked to known physiological and psychological mechanisms underlying post-stroke depression. The SDoHQ is a comprehensive measure of social health status that emphasizes the correlation between socioeconomic status and mental health [29]. This finding aligns with sociological theory, indicating a strong link between social factors and health outcomes [30].
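
The idea behind SHAP values can be illustrated with exact Shapley values on a tiny model. The sketch below is an assumption-laden toy, not the study's computation: it enumerates every feature coalition (feasible only for a handful of features), replaces absent features with a single reference patient rather than averaging over a background dataset, and uses an invented risk function `toy_risk` with made-up coefficients.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating all coalitions of features."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Features in the coalition take the patient's values; the rest stay at baseline.
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_risk(x):
    """Hypothetical risk score with an interaction term; coefficients are invented."""
    return 2.0 * x["sdoh"] + 1.0 * x["bmi_high"] + 3.0 * x["sdoh"] * x["bmi_high"] - 0.5 * x["older"]


patient = {"sdoh": 1, "bmi_high": 1, "older": 1}
reference = {"sdoh": 0, "bmi_high": 0, "older": 0}

phi = shapley_values(toy_risk, patient, reference)
# The contributions add up to toy_risk(patient) - toy_risk(reference).
```

The interaction between `sdoh` and `bmi_high` is split evenly between the two features, and the contributions sum exactly to the difference between the patient's prediction and the reference prediction. This additivity is what lets per-feature contributions be read as a decomposition of an individual risk estimate.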

The model proposed in this study offers a novel approach to PSD risk assessment in clinical settings. First, by quantifying the relationship between SDoH and PSD, clinicians can more precisely identify individuals at risk for early intervention. For patients with high SDoH scores, particularly those with low income, limited education, and insufficient social support, personalized mental health monitoring and intervention plans can be developed to help mitigate the onset and progression of depression. Second, the clinical application of this model may help reduce inequalities in PSD diagnosis and outcomes. By integrating SDoH factors, the model can elucidate the relationship between socioeconomic status and mental health, thereby aiding in identifying groups that are vulnerable to PSD due to poorer social conditions. Existing tools, such as the PHQ-9 and BDI, typically focus on assessing symptoms of depression, but they often overlook external factors such as social determinants of health. In contrast, our model can integrate these social and economic contextual factors to offer a more holistic risk assessment.

Although this study provides valuable evidence regarding the relationship between SDoH and PSD, several limitations remain. First, given that NHANES is a cross-sectional survey, and although we ascertained that subjects had a history of stroke based on self-reported data and classified depressive status using a PHQ-9 score ≥ 10, we could not definitively determine whether depressive symptoms manifested after the stroke. The diagnosis of new-onset post-stroke depression, in the strictest sense, requires validation through longitudinal cohort studies. Second, the CatBoost model demonstrated strong performance in this study (AUC = 0.966), but we recognize that its validation was restricted to internal ten-fold cross-validation, which may carry a risk of overfitting. Currently, no independent stroke-related datasets are available for external validation; thus, the generalizability of the model requires further evaluation, particularly across diverse populations and national datasets. Third, this study employed a PHQ-9 score ≥ 10 as the diagnostic criterion for PSD, which has been widely used in epidemiologic investigations. However, we acknowledge that some depressive symptoms (e.g., malaise, sleep disturbance) could result from somatic conditions in stroke patients, posing a risk of misclassification. Future studies could improve diagnostic accuracy through clinical interviews or detailed assessment of symptom onset timing. Fourth, this study employed a cross-sectional study design, which enabled us to identify only the correlation between SDoH and PSD, but precluded inferring a causal relationship between the two. Furthermore, although the NHANES database provides a wide range of representative data, there may be differences in socio-cultural backgrounds, healthcare systems, and economic conditions in different countries and regions, and these factors may affect the relationship between SDoH and PSD. 
Therefore, future studies should consider cross-national or cross-cultural samples to validate the generalizability and cross-regional applicability of the model and to ensure that it can be effectively applied to a wider population. Additionally, although this study combined machine learning models and SHAP values, there is room for further improvement, and future research should include additional factors related to PSD, such as a history of depression [31], stroke severity [32], and stroke location [33].
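
The outcome definition used throughout (PHQ-9 total score ≥ 10) can be made concrete with a short sketch. The PHQ-9 has nine items, each scored 0 to 3, for a total of 0 to 27; the function names below are hypothetical, and the example responses are invented.

```python
def phq9_total(item_scores):
    """Sum the nine PHQ-9 item scores after basic validation."""
    if len(item_scores) != 9:
        raise ValueError("PHQ-9 has exactly nine items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each PHQ-9 item is scored 0, 1, 2, or 3")
    return sum(item_scores)


def screens_positive(item_scores, cutoff=10):
    """Apply the cutoff used in this study (total score >= 10)."""
    return phq9_total(item_scores) >= cutoff


# Invented responses for two hypothetical respondents.
respondent_a = [2, 2, 1, 1, 1, 1, 1, 1, 0]  # total 10
respondent_b = [1, 1, 1, 1, 1, 1, 1, 1, 1]  # total 9

a_positive = screens_positive(respondent_a)
b_positive = screens_positive(respondent_b)
```

Because the classification rests on a single summed score, somatic items such as sleep or fatigue contribute to it in the same way as mood items, which is the source of the misclassification risk noted above.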
