Participants’ demographic characteristics
Among the 1,572 participants included in this survey, 852 were female (54.2%) and 720 were male (45.8%), a slightly higher proportion of females. Most participants were aged 30 years or younger (1,266 individuals, 80.5%), and most held a bachelor’s degree or higher (1,303 individuals, 82.9%). In terms of occupation, 643 participants (40.9%) were engaged in mental or physical labor, 591 (37.6%) were students, and 338 (21.5%) were unemployed or had other occupations; students and mental laborers thus made up the majority of participants. A total of 267 participants (17.0%) reported a history of previous illness, most commonly GI diseases, which affected 98 individuals (6.2%). Additionally, 299 participants (19.0%) were smokers, and 96 (6.1%) reported regular alcohol consumption. Refer to Table 1 for specific details.
Comparative analysis between variables and the risk factors of OCD
In this survey, 478 individuals (30.4%) had OCI-R composite scores > 27, indicating a high risk of OCD. Among demographic characteristics, a higher proportion of males than females fell in the high-risk group (χ² = 8.230, P < 0.005), suggesting that gender may play a role in the development of OCD. Regarding educational background, respondents with a bachelor’s degree were more prevalent in the high-risk group (χ² = 11.415, P < 0.005), indicating a possible association between educational attainment and OCD risk. Furthermore, individuals with a previous medical history had a significantly higher risk of OCD than those without (χ² = 76.275, P < 0.001). No statistically significant differences were observed with respect to age or occupation. In terms of lifestyle factors, smokers were more prevalent in the high-risk group (χ² = 48.954, P < 0.001), indicating that smoking may increase the risk of OCD. Similarly, respondents who consumed alcohol regularly had a higher risk of OCD (χ² = 17.335, P < 0.001). Sleep quality was significantly poorer in the high-risk group (χ² = 74.856, P < 0.001). Significant differences were also found in GI symptoms, frequency of GI symptoms, abnormal bowel movements, and frequency of abnormal bowel movements (all P < 0.05), whereas no significant differences were observed in early awakening sleep disorder or bowel movement frequency. Regarding dietary habits, significant differences among OCD risk groups were found in regular diet, daily meal times, eating speed, picky eating, overeating, food allergy, and preferred food temperature (all P < 0.05). Significant differences were also observed in the consumption frequency of fresh fruits, fresh vegetables, pickled vegetables, meat, haslet, processed meat, fried foods, leftover foods, and beverages (all P < 0.05). However, no significant differences were noted in the consumption frequency of whole grains, poultry, seafood, nuts, tubers, eggs, dairy products, or legumes. Refer to Table 1 for detailed information.
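The group comparisons above rest on Pearson chi-square tests of independence. As a minimal sketch, assuming a 2 × 2 table (for example, sex by risk group) with one degree of freedom and no continuity correction, the statistic and its P value can be computed as follows. The function names and the cell counts are hypothetical placeholders, not values from Table 1; only the statistic 8.230 is taken from the text.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical counts (rows: two groups, columns: high risk / low risk).
stat = chi_square_2x2(10, 20, 20, 10)
p_hypothetical = p_value_df1(stat)

# The statistic reported for sex, ASSUMING it has 1 degree of freedom.
p_sex = p_value_df1(8.230)
```

Under the one-degree-of-freedom assumption, `p_sex` falls below 0.005, in line with the threshold reported for sex. Variables with more than two categories would need a general chi-square distribution function rather than this shortcut.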
Assessment of multicollinearity among independent variables
Multicollinearity among the independent variables was then assessed. A Tolerance value < 0.100 or a VIF > 10.000 was taken to indicate multicollinearity. All variables had Tolerance values > 0.100 and VIF values < 10.000, suggesting that multicollinearity did not materially affect model estimation. Refer to Supplemental Table 1 for detailed information.
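For reference, the VIF of a predictor equals 1 / (1 − R²), where R² comes from regressing that predictor on all the others, and Tolerance is its reciprocal. A minimal sketch, assuming numeric predictors, uses the fact that the VIFs are the diagonal entries of the inverse correlation matrix. The helper names and the toy columns are hypothetical; the source does not state which software produced Supplemental Table 1.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif_and_tolerance(columns):
    """Return {name: (VIF, Tolerance)} for a dict of predictor columns."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: (inverse[i][i], 1.0 / inverse[i][i])
            for i, name in enumerate(names)}

# Hypothetical toy predictors, for illustration only.
toy = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "x3": [1, 0, 0, 1, 1, 0],
}
diagnostics = vif_and_tolerance(toy)
flagged = [name for name, (vif, tol) in diagnostics.items()
           if vif > 10.0 or tol < 0.1]
```

The `flagged` list applies the same cut-offs as the text (VIF > 10 or Tolerance < 0.1).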
Variable selection via Lasso regression
Lasso regression was applied to screen the initially included variables. Fig. 1A and B depict the relationship between the coefficients of the 63 variables and the Lambda values. The vertical dashed line in both panels marks the optimal Lambda value (0.00947), selected via 10-fold cross-validation. In Fig. 1B, variables positively and negatively associated with the risk of OCD are shown in red and blue, respectively, while variables whose coefficients were shrunk to zero are shown in gray. These gray variables were considered non-informative and were excluded from subsequent analyses. The results identified 26 variables positively associated with the risk of OCD: frequency of abnormal defecation, burning sensation, overeating, dreaminess, light sleep, sour regurgitation, alcohol drinking, fruits, diarrhea, picky eating, other-colored stools, vegetables, restless sleep, insomnia, cardiopulmonary diseases, frequency of GI symptoms, abnormal defecation, GI symptoms, animal meat, sleep disorders, smoking, medical history, coronary heart diseases, red or black stools, nausea and vomiting, and GI diseases. A further 10 variables were negatively associated with OCD risk: age, daily meal times, haslet, defecation frequency, regular diet, fried food, pickled vegetables, beverages, stomachache, and diabetes. Refer to Fig. 1C for detailed information.
Fig. 1. Lasso regression feature selection. (A–B) Relationship between Lasso regression coefficients and lambda values for OCD risk factors. (C) Lasso regression coefficients of risk factors at the optimal lambda value.
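The shrinkage of some coefficients exactly to zero comes from the soft-thresholding step of the Lasso. The sketch below illustrates this with coordinate descent on a squared-error loss, assuming centred predictors and no intercept. This is a simplification for illustration: the source does not state the software or loss function used, and a binary outcome would normally call for a penalized logistic model. The function names and toy data are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1 / 2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # covariance of column j with the partial residual
            norm = 0.0  # sum of squares of column j
            for i in range(n):
                fitted_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fitted_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta

# Hypothetical toy data: y depends strongly on column 0 and weakly on column 1.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2.1, 1.9, -1.9, -2.1]

beta_unpenalized = lasso_coordinate_descent(X, y, lam=0.0)
beta_lasso = lasso_coordinate_descent(X, y, lam=0.5)
```

With `lam = 0.5` the weak predictor's coefficient is set exactly to zero, while the strong one is shrunk from 2.0 to 1.5. In practice, the penalty would be chosen by cross-validation, as described in the text.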
Logistic analysis of risk factors for OCD
The 36 variables selected through Lasso regression were entered into a logistic regression model for further analysis. Model calibration was first assessed using the Hosmer-Lemeshow goodness-of-fit test, which yielded χ² = 17.335 and P = 0.745, indicating good model fit with no significant difference between the predicted and observed values. The model’s discriminative performance was then evaluated using the AUC, which was 0.759 (0.689–0.788). As the AUC exceeded 0.75, the model demonstrated acceptable discriminative ability. Refer to Table 2 for detailed information.
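The AUC can be read as the probability that a randomly chosen high-risk participant receives a higher predicted probability than a randomly chosen low-risk participant, with ties counted as one half. A minimal sketch of that pairwise definition follows. The labels and scores are hypothetical and are not model outputs from this study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties contribute half a concordant pair
    return wins / (len(positives) * len(negatives))

# Hypothetical example: 1 = high risk, 0 = low risk.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
example_auc = auc(labels, scores)
```

Here 8 of the 9 positive-negative pairs are ranked correctly, so `example_auc` is 8/9. The pairwise loop is quadratic in sample size; a rank-based formula would be preferable for large datasets.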
With other variables held constant, individuals with a previous medical history had 70.9% higher odds of high OCD risk than those without (OR = 1.709, 95% CI: 1.064–2.738). In contrast, individuals with a history of diabetes had 77.4% lower odds than those without (OR = 0.226, 95% CI: 0.096–0.512). Individuals with a history of sleep disorders had 46.0% higher odds than those without (OR = 1.460, 95% CI: 1.005–2.137), and those with insomnia had 60.7% higher odds than those without (OR = 1.607, 95% CI: 1.015–2.137). Individuals with GI symptoms had 56.2% higher odds than those without (OR = 1.562, 95% CI: 1.057–2.311). Participants reporting nausea and vomiting had 75.1% higher odds than those without such symptoms (OR = 1.751, 95% CI: 1.162–2.643), and those with red or black stools had 50.3% higher odds than those without (OR = 1.503, 95% CI: 1.142–1.978). Picky eaters had 41.4% higher odds than non-picky eaters (OR = 1.414, 95% CI: 1.037–1.925). With respect to meat consumption, those who occasionally or rarely ate meat had 81.8% and 102.9% higher odds, respectively, than those who frequently consumed meat (OR = 1.818, 95% CI: 1.401–2.360; OR = 2.029, 95% CI: 1.238–3.327). Conversely, individuals who occasionally or rarely drank beverages had 29.1% and 41.0% lower odds, respectively, than those who frequently consumed beverages (OR = 0.709, 95% CI: 0.511–0.983; OR = 0.590, 95% CI: 0.403–0.865).
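The percentages quoted above follow directly from the odds ratios: the percent change in odds is (OR − 1) × 100. The sketch below shows that conversion, together with the usual way an OR and its Wald 95% CI are obtained from a logistic regression coefficient and its standard error. The function names are hypothetical, and the coefficient and standard error in the last line are illustrative values, not estimates from Table 2.

```python
import math

def percent_change_in_odds(odds_ratio):
    """Percent change in odds implied by an odds ratio (negative means lower odds)."""
    return (odds_ratio - 1.0) * 100.0

def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio and Wald confidence interval from a log-odds coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Odds ratios taken from the text.
medical_history = percent_change_in_odds(1.709)  # about +70.9%
diabetes = percent_change_in_odds(0.226)         # about -77.4%
rare_meat = percent_change_in_odds(2.029)        # about +102.9%

# Illustrative coefficient and standard error (hypothetical values).
or_value, ci_low, ci_high = odds_ratio_with_ci(beta=0.5, se=0.2)
```

Because the interval is built on the log scale, it is symmetric around the OR only after taking logarithms, so the OR is the geometric mean of the two bounds.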
The associations between risk factors and OCD
The 36 variables retained by Lasso regression were used as input features for three machine learning models: an SVM model, an RF model, and a BP Neural Network model. The outcome variable was the risk of OCD. The dataset was randomly split into training and testing sets at a 7:3 ratio, giving 1,100 participants in the training set and 472 in the testing set. The performance of the three models was first compared using the ROC curves presented in Fig. 2. Based on the AUC and several other stability metrics (see Table 3 for details), the RF model demonstrated the best overall performance of the three. In addition, 10-fold cross-validation was conducted to evaluate the generalizability and stability of the models. Of the three models, the RF model showed the best generalizability and stability, with AUC values ranging from 71% (95% CI: 69%–74%) to 75% (95% CI: 72%–78%). Refer to Supplemental Fig. 1A-C for more details.

Fig. 2. ROC curves of the three machine learning models.
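The 7:3 split and the 10-fold cross-validation described above can be sketched as index bookkeeping. This is a minimal illustration that assumes a simple random (non-stratified) split and cross-validation within the training set. The source specifies neither choice, and the seed and function names are hypothetical.

```python
import random

def train_test_split_indices(n, train_fraction=0.7, seed=0):
    """Shuffle the indices 0..n-1 and split them into training and testing sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_fraction)
    return indices[:n_train], indices[n_train:]

def k_fold_indices(indices, k=10):
    """Yield (training, validation) index lists for each of k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# 1,572 participants split 7:3, as reported in the text.
train_idx, test_idx = train_test_split_indices(1572)

# Ten folds over the training set (an assumption about where CV was applied).
folds = list(k_fold_indices(train_idx, k=10))
```

Rounding 1,572 × 0.7 gives 1,100 training and 472 testing participants, matching the reported set sizes. Each model would then be fitted on nine folds and evaluated on the tenth, in rotation.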
In terms of feature importance, the top 20 most influential factors associated with the risk of OCD in the SVM model, ranked from highest to lowest importance, were: smoking, medical history, overeating, defecation time, frequency of abnormal defecation, daily meal times, abnormal defecation, restless sleep, age, burning sensation, sleep disorders, dreaminess, regular diet, diabetes, frequency of GI symptoms, light sleep, pickled vegetables, diarrhea, fruits, and other-colored stools (Fig. 3A, Supplemental Table 2).

Fig. 3. The normalized importance of predictor variables for the Support Vector Machine model (A), Random Forest model (B), and BP Neural Network model (C).
Similarly, in the RF model, the top 20 most influential factors associated with the risk of OCD, ranked from highest to lowest importance, were: abnormal defecation, coronary heart diseases, cardiopulmonary diseases, burning sensation, sour regurgitation, diabetes, restless sleep, stomachache, age, other-colored stools, diarrhea, GI diseases, red or black stools, insomnia, overeating, nausea and vomiting, defecation times, haslet, smoking, and medical history (Fig. 3B, Supplemental Table 3).
In the BP Neural Network model, the top 20 most influential factors associated with the risk of OCD, ranked from highest to lowest importance, were: animal meat, beverages, particular about food, overeating, frequency of abnormal defecation, regular diet, restless sleep, fruits, other-colored stools, sour regurgitation, nausea and vomiting, insomnia, daily meal times, red or black stools, vegetables, sleep disorders, diarrhea, medical history, fried food, and pickled vegetables (Fig. 3C, Supplemental Table 4).
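The rankings above are based on normalized importance. One common convention, assumed here because the source does not define the normalization, expresses each variable's raw importance as a percentage of the largest value and then sorts in descending order. The raw values and function names below are hypothetical.

```python
def normalized_importance(raw):
    """Scale raw importances so that the most important variable equals 100."""
    largest = max(raw.values())
    return {name: 100.0 * value / largest for name, value in raw.items()}

def top_k(importance, k=20):
    """Return the k variable names with the highest importance, in descending order."""
    return sorted(importance, key=importance.get, reverse=True)[:k]

# Hypothetical raw importances for four variables (not values from the study).
raw = {"age": 2.0, "smoking": 8.0, "overeating": 4.0, "medical history": 6.0}

scaled = normalized_importance(raw)
ranking = top_k(scaled, k=3)
```

In this toy example, `ranking` lists smoking, medical history, and overeating, with scaled values of 100, 75, and 50. Normalizing within each model makes the three panels of Fig. 3 comparable in scale, although the underlying importance measures still differ between model types.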