Study design and data
The current report adheres to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement for cohort studies [15].
This study is a registry-based retrospective cohort study that was conducted as part of the analytical tasks of the I-CARE4OLD project, a European Union (EU) funded program aimed at improving prognostication in older adults with complex chronic conditions through the use of ML methods [16]. The study was approved by the Finnish Institute for Health and Welfare (THL) (permission no. THL/1118/6.02.00/2021). The data sources were the RAI-LTC (Resident Assessment Instrument for Long-Term Care) based comprehensive geriatric assessments of LTCF residents and the Finnish Care Register for Health Care. Trained assessors, usually registered nurses, collected data using the Minimum Data Set (MDS) 2.0 version of the RAI-LTC instrument. All LTCF residents in Finland are regularly assessed with this instrument at least twice per year, as defined by the Elderly Care Act 980/2012. RAI assessments are delivered to THL twice a year, either by the service providers themselves or by their authorized application providers. The national RAI database is maintained by THL, which is responsible by legislation for keeping social and health records under the Act on the Institute for Health and Welfare (31.10.2008/668). The validity and reliability of the RAI-LTC instrument have been demonstrated in previous studies [17].
Data collected over the years 2014 to 2018 were used in the present study. The RAI-LTC instrument collects information on each resident's demographic, functional, medical, and cognitive status and drug prescriptions. Several scales measuring clinically relevant indicators, such as cognition (Cognitive Performance Scale, CPS) [18], physical function (Activities of Daily Living, ADL) [19], behavior (Aggressive Behavior Scale, ABS) [20], and depression (Depression Rating Scale, DRS) [21], are embedded in the instrument.
Data preprocessing steps and the operational definitions of variables and scales are described in detail in the supplementary material (see Additional file 1) [22,23,24,25,26].
Definition of study groups
Antipsychotic use was identified from the RAI-LTC section dedicated to drug prescriptions using ATC code N05A, excluding N05AN. According to previous estimates from RAI data, the overall prevalence of antipsychotic use among LTCF residents in Finland ranges from 28 to 35%, with atypical antipsychotics being the most frequently prescribed agents and risperidone accounting for the majority of prescriptions, followed by quetiapine and olanzapine [27]. Residents aged 65 years or older were selected for this study. To be included, residents had to have at least four consecutive RAI assessments conducted at 6-month intervals. The period covering assessments 1 and 2 was defined as the baseline period; the follow-up period started from the third assessment. Residents were classified into the discontinuation group if antipsychotic medications were prescribed during the baseline period (assessments 1 and 2) but not during the follow-up period (assessments 3 and 4). Residents were classified into the chronic user group if antipsychotic medications were prescribed during both the baseline and follow-up periods (assessments 1 to 4). The input variables (candidate predictors) for the models were collected at the second assessment of the baseline period. For residents with more than one valid set of four RAI assessments, one set was selected at random. Residents who died during the study period were excluded from the analyses. The definition of the study groups is further described and illustrated in Additional file 1: Fig. S1.
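As a minimal illustration of this classification rule, the sketch below applies the baseline/follow-up logic to hypothetical long-format RAI data; the column names, the flag for ATC N05A (excluding N05AN) prescriptions, and the example values are invented for illustration and are not the study's actual data structure.

```python
import pandas as pd

# Hypothetical long-format data: one row per assessment, with resident_id,
# assessment_no (1-4 within the selected set of consecutive assessments), and
# antipsychotic (True if an ATC N05A drug other than N05AN was prescribed).
rai = pd.DataFrame({
    "resident_id":   [1, 1, 1, 1, 2, 2, 2, 2],
    "assessment_no": [1, 2, 3, 4, 1, 2, 3, 4],
    "antipsychotic": [True, True, False, False, True, True, True, True],
})

def classify_resident(group):
    """Assign one resident to the discontinuation or chronic-user group (or neither)."""
    use = group.sort_values("assessment_no")["antipsychotic"].tolist()
    baseline_use = use[0] and use[1]   # prescribed at assessments 1 and 2
    followup_use = use[2] or use[3]    # any prescription at assessments 3 or 4
    if baseline_use and not followup_use:
        return "discontinuation"
    if baseline_use and use[2] and use[3]:
        return "chronic_user"
    return None                        # eligible for neither study group

study_group = rai.groupby("resident_id").apply(classify_resident).dropna()
# resident 1 -> "discontinuation", resident 2 -> "chronic_user"
```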
Definition of study outcome
The main outcome in this study was hospitalization for any cause. Information on hospitalizations was obtained from the Finnish Care Register for Health Care. Operationally, the outcome was defined as one or more hospitalizations within 360 days of the first follow-up assessment (i.e., the third assessment) and was coded as a binary variable (yes/no).
Individual treatment effect models
A causal ML approach was adopted to assess the effect of antipsychotic discontinuation on the risk of hospitalization. Causal ML models, particularly ITE models, estimate how an intervention would affect outcomes at the individual level. Unlike standard supervised ML models, which predict the risk of an outcome, ITE models aim to estimate the causal effect of a treatment. This makes them well suited for evaluating pharmacological and non-pharmacological interventions in older adults with complex chronic conditions, as they account for patient heterogeneity and enable personalized effect estimates. The ITE when antipsychotic medication is stopped can be represented by the following equation:
$$\tau(x) = E\left[\, Y_{i}(1) - Y_{i}(0) \mid X_{i} = x \,\right] \quad (1)$$
where $Y_{i}(1)$ and $Y_{i}(0)$ are the potential outcomes [28] when the medication is discontinued or continued, respectively, and $X_{i}$ are the covariates of resident $i$. The ITE can be interpreted as an absolute risk reduction (ARR): $\widehat{\tau}<0$ indicates that discontinuing antipsychotic medications reduces the risk of hospitalization, while $\widehat{\tau}>0$ implies an increased risk. Currently, there is no generally accepted standard algorithm for estimating ITEs. Therefore, we used several algorithms (DML, DR-learner, X-learner, and causal forest) and compared their estimates. For training and evaluating the causal ML models, the dataset was split into a training/validation set and a test set. The split was based on an index date (June 1, 2016), which divided the dataset in a ratio of 70% for training/validation (before the index date) and 30% for testing (after the index date). Model parameters were searched and the models were trained on the training/validation set; the trained models were then evaluated on the test set. The workflow of causal ML model training and evaluation, including confounder selection, is illustrated in Fig. 1.
Fig. 1 Workflow of ML model training and evaluation
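As an illustration of how one of these ITE learners can be fitted with EconML, the sketch below sets up a causal forest (CausalForestDML). The variable names (X_train, X_test, t_train, y_train), the choice of nuisance models, and the hyperparameters are illustrative assumptions, not the exact configuration used in the study; the time-based 70/30 split is assumed to have been performed upstream.

```python
from econml.dml import CausalForestDML
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Hypothetical inputs: confounder matrices X_train/X_test, binary treatment t_train
# (1 = antipsychotic discontinued), binary outcome y_train (1 = hospitalized).
cf = CausalForestDML(
    model_y=GradientBoostingRegressor(),   # nuisance model for E[Y | X]
    model_t=GradientBoostingClassifier(),  # nuisance model for E[T | X]
    discrete_treatment=True,
    cv=5,                                  # cross-fitting folds for the nuisance models
    random_state=0,
)
cf.fit(y_train, t_train, X=X_train)

# Estimated individual treatment effects on the test set; because the outcome is
# harmful, tau_hat < 0 means discontinuation is predicted to lower hospitalization risk.
tau_hat = cf.effect(X_test)

# The other learners (econml.dml.LinearDML, econml.dr.DRLearner,
# econml.metalearners.XLearner) can be fitted and compared analogously.
```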
Confounders
From an overall set of 298 variables available in the data source, a subset of potential confounders was selected using both data-driven and knowledge-based approaches. A complete list of the processed variables is included in Additional file 1: Table S1. In the data-driven approach, candidate confounders were screened with univariate logistic regression models trained to predict hospitalization and exposure, and were ranked by the area under the receiver operating characteristic curve (AUROC) for both targets. A group of three study researchers (DF, HF, RL), experts in clinical geriatrics and clinical pharmacy, reviewed the list of potential confounders and added variables that, although not ranked as relevant by the logistic regression models, were considered potential confounders because they were deemed good proxies for unmeasured factors associated with antipsychotic discontinuation and with the probability of being hospitalized. The final list of potential confounders included: age, gender, body mass index (BMI), number of medications, number of comorbidities, cognitive decline (CPS score) [18], functional status (ADLH score) [19], depression (DRS score) [21], presence and severity of behavioral symptoms (ABS score) [20], delirium symptoms, delusions or hallucinations, unsteady gait, acute episode or flare-up of a recurrent or chronic problem or monitoring of an acute medical condition, recent hospital or emergency department visits, chemotherapy or end-stage disease, problems with eating and swallowing, any restraints used, and physician visits in the last 14 days, changed doctor orders, or abnormal laboratory tests. Detailed definitions of the confounders can be found in Additional file 1: Table S2.
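A minimal sketch of the data-driven screening step is shown below. The input names (X, var_names, y_hosp, t_disc) are hypothetical, the AUROC is computed in-sample for brevity, and ranking by the larger of the two AUROCs is a simplification for illustration rather than the study's exact selection rule.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical inputs: X (residents x 298 candidate variables), var_names (column
# names), y_hosp (binary hospitalization), t_disc (binary antipsychotic discontinuation).

def univariate_auroc(x_col, target):
    """AUROC of a univariate logistic regression for one candidate variable."""
    x_col = x_col.reshape(-1, 1)
    model = LogisticRegression(max_iter=1000).fit(x_col, target)
    return roc_auc_score(target, model.predict_proba(x_col)[:, 1])

# Rank every candidate variable by its AUROC against both the outcome and the exposure;
# top-ranked variables were then reviewed together with the knowledge-based additions.
ranking = sorted(
    (
        (name, univariate_auroc(X[:, j], y_hosp), univariate_auroc(X[:, j], t_disc))
        for j, name in enumerate(var_names)
    ),
    key=lambda row: max(row[1], row[2]),
    reverse=True,
)
```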
Model evaluation
A fundamental problem in evaluating causal inference models is that a given individual can never be observed under both the treated and untreated conditions (Eq. 1). Metrics that compare estimates against the true treatment effect can therefore only be computed in simulations where both potential outcomes are known. However, if a model has captured genuine treatment-effect heterogeneity in the data, then model-assisted treatment recommendations should outperform random treatment assignment. In this study, we used the area under the uplift curve (AUUC) [29] and the c-for-benefit [30] metrics to verify this property. Both metrics have been increasingly adopted in the literature to evaluate the discriminative ability of ITE models [31]. Furthermore, we analyzed the distributions of the estimated treatment effects to describe what the models had learned from the data, and conducted a set of sensitivity analyses.
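The sketch below illustrates one common way to compute an uplift curve and its area on the test set. It is an assumption-laden simplification: the inputs are hypothetical (score as the predicted benefit of discontinuation, e.g., the negated ITE since a negative estimate means lower hospitalization risk; y recoded so that 1 = no hospitalization within 360 days), and the exact AUUC variant used in the study may differ.

```python
import numpy as np

def auuc(score, t, y):
    """Area under a simple uplift curve; higher than a random ordering suggests heterogeneity."""
    order = np.argsort(-score)      # target residents with the highest predicted benefit first
    t, y = t[order], y[order]
    n = len(y)
    uplift = np.zeros(n)
    for k in range(1, n + 1):
        treated = t[:k] == 1
        rate_t = y[:k][treated].mean() if treated.any() else 0.0
        rate_c = y[:k][~treated].mean() if (~treated).any() else 0.0
        uplift[k - 1] = (rate_t - rate_c) * k   # cumulative uplift among the top-k residents
    return uplift.mean()                        # unnormalized area under the uplift curve
```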
Model interpretation
For the interpretation of the ITE models, we used SHAP (SHapley Additive exPlanations) values [32, 33], partial dependence plots (PDPs) [34], and surrogate models [35]. PDPs were calculated for the variables with the highest absolute sums of SHAP values. Our surrogate models were decision trees trained to predict the estimates of the trained ITE models, using the confounders as inputs.
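A minimal sketch of the surrogate-model step is shown below. It assumes a trained EconML estimator ite_model, a confounder matrix X_conf, and names conf_names; for simplicity the SHAP values here are computed on the surrogate tree rather than on the ITE model itself, and the tree depth is an illustrative choice rather than the study's setting.

```python
import shap
from sklearn.tree import DecisionTreeRegressor, export_text

# Estimated ITEs for each resident, produced by the trained causal ML model.
tau_hat = ite_model.effect(X_conf)

# Surrogate model: a shallow decision tree trained to reproduce the ITE estimates
# from the confounders, giving a human-readable approximation of the model.
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0)
surrogate.fit(X_conf, tau_hat)
print(export_text(surrogate, feature_names=list(conf_names)))

# SHAP values of the surrogate tree; confounders with the largest contributions can
# then be examined further with partial dependence plots.
explainer = shap.TreeExplainer(surrogate)
shap_values = explainer.shap_values(X_conf)
shap.summary_plot(shap_values, X_conf, feature_names=list(conf_names))
```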
Software
All analyses were performed using Python version 3.9.7 and the following libraries: Scikit-learn package [36] version 1.0.2 for all data processing steps, EconML [37] version 0.14.0 for ITE models, and SHAP [32, 33] version 0.40.0 for model interpretation.