Study selection
The search returned 543 records from the Embase database and 344 from the MEDLINE database (Fig. 1); 611 records remained for title and abstract screening after removing duplicates. Following this, 92 articles were excluded for reasons detailed in Fig. 1, leaving 519 records for full-text screening. The full text could not be retrieved for one article, so this study was excluded. The main reasons for exclusion at the full-text screening stage were that QoL was not collected longitudinally (n = 153), no formal statistical analysis was performed (n = 36), or QoL was not treated as the outcome measure (n = 14). A total of 271 articles met the eligibility criteria and were included in the review. Owing to limited resources, a pragmatic decision was made to initially select 15 articles to be second screened by three reviewers (including two reviewers screening the same five articles); discrepancies were found in two instances and were resolved, with the second reviewer agreeing with the original decision. Similarly, data were extracted from a further 15 articles by two reviewers (including both reviewers extracting the same five articles); discrepancies were found in < 6% of data points collected (one question = one data point). Discrepancies resulted from unclear reporting or misreading of articles rather than from a difference in interpretation of the methods used. Again, after discussion, the second reviewer agreed with the original extraction. Given the level of agreement observed, and that the reviewers agreed with the original screening/extraction in every case, the team felt there was no need for further double screening or extraction.
Fig. 1 Flow diagram of study selection. *Including one instance in which the full-text article could not be obtained and so the record was excluded
Fifty-six records appeared in the initial search but were not identified in the final search because the database records had been updated in the interim period: thirty-five of these records had neither the QLQ-C30 nor the QLQ-LC13 questionnaire indexed or mentioned in the title or abstract; nine did not have the wording/indexing required to identify them as an RCT or observational study; seven were classified as conference abstracts; four had the date of publication updated to 2023; and one record was removed from both the MEDLINE and Embase databases.
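The record counts reported above for the screening stages can be cross-checked with simple arithmetic. The short Python sketch below uses only the numbers stated in the text; the variable names are illustrative, and the derived quantities (such as the number of duplicates removed) are implied by the reported counts rather than stated directly.

```python
# Counts as reported in the text and in Fig. 1.
embase_records = 543
medline_records = 344
after_deduplication = 611
excluded_title_abstract = 92
included_in_review = 271

# Derived quantities (implied by the reported counts, not stated directly).
records_identified = embase_records + medline_records
duplicates_removed = records_identified - after_deduplication
full_text_screened = after_deduplication - excluded_title_abstract
not_included_after_full_text = full_text_screened - included_in_review

# The three main full-text exclusion reasons named in the text.
main_reasons = {
    "QoL not collected longitudinally": 153,
    "no formal statistical analysis": 36,
    "QoL not treated as the outcome": 14,
}
# Full-text exclusions not covered by the three main reasons.
other_exclusions = not_included_after_full_text - sum(main_reasons.values())

print(records_identified, duplicates_removed, full_text_screened)
print(not_included_after_full_text, other_exclusions)
```

The value of `full_text_screened` reproduces the 519 records reported as going forward to full-text screening.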
Study characteristics
The characteristics of the eligible studies included in the review are presented in Table 1. The majority of included studies were parallel group RCTs (161/271, 59%), followed by cohort studies (84/271, 31%). The remaining studies were other RCT designs, single arm trials, or used individual patient data from multiple RCTs, multiple cohort studies, or combined data from cohorts and RCTs. The median number of patients included in each study was 201 (IQR 90 to 508). Quality of life was assessed at baseline in 96% (261/271) of included studies, and the median number of post-baseline/randomisation assessments was 4 (IQR 3 to 6). Most studies used just the QLQ-C30 questionnaire, none used just the QLQ-LC13 questionnaire, and 16 studies used both. In all cases the analysis method for the two questionnaires was the same. Therefore, without loss of generality, the results that follow refer to the QLQ-C30 questionnaire only. Very few studies specified a primary outcome that was a QoL score derived from the QLQ-C30 questionnaire (34/271, 13%). Of the 271 studies included in the review, 131 (48%) defined an MCID; the majority of these stated an MCID that related to within-patient change in QoL score (84/131, 64%), 28% (37/131) defined an MCID representing a between-group difference, and 8% (10/131) specified MCIDs for both within-patient changes and between-group differences.
Synthesis of results
Just over half of studies (138/271, 51%) analysed all scores derived from the QLQ-C30 questionnaire, 28% (76/271) selected a subset of scores and 4% (11/271) only analysed summary scores (Table 2). Around three-quarters of studies (207/271, 76%) did not use any methods to account for missing questionnaires or state the assumed missing data mechanism. Only 23 studies (8%) explicitly used a method to account for missing data due to death, such as joint modelling of the QoL score and survival or defining death as an event in a time to event (TTE) analysis. The remaining studies treated data truncated by death in the same way as other missing data.
Overall, the most frequently used statistical model was a linear mixed effects model, with 45% (121/271) of included studies applying this approach in at least one analysis (Fig. 2). Following this, the models/methods used most often were TTE analyses (54/271, 20%), t-tests (44/271, 16%) and Mann–Whitney U/Wilcoxon rank-sum tests (38/271, 14%). Thirteen studies used unweighted GEEs (5%), nine used constrained longitudinal data analysis (3%), three used growth mixture models (1%), three used PMM (1%) and three used joint longitudinal survival models (1%). Just over two-thirds (189/271, 70%) of studies applied at least one longitudinal analysis method. Thirty-eight studies performed both longitudinal and cross-sectional analyses. Multiple cross-sectional analyses were undertaken in 98 studies (36%), and of these, adjustment for multiplicity was made in 10% (10/98). The most common method of multiplicity adjustment was the Bonferroni correction. Of the 175 studies that applied at least one longitudinal analysis model (excluding TTE analyses), 40 (23%) estimated treatment effects at multiple time points regardless of the statistical significance of the time × treatment interaction; three (3/175, 2%) estimated treatment effects at multiple time points only if the time × treatment interaction was statistically significant (Table 2).
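As a minimal sketch of the Bonferroni correction mentioned above, each of m p-values from separate cross-sectional comparisons is tested against alpha/m rather than alpha. The p-values, the time-point labels in the comments, and the function name `bonferroni_reject` are hypothetical and are not taken from any reviewed study.

```python
from typing import List


def bonferroni_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether it is significant after Bonferroni correction."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # per-comparison significance level
    return [p <= threshold for p in p_values]


# Hypothetical p-values from between-group tests at five assessment time points.
p_values = [0.004, 0.03, 0.2, 0.011, 0.6]
unadjusted = [p <= 0.05 for p in p_values]
adjusted = bonferroni_reject(p_values, alpha=0.05)

print(sum(unadjusted), "significant without adjustment")
print(sum(adjusted), "significant with Bonferroni adjustment")
```

With five comparisons the per-comparison threshold is 0.05/5, so fewer of these illustrative comparisons remain significant after adjustment than before it.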

Fig. 2 Analysis methods used in included studies. Methods are presented if used in > 1% of studies reviewed. LMM = linear mixed model, CLDA = constrained longitudinal data analysis, PMM = pattern mixture model, GEE = generalised estimating equations, JLSM = joint longitudinal survival model, GMM = growth mixture modelling
Table 2 summarises analysis methods by study type (RCT vs other). One hundred and seventy-two studies were classified as RCTs and 99 were classified as other. Linear mixed effect models were used slightly more frequently in RCTs (81/172, 47%) than in other studies (40/99, 40%). Similarly, TTE analyses, such as time to deterioration in QoL score by the MCID, were more likely to be performed in RCTs than in other studies: 28% (49/172) of RCTs included in the review used this method, compared with 5% (5/99) of other studies. The use of the other commonly applied statistical methods (e.g., t-tests and Mann–Whitney U/Wilcoxon rank-sum tests) was comparable between RCTs and other studies.
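To illustrate how a time-to-deterioration outcome of the kind mentioned above might be derived for one patient, the sketch below takes the event time as the first post-baseline assessment at which the score has worsened from baseline by at least the MCID, and censors at the last assessment otherwise. This is one possible definition under stated assumptions, not the definition used by any particular reviewed study: the 10-point MCID, the function name and the example scores are hypothetical, and handling of death or missing assessments is deliberately left out.

```python
from typing import List, Optional, Tuple


def time_to_deterioration(
    baseline: float,
    follow_up: List[Tuple[float, float]],
    mcid: float = 10.0,            # hypothetical threshold, for illustration only
    higher_is_better: bool = True,  # True for functioning scales, False for symptom scales
) -> Tuple[Optional[float], bool]:
    """Return (time, event) for one patient.

    follow_up holds (assessment time, score) pairs. event is True if the score
    worsened from baseline by at least `mcid`; otherwise the patient is censored
    at the last assessment (time is None if there are no follow-up assessments).
    """
    sign = 1.0 if higher_is_better else -1.0
    last_time: Optional[float] = None
    for time, score in sorted(follow_up):
        worsening = sign * (baseline - score)
        if worsening >= mcid:
            return time, True
        last_time = time
    return last_time, False


# Hypothetical functioning-scale scores at weeks 6, 12 and 18.
result = time_to_deterioration(66.7, [(6, 66.7), (12, 58.3), (18, 50.0)])
print(result)
```

The resulting (time, event) pairs are the usual input to a log-rank test or Cox proportional hazards model.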
Covariates or confounding variables, such as the baseline QoL score, were accounted for in at least one QoL analysis in 47% (75/161) of RCTs and 53% (50/94) of other studies (Table 2). Effect sizes and 95% CIs or standard errors (SEs) from at least one statistical model were presented in 56% of studies included in the review (99/172, 58% of RCTs vs 53/99, 54% of other studies). Overall, 40% of RCTs and other studies presented effect sizes and a measure of precision from all statistical models; 44% of studies did not present this information for any statistical analysis presented, either because the output from the chosen analysis did not include this information, or the authors chose not to present it.
Further details about the analyses performed are presented in Table 3. The majority of studies (248/270, 92%) analysed or used data from all time points collected, either in one or separate analyses. Most studies used the QoL score as the outcome measure (216/271, 80%), followed by change from baseline (56/271, 21%). Of those applying a longitudinal method of analysis (excluding TTE analyses), the majority fitted time as a categorical variable as recommended by the SISAQOL consortium (150/168, 89%). Of the 54 using TTE analyses, 16 (30%) used Cox proportional hazards models to estimate the difference between groups, compared to 32 (59%) using the log-rank test.
In relation to the specific challenges faced when analysing QLQ-C30, 165/267 (62%) of studies analysed a single item symptom score (a score that takes only four possible values, Table 3). Of these, 21% (33/158) used a method of analysis that accounted for the ordinal nature of the single item score; the remainder treated the score in the same way as other functioning or symptom scales analysed. Methods used to account for the ordinal nature of the score included non-parametric methods (26/33, 79%), logistic regression (5/33, 15%), ordinal logistic regression (2/33, 6%) and two-part models (1/33, 3%). Similar approaches were used to account for a possible peak at one end of the score distribution; methods used included non-parametric methods, ordinal logistic and logistic regression, two-part models, and GEE Tweedie models (Table 3). Few studies mentioned checking the assumptions of their chosen analysis method and/or examining model fit (36/233, 15%).
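The reason a single item score takes only four values can be seen from a small sketch. It assumes the standard linear transformation of a four-category response (1 to 4) onto a 0 to 100 scale; the function name is illustrative, and the transformation is stated here as an assumption because the text above does not spell it out.

```python
def single_item_score(response: int) -> float:
    """Linearly transform a 1-4 item response onto a 0-100 scale (assumed scoring rule)."""
    if response not in (1, 2, 3, 4):
        raise ValueError("response must be 1, 2, 3 or 4")
    item_range = 3  # difference between the highest and lowest response
    return (response - 1) / item_range * 100


# All values the score can take, rounded to one decimal place.
possible_scores = [round(single_item_score(r), 1) for r in (1, 2, 3, 4)]
print(possible_scores)
```

Because the outcome has only four support points, methods that assume a continuous, approximately normal response may fit poorly, which is why ordinal or non-parametric approaches are sometimes preferred for these items.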
Ten of the included studies cited the SISAQOL guidelines; 255/271 analysed at least one outcome for which the SISAQOL consortium had published a recommended analysis method. Just under one third of studies (73/255, 29%) followed the recommended approach for all outcomes, 17% (44/255) followed recommendations for some outcomes but not all, and 54% (138/255) did not follow the recommended analysis methods.
Of the 34 studies (17 RCTs, 17 other designs) that used QoL as a primary outcome, only 14/34 (41%) studies (7 RCTs, 7 other designs) clearly defined the QoL outcome of interest (e.g., QLQ-C30 fatigue or global health status) and the relevant time frame (Table 4). The statistical analysis was appropriate for the specified estimand in 27/34 (79%) studies (13 RCTs, 14 other designs), and 15 of these 27 studies reported an effect estimate and measure of precision (5 RCTs, 10 other designs). Twenty-three studies reported deaths; nine included patients who died in their analyses, nine performed cross-sectional analyses and so excluded deaths if they occurred prior to the time point(s) analysed, and five excluded deaths from all primary outcome analyses.