
    Journal of Medical Internet Research

    Age-related macular degeneration (AMD) is a progressive retinal disorder affecting millions of people worldwide []. In its advanced stages, characterized by neovascularization and geographic atrophy (GA), it can lead to significant vision loss, although symptoms may be subtle during the early and intermediate phases []. Based on imaging methods, the Classification of Atrophy Meetings group has defined stages of atrophy lesion development: incomplete retinal pigment epithelium (RPE) and outer retinal atrophy and complete RPE and outer retinal atrophy (cRORA) []. GA, also known as cRORA, is the endpoint of dry AMD and is characterized by the loss of photoreceptors, RPE, and choriocapillaris [,]. The advent of 2 approved therapies for GA secondary to AMD in 2023, pegcetacoplan (Syfovre) [] and avacincaptad pegol [], marks a significant breakthrough in GA treatment. However, the effectiveness of these therapies relies heavily on early detection and the ability to monitor treatment response, a significant unmet need in current clinical practice. The recent approval of complement inhibitors underscores the necessity for precise, reproducible, and practical tools that not only identify GA at its earliest stages but also objectively track morphological changes over time, thereby evaluating therapeutic efficacy [,]. Artificial intelligence (AI) is uniquely positioned to address this gap by enabling precise, reproducible, and automated quantification of GA progression and treatment response using noninvasive imaging modalities []. Unlike conventional methods that rely on subjective and time-consuming manual assessments, AI algorithms can detect subtle subclinical changes in retinal structures, such as photoreceptor integrity loss, RPE atrophy, and hyperreflective foci, long before they become clinically apparent. Thus, AI-based retinal imaging offers a critical foundation for early detection and timely intervention in GA.

    Various imaging techniques, both invasive and noninvasive, can directly visualize GA lesions. Invasive methods, such as fluorescein angiography, often result in a poor patient experience and entail high costs due to pupil dilation and sodium fluorescein injection. Although fluorescein angiography remains the gold standard for assessing neovascular AMD and offers significant diagnostic insights for retinal vascular diseases, noninvasive fundus images are used for GA diagnosis and management in most cases []. Color fundus photography (CFP), fundus autofluorescence (FAF), and near-infrared reflectance (NIR) are based on 2D images, which can generally quantify the atrophic area but cannot resolve retinal structure axially []. Compared with fundus imaging, optical coherence tomography (OCT) provides high-resolution, noninvasive 3D images of retinal structures for macular assessment. In addition, conventional B-scan (axial direction) OCT images can be integrated with en-face scans, facilitating the identification of atrophy borders in a manner similar to FAF [,]. Nonetheless, manual labeling is tedious, time-consuming, and impractical in a clinical setting []. There is an urgent and unmet need for early detection and management of GA using retinal imaging modalities. Recent advancements in AI, especially deep learning (DL), present a promising opportunity for enhancing GA detection, classification, segmentation, quantification, and prediction.

    Since the 1950s, AI has referred to computer systems capable of performing complex tasks that historically only a human could do. So what is AI? How is it used in medicine today? And what may it do in the future? AI refers to the theory and development of computer systems capable of performing tasks that historically required human intelligence, such as recognizing speech, making decisions, and identifying patterns. AI is an umbrella term that encompasses a wide variety of technologies, including machine learning (ML) and DL []. ML is a subfield of AI that uses algorithms trained on datasets to create self-learning models capable of predicting outcomes and classifying information without human intervention []; more generally, it refers to the use of algorithms and data to create autonomous or semiautonomous machines. DL, meanwhile, is a subset of ML that layers algorithms into “neural networks” with 3 or more layers. Thus, it somewhat resembles the human brain, enabling machines to perform increasingly complex tasks []. DL algorithms generally achieve high and clinically acceptable diagnostic accuracy across different areas of radiology (ophthalmology, respiratory disease, breast cancer, etc) []. Within ophthalmology, DL algorithms have shown reliable performance for detecting multiple findings in macular-centered retinal fundus images []. Automatic GA segmentation therefore plays a vital role in the diagnosis and management of advanced AMD and its application in the clinical setting.

    Given the rapid evolution of AI applications in ophthalmology and the growing clinical importance of GA, this study aimed to systematically review the current evidence on AI-based approaches for the detection and management of GA secondary to dry AMD using noninvasive imaging modalities. We aimed to evaluate diagnostic accuracy relative to reference standards and examine methodological challenges to inform the design of future research and clinical implementation.

    Protocol and Registration

    Before starting this systematic review and meta-analysis, we registered a protocol on the PROSPERO website. This review adhered to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) and PRISMA-DTA (PRISMA extension for Diagnostic Test Accuracy) checklists [,].

    Eligibility Criteria

    We included studies using AI algorithms to detect, classify, identify, segment, quantify, or predict GA secondary to AMD from CFP, OCT, OCT angiography, FAF, or NIR. The data were from participants, with or without symptoms, who were diagnosed with GA (or cRORA) secondary to nonexudative AMD. Study designs were not restricted; multicenter or single-center, prospective or retrospective, post hoc analysis, clinical study, or model development studies were all accepted. Eyes with neovascular complications or macular atrophy from causes other than AMD, any previous anti-vascular endothelial growth factor treatment, any confounding retinopathy, or poor image quality were excluded.

    Electronic Search Strategy

    Two consecutive searches were conducted on PubMed, Embase, Web of Science, Scopus, Cochrane Library, and CINAHL. Because this review required the extraction of baseline data and items, and to ensure the completeness of the data, we did not search any in-press or preprint sources and excluded conference proceedings and similar materials. The initial search covered from database inception to December 1, 2024; the updated search, from December 1, 2024, to October 5, 2025. We used a search strategy for the patient population (GA) and index tests (AI and retinal images) that had been used in a previous Cochrane Review, without any search peer review process []. There were no restrictions on the date of publication. The language was limited to English. In , detailed search strategies for each database are provided. During this process, no filters were used. Throughout the search process, we adhered to the PRISMA-S (Preferred Reporting Items for Systematic reviews and Meta-Analyses literature search extension) reporting guidelines [].

    Selection Process

    All relevant literature was imported into EndNote (version 20; Clarivate Analytics), and literature screening was conducted independently by 2 researchers (NS and JL) who specialize in ophthalmology. Duplicates were removed using the software, and titles and abstracts were reviewed to identify records relevant to the topic. Finally, the full texts were downloaded and examined, leading to the selection of literature that met the inclusion criteria. In cases of inconsistencies in the final inclusion decisions made by the 2 researchers, a third professional (LL) was consulted to resolve the dispute.

    Data Collection Process

    Using standardized data items, the data were extracted independently from the included studies by 2 researchers (NS and JL). A third review author (LL) confirmed or adjudicated any discrepancies through group discussion. We retrieved the following data items: (1) study characteristics (author, year, study design, region, and theme); (2) dataset characteristics (databases, source of databases, training/validation/testing ratio, patient number, number of images or volumes, scan number, mean age, clinical registration number, and model evaluation method); (3) image and algorithm characteristics (devices, metrics, image modality, image resolution, and AI algorithms); (4) performance metrics (outcomes, performance of models, ground truth, and performance of the ophthalmologists); and (5) main results. All the information was retrieved from the main text and the tables provided in . Therefore, we did not seek additional data by contacting the authors or experts. Some studies reported multiple sets of performance data based on subsets of a single dataset, for example, sensitivity, specificity, and accuracy computed on the cross-validation set, the test set, or the development set. In such cases, we referred to the relevant literature to select the optimal set of test performance results []. When a primary study trained the AI model on a development dataset and used an external validation set to determine the performance of the optimal model, we extracted the external validation set performance data [].

    Risk of Bias and Applicability

    We worked in pairs to assess the risk of bias and applicability of the included studies. Studies involving detection, classification, identification, segmentation, and quantification were assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS)-AI [] and the modified QUADAS-2 tool [], while predictive studies were assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST) [].

    At present, QUADAS-AI has not yet established a complete specification of items. Therefore, we referenced the examples provided by QUADAS-AI and the published literature to compile the revised QUADAS-AI items, which included 4 domains and 9 leading questions (Table S4 in ). The PROBAST tool comprises participants, predictors, outcomes, and analysis, containing 20 signaling questions across 4 domains (Table S5 in ). We also evaluated the applicability of each study based on the leading or signaling questions in the first 3 domains. A study with “yes” answers to all signaling questions was considered to have a low risk of bias. If the answer to any signaling question was “no,” there was a potential for bias, leading the authors to rate the risk of bias as high. “Indeterminate” grades were applied only when the reported data were insufficient, that is, when detailed content was not provided in the literature, making it difficult for the evaluator to reach a judgment. Throughout the process, disagreements between the 2 reviewers (NS and JL) were resolved by consulting the senior reviewer (LL).
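
    The grading rule described above can be sketched as a small decision function. This is an illustrative sketch only (the review applied the rule manually; the function and label names are our own, not part of QUADAS-AI or PROBAST):

```python
# Illustrative sketch (not the review's software): per-domain risk-of-bias
# grading as described in the text. Each answer is "yes", "no", or None
# (insufficient reporting).

def grade_domain(answers):
    """Grade one domain from its signaling-question answers."""
    if any(a == "no" for a in answers):
        return "high"           # any "no" -> potential for bias
    if all(a == "yes" for a in answers):
        return "low"            # "yes" to every question -> low risk
    return "indeterminate"      # reported data insufficient to judge

def overall_risk(domain_grades):
    """A study is high risk overall if any domain is high risk."""
    if "high" in domain_grades:
        return "high"
    return "indeterminate" if "indeterminate" in domain_grades else "low"
```

    For example, a study answering “yes” to every question in three domains but “no” to one question in the fourth would be graded high risk overall, matching the rule applied in this review.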

    Data Synthesis

    As very few studies reported the numbers of true positives, true negatives, false positives, and false negatives, quantitative analysis was restricted to the diagnostic accuracy of AI as a triaging tool for GA secondary to nonexudative AMD. However, a meta-analysis was not performed due to significant methodological heterogeneity across studies, arising from diverse AI architectures, imaging modalities, outcome metrics, and validation protocols. Instead, a systematic review was performed to qualitatively summarize performance trends. This approach allowed for a comprehensive evaluation of AI capabilities in the detection and management of GA via noninvasive images.
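
    For reference, the diagnostic accuracy metrics that pooling would have required are all derived from the 2×2 confusion matrix. A minimal sketch with illustrative counts (not data from any included study):

```python
# Minimal sketch: standard diagnostic test accuracy metrics from a 2x2 table.
# The counts below are illustrative only, not taken from the included studies.

def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common diagnostic accuracy metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                  # true positive rate (recall)
    specificity = tn / (tn + fp)                  # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    ppv = tp / (tp + fp)                          # positive predictive value
    npv = tn / (tn + fn)                          # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"SEN": sensitivity, "SPE": specificity, "ACC": accuracy,
            "PPV": ppv, "NPV": npv, "F1": f1}

m = diagnostic_metrics(tp=86, fp=6, tn=94, fn=14)
# For these illustrative counts: SEN=0.86, SPE=0.94, ACC=0.90
```

    Pooling these metrics across studies is only meaningful when the underlying counts are reported, which is why the scarcity of raw 2×2 data precluded a meta-analysis here.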

    Study Selection

    A total of 979 records related to the topic of this systematic review were retrieved across 6 different databases using a combination of subject terms and free-text terms. After removing duplicates, 335 records remained, and their titles and abstracts were examined. Excluding studies not relevant to the research topic left 200 reports. The full texts were then downloaded and reviewed in detail against the eligibility criteria. In the final qualitative analysis, 41 studies were included: 10 focused on GA diagnosis, 20 on GA assessment and progression, and 11 on GA prediction. presents the detailed flow diagram of the literature selection.

    Figure 1. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram for literature selection. GA: geographic atrophy.

    AI in Detecting the Presence of GA

    Ten of the 41 included studies focused on AI-based detection of GA using noninvasive retinal images (Table S1 in ). As listed in , the studies were published from 2018 to 2025. Four of the studies [-] focused on model development, 3 [-] were retrospective studies, and 3 [-] were prospective studies (1 multicenter cohort study, 1 multicenter and low-interventional clinical study, and 1 clinical study). Geographically, half were from the United States, with others from Israel, Italy, Switzerland, Germany, and a multicenter European collaboration. The studies addressed several detection-related tasks: 5 focused solely on GA detection [-,,], 2 covered detection and classification [,], and others integrated detection with quantification or segmentation [,,].

    Table 1. Characteristics of studies evaluating artificial intelligence (AI) models for geographic atrophy (GA) detection using noninvasive retinal imaging.
    Author Study design Region Purpose of the study Source of datasets Number of patients Number of images or scans Model evaluation method Image modality (image resolution) AI algorithms Outcomes Performance of models
    Fineberg et al [] Retrospective cohort study Israel (Petah Tikva) Detection and classification (GA) Rabin Medical Center 113 659 10-fold cross-validation NIR (640*640 pixels) CNNs: ResNet50, EfficientNetB0, ViT_B_16, and YOLOv8 variants. ACC, P, SEN, SPE, F1, IoU, and DSC
    • GA classification:
      EfficientNetB0: ACC=0.9148; P=0.9204; SEN=0.9233; SPE=1.0; F1=0.9147.
    • ResNet50: ACC=0.8815; P=0.8933; SEN=0.8917; SPE=0.9833; F1=0.8812.
    • ViT_B_16: ACC=0.963; P=0.9632; SEN=0.9667; SPE=1.0; F1=0.9629.
    • GA detection: YOLOv8-Large: SEN=0.91; P=0.91; IoU=0.84; DSC=0.88.
    Kalra et al [] Retrospective clinical study United States (Cleveland) Detection, quantification, and segmentation (presence of GA and pixel-wise GA area measurement) the Cole Eye Institute of the Cleveland Clinic 341 900 triple-fold cross-validation SD-OCT (256*256 pixels) CNN: U-Net F1, ACC, P, R, SEN, and SPE
    • GA detection: ACC=0.91; SEN=0.86; SPE=0.94; F1=0.87.
    • GA segmentation: ACC=0.96, SEN=0.95, SPE=0.93, F1=0.82.
    Derradji et al [] Retrospective clinical study Switzerland (Lausanne) Detection and segmentation (RORA) An existing image database of the Medical Retina Department at Jules-Gonin Eye Hospital 57 62 5-fold cross-validation SD-OCT (NR) CNN: U-Net SEN, DSC, P, and Kappa
    • Grader 1: DSC: mean 0.881 (SD 0.074); Precision: mean 0.928 (SD 0.054); SEN: mean 0.850 (SD 0.119); Kappa: mean 0.846 (SD 0.072).
    • Grader 2: DSC: mean 0.844 (SD 0.076); Precision: mean 0.799 (SD 0.133); SEN: mean 0.915 (SD 0.064); Kappa: mean 0.800 (SD 0.082).
    de Vente et al [] Prospective multicenter and low-interventional clinical study (including cross-sectional and longitudinal study part) 20 sites in 7 European countries Detection and quantification (cRORA) The MACUSTAR Study Cohort 168 143 (ZEISS); 167 (Spectralis) NR SD-OCT (512*650 pixels) CNN: U-Net SEN, SPE, PPV, NPV, and Kappa
    • ZEISS: SEN=0.6; SPE=0.964; PPV=0.375; NPV=0.985.
    • Spectralis: SEN=0.625; SPE=0.974; PPV=0.714; NPV=0.961.
    Sarao et al [] Prospective clinical study Italy (Udine) Detection (presence of GA) the Istituto Europeo di Microchirurgia Oculare (IEMO) study 180 540 NR CFP (NR) CNN: Efficientnet_b2 SEN, SPE, ACC, F1, R, AUROC, and AUPRC
    • SEN=100% (95% CI 83.2%-100%); SPE=97.5% (95% CI 86.8%-99.9%); ACC=98.4%; F1=0.976; R=1; AUROC=0.988 (95% CI 0.918-1); AUPRC=0.952 (95% CI 0.719-0.994).
    Keenan et al [] Multicenter and prospective cohort study United States (Maryland) Detection (presence of GA) Age-Related Eye Disease Study (AREDS) dataset 4582 59,812 5-fold cross-validation CFP (512 pixels) CNN: inception v3 ACC, SEN, SPE, P, AUC, and Kappa
    • ACC=0.965 (95% CI 0.959-0.971); Kappa=0.611 (95% CI 0.533-0.689); SEN=0.692 (95% CI 0.560-0.825); SPE=0.978 (95% CI 0.970-0.985); Precision=0.584 (95% CI 0.491-0.676).
    Yao et al [] Model development and evaluation United States (California) Detection (presence of nGA) the Early Stages of AMD (LEAD) study 140 1884 5-fold cross-validation SD-OCT (512*496 pixels) CNN: ResNet18 SEN, SPE, ACC, P, and F1
    • SEN=0.76 (95% CI 0.67-0.84); SPE=0.98 (95% CI 0.96-0.99); PRE=0.73 (95% CI 0.54-0.89); ACC=0.97 (95% CI 0.95-0.98); F1=0.74 (95% CI 0.61-0.84).
    Chiang et al [] Model development United States (California) Detection (complete retinal pigment epithelial and outer retinal atrophy (cRORA) in eyes with AMD) (1) University of Pennsylvania, University of Miami, and Case Western Reserve University; (2) Doheny Image Reading Research Laboratory, Doheny-UCLA (University of California Los Angeles Eye Centers) 71 (training); 649 (testing #1); 60 (testing #2) 188 (training); 1117 (testing #1) 5-fold cross-validation SD-OCT (256*256 pixels) CNN: ResNet18 SEN, SPE, PPV, NPV, AUROC, and AUPRC
    • SEN=0.909 (95% CI 0.778-1.000); SPE=0.553 (95% CI 0.394-0.703); PPV=0.541 (95% CI 0.375-0.707); NPV=0.913 (95% CI 0.778-1.000); AUROC=0.84 (95% CI 0.75-0.94); AUPRC=0.82 (95% CI 0.70-0.93).
    Elsawy et al [] Model development United States (Maryland) Detection (explain decision making and compare methods) The Age-Related Eye Disease Study 2 (AREDS2) Ancillary SD-OCT study from Devers Eye Institute, Emory Eye Center, Duke Eye Center, and the National Eye Institute 311 1284 scans 10-fold cross-validation SD-OCT (128*128 or 224* pixels) 3D CNN: deep-GA-Net ACC, P, R, F1, Kappa, AUROC, and AUPRC
    • ACC=0.93 (95% CI 0.92-0.94); Precision=0.90 (95% CI 0.88-0.91); Recall=0.90 (95% CI 0.89-0.92); F1 score=0.90 (95% CI 0.89-0.91); Kappa=0.80 (95% CI 0.77-0.83); AUROC=0.94 (95% CI 0.93-0.95); AUPRC=0.91 (95% CI 0.90-0.93).
    Treder et al [] Model development Germany (Muenster) Detection and classification (GA) Public database: ImageNet 400 (training); 60 (test set) 400 (training); 60 (test set) NR FAF (NR) Deep CNN: self-learning algorithm SEN, SPE, and ACC
    • Probability score: mean 0.981 (SD 0.048); SEN=100%; SPE=100%; ACC=100%.

    aAI: artificial intelligence.

    bACC: accuracy.

    cAUPRC: area under the precision-recall curve.

    dCNN: convolutional neural network.

    eCFP: color fundus photography.

    fcRORA: complete retinal pigment epithelium and outer retinal atrophy.

    gDSC: dice similarity coefficient.

    hFAF: fundus autofluorescence.

    iIoU: intersection over union.

    jNR: not reported.

    kOCT: optical coherence tomography.

    lPPV: positive predictive value.

    mP: precision.

    nR: recall.

    oSD-OCT: spectral domain OCT.

    pSEN: sensitivity.

    qSPE: specificity.

    rAUROC: area under the receiver operating characteristic curve.

    sAMD: age-related macular degeneration.

    tNPV: negative predictive value.

    Dataset configurations varied: 6 studies used training, validation, and test sets [-,,]; 3 used only training and test sets [,,]; and 1 included a tuning set []. Collectively, these studies involved at least 7132 participants, with ages ranging from 50 to 85 years. Three studies were registered with ClinicalTrials.gov (NCT00734487, NCT01790802, and NCT03349801) [,,]. Cross-validation methods included 5-fold (40% of studies) [,,,], 10-fold (20%) [,], and triple-fold (10%) []; 30% did not report validation details.
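
    The cross-validation schemes reported above all partition the data so that each sample is held out for testing exactly once. A minimal index-level sketch (our own illustration; note that studies of this kind should split by patient or eye, not by image, to avoid leakage between folds):

```python
# Minimal sketch of k-fold cross-validation (5-fold, 10-fold, or 3-fold as in
# the reviewed studies). Index-level only; real pipelines should group indices
# by patient/eye so images from one patient never span train and test folds.

def k_fold_splits(n_items: int, k: int):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_items))
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Example: 10 scans, 5 folds -> every scan appears in exactly one test fold
folds = list(k_fold_splits(10, 5))
```

    Reporting which unit (image, eye, or patient) the folds were built on is exactly the kind of validation detail that 30% of the included studies omitted.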

    Spectral-domain (SD) OCT was the most frequently used imaging modality (6/10 studies) [-,,,], followed by CFP (2/10) [,] and FAF and NIR (1/10 each) [,]. Most studies applied image preprocessing techniques, such as size standardization, orientation adjustment, intensity normalization, and noise reduction, to improve model performance. DL-based algorithms for GA detection have been developed for multiple image modalities. For example, Derradji et al [] trained a convolutional neural network (CNN), a U-Net based on the EfficientNet-b3 architecture, to predict atrophic signs in the retina. Kalra et al [] and de Vente et al [] also trained DL models based on U-Net. Yao et al [] applied 3D OCT scans with a ResNet18 pretrained on the ImageNet dataset, and Chiang et al [] developed a CNN (ResNet18) to improve computational efficiency. Elsawy et al [] proposed Deep-GA-Net, a 3D backbone CNN with a 3D loss-based attention layer, and evaluated the effectiveness of using attention layers. Sarao et al [] used a deep CNN, the EfficientNet_b2 model, which was pretrained on the ImageNet dataset and is well known for its high efficiency and small size. Keenan et al [] established their model using Inception v3, while Treder et al [] trained a deep CNN, a self-learning algorithm, processing FAF images as input.

    A total of 14 performance sets were extracted from the 10 studies. Key metrics included sensitivity, specificity, accuracy, positive predictive value, negative predictive value, intersection over union, area under the receiver operating characteristic curve, area under the precision-recall curve, F1-score, precision, recall, Kappa, and Dice similarity coefficient. Six OCT-based studies showed that DL models could detect GA with high accuracy, comparable to human graders [-,,,]. Two studies using CFP also reported strong performance [,], while FAF- and NIR-based approaches demonstrated excellent repeatability and reliability [,].
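
    The two overlap metrics in this list, the Dice similarity coefficient (DSC) and intersection over union (IoU), are computed directly from binary lesion masks. A toy sketch with made-up masks, not tied to any study's data:

```python
# Toy sketch: Dice similarity coefficient (DSC) and intersection over union
# (IoU) between a predicted and a ground-truth binary GA mask, flattened to
# equal-length 0/1 lists (1 = atrophic pixel). Masks below are illustrative.

def dice_iou(pred, truth):
    """Return (DSC, IoU) for two equal-length binary masks."""
    inter = sum(p & t for p, t in zip(pred, truth))
    p_sum, t_sum = sum(pred), sum(truth)
    union = p_sum + t_sum - inter
    dsc = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dsc, iou

# 4 predicted GA pixels, 5 true GA pixels, 3 overlapping:
pred  = [1, 1, 1, 1, 0, 0, 0, 0]
truth = [1, 1, 1, 0, 1, 1, 0, 0]
dsc, iou = dice_iou(pred, truth)  # DSC = 6/9 ≈ 0.667, IoU = 3/6 = 0.5
```

    The two metrics are monotonically related (DSC = 2·IoU/(1+IoU)), so studies reporting only one can be compared with the other, but DSC is systematically the larger of the two.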

    We conducted a thorough evaluation of the methodological quality of the 10 diagnostic studies across the “participant selection,” “index test,” “reference standard,” and “flow and timing” domains at the study level (). None of the studies had an overall low or unclear risk of bias; every study had a high risk of bias in at least 1 of the 4 domains. Regarding “patient selection,” only 4 studies [,,,] described the eligibility criteria; the rest did not report them. One study [] used an open dataset (ImageNet) and did not include a test set. The small sample sizes of 4 studies [,,,] may have resulted in overfitting. In addition, 3 studies [,,] did not report image formats and resolutions. Five studies [,,-] had a high risk of bias in participant selection because they included not only participants with GA secondary to dry AMD but also participants with other unrelated diseases. Regarding the “index test,” only 1 algorithm was externally validated using a different dataset []; all other items were evaluated as low risk.

    Table 2. Methodological quality and applicability assessment for studies on geographic atrophy (GA) detection using the revised Quality Assessment of Diagnostic Accuracy Studies–Artificial Intelligence (QUADAS-AI).
    Study Risk of bias Concerns regarding applicability
    Patient selection Index test Reference standard Flow and timing Patient selection Index test Reference standard
    Chiang et al [] High risk Low risk Low risk Low risk Low risk Low risk Low risk
    Elsawy et al [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    Kalra et al [] High risk High risk Low risk Low risk High risk Low risk Low risk
    Keenan et al [] High risk High risk Low risk Low risk High risk Low risk Low risk
    Sarao et al [] High risk High risk Low risk Low risk High risk Low risk Low risk
    Yao et al [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    Treder et al [] High risk High risk Low risk Low risk High risk Low risk Low risk
    de Vente et al [] High risk High risk Low risk Low risk High risk Low risk Low risk
    Derradji et al [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    Fineberg et al [] High risk High risk Low risk Low risk Low risk Low risk Low risk

    AI in GA Assessment and Progression

    Twenty studies explored AI for GA assessment and progression using noninvasive imaging, published between 2019 and 2025 (Table S2 in ). As shown in , these studies covered 11 on segmentation [,,-], 2 on algorithm optimization [,], 3 on AMD progression classification [-], and 3 on combined tasks such as identification, segmentation, and quantification [-]; 1 study focused solely on GA quantification []. Retrospective analyses accounted for 9 studies [,,,,,,,,], while 7 were model development studies [,-,,,], and the remainder were prospective [,], comparative [], or cross-sectional [] studies. Geographically, contributions came from China (6/20), the United States (7/20), the United Kingdom (2/20), Australia (2/20), France (1/20), Israel (1/20), and Austria (1/20).

    Table 3. Characteristics of studies evaluating artificial intelligence (AI) models for geographic atrophy (GA) assessment and progression using noninvasive retinal imaging.
    Author Study design Region Purpose of the study Source of datasets Number of patients Number of images or scans Model evaluation method Image modality (Image resolution) AI algorithms Outcomes Performance of models
    Pramil et al [] Retrospective review of images United States (Boston) Segmentation (GA lesions) The “SWAGGER” cohort of the non-Exudative Age-Related Macular Degeneration (from New England Eye Center at Tufts Medical Center) 90 126 5-fold cross-validation SS-OCT (500*500 pixels) CNN: U-Net SEN, SPE, and DICE
    • SEN=0.95; SPE=0.91; DSC (vs G1): mean 0.92 (SD 0.11); DSC (vs G2): mean 0.91 (SD 0.12).
    Siraz et al [] Retrospective comparative study United States (North Carolina) Classification (central and noncentral GA) Atrium Health Wake Forest Baptist 104 355 NR SD-OCT (224*224 pixels) CNNs: ResNet50, MobileNetV2, and ViT-B/16 AUROC, F1, and ACC
    • ResNet50: AUROC: mean 0.545 (SD 0.004), F1: mean 0.431 (SD 0.00); ACC: mean 0.756 (SD 0.00).
    • MobileNetV2: AUROC: mean 0.521 (SD 0.016), F1: mean 0.432 (SD 0.002); ACC: mean 0.756 (SD 0.00).
    • ViT-B/16: AUROC: mean 0.718 (SD 0.002), F1: mean 0.602 (SD 0.004); ACC: mean 0.780 (SD 0.005).
    Arslan et al [] Retrospective cohort clinical study Australia (Victoria) Segmentation (GA lesion area) The Center for Eye Research Australia or a private ophthalmology practice diagnosed with GA 51 702 5-fold cross-validation FAF (768*768 or 1536*1536 pixels) CNN: U-Net DSC, DSC loss, SEN, SPE, MAE, ACC, R, and P
    • DSC: mean 0.9780 (SD 0.0124); DSC loss: mean 0.0220 (SD 0.0041); SEN: mean 0.9903 (SD 0.0041); SPE: mean 0.7498 (SD 0.0955); MAE: mean 0.0376 (SD 0.0184); ACC: mean 0.9774 (SD 0.0090); P: mean 0.9837 (SD 0.0116).
    Hu et al [] Retrospective clinical study China (Shenyang) Classification (dry AMD progression phases) Shenyang Aier Eye Hospital 338 3401 5-fold cross-validation SD-OCT (NR) CNNs: EfficientNetV2, DenseNet169, Xception, and ResNet50NF ACC, SEN, SPE, F1, Macro-f1, and Kappa
    • ACC=97.31%; SEN=89.25%; SPE=98.80%; F1=91.21%; Macro-f1=92.08%; Kappa=95.45%.
    Spaide et al [] Retrospective analysis and model comparison United States (Washington) Segmentation (GA lesion area) The SWAGGER cohort from the New England Eye Center at Tufts Medical Center 87 126 scans 5-fold cross-validation SS-OCT (NR) CNN: U-Net DSC
    • UNet-1: 0.82 (95% CI 0.78-0.86).
    • UNet-Avg: 0.88 (95% CI 0.85-0.91).
    • UNet-Drop: 0.90 (95% CI 0.87-0.93).
    Vogl et al [] Retrospective analysis Austria (Vienna) Identification (GA progression after pegcetacoplan treatment) The FILLY trial 156 NR NR SD-OCT (512*512 pixels) CNN: U-Net LPR
    • Compared with sham treatment, monthly: −28% (−42.8 to −9.4).
    • Every other month: −23.9% (−40.2 to −3.0).
    Szeskin et al [] Retrospective analysis Israel (Jerusalem) Identification, quantification (GA lesion) Datasets D1 and D2 from the Hadassah University Medical Center D1: 18; D2: 16 NR 4-fold cross-validation SD-OCT (496*1024 pixels and 496*1536 pixels) CNN: the custom column classification CNN AUROC, P, R, and F1
    • AUROC=0.970; (Segment) P: mean 0.84 (SD 0.11); R: mean 0.94 (SD 0.03); (Lesion) P: mean 0.72 (SD 0.03); R: mean 0.91 (SD 0.18).
    Spaide et al [] Retrospective analysis United States (California) Segmentation (GA lesion area) Proxima A and B Proxima A: 154; Proxima B: 183 Proxima A: 497; Proxima B: 940 NR FAF, NIR (768*768 pixels) Multimodal DL: U-Net; YNet DSC and r2
    • (G1-Ynet)DSC: mean 0.92 (SD 0.09).
    • (G1-Unet)DSC: mean 0.90 (SD 0.09).
    • (G2-Ynet)DSC: mean 0.91 (SD 0.09).
    • (G2-Unet)DSC: mean 0.90 (SD 0.09).
    • (Ynet) r2: 0.981.
    • (Unet) r2: 0.959.
    Al-khersan et al [] Retrospective analysis United States (Texas) Segmentation (GA) The Retina Consultants of Texas and Retina Vitreous Associates 33; 326 367; 348 5-fold cross-validation SD-OCT (512*496 pixels; 200*1024 pixels) CNN: 3D-to-2D U-Net DSC and r2
    • For Spectralis data, DSC=0.826; r2=0.906.
    • For Cirrus data, DSC=0.824; r2=0.883.
    Chu et al [] Prospective study United States (Washington) Identification, segmentation, and quantification (GA) The University of Miami 70; 20; 25 NR NR SS-OCT (512*512 pixels) CNN: U-Net DSC, SEN, and SPE
    • DSC: mean 0.940 (SD 0.032); SEN=100%; SPE=100%.
    Merle et al [] Prospective observational study Australia (Victoria) Quantification (GA) The Center for Eye Research Australia 50 NR NR SD-OCT; FAF (NR) CNN: U-Net Spearman correlation coefficient and SEN
    • (OCT-automatically) Spearman correlation coefficient=0.85 (95% CI 0.71-0.91); SEN=0.59.
    Yang et al [] Model development China (Shenyang) Classification (stage of dry AMD progression) Shenyang Aier Excellence Eye Hospital 1310 16,384 3-fold cross-validation SD-OCT (NR) CNNs: ResNet50, EfficientNetB4, MobileNetV3, Xception ACC, SEN, SPE, and F1
    • ACC(GA): ResNet50=92.35%; EfficientNetB4=93.85%; MobileNetV3=89.64%; Xception=91.16%.
    • ACC (nascent GA): ResNet50=91.56%; EfficientNetB4=89.66%; MobileNetV3=89.43%; Xception=85.22%.
    Ji et al [] Model development China (Nanjing) Segmentation (GA lesion area) Dataset1 and dataset2 8; 54 NR NR SD-OCT (224*224 pixels) Weakly supervised multitask learning: Mirrored X-Net DSC, IoU, AAD, and CC
    • DSC: mean 0.862 (SD 0.080); IoU: mean 0.765 (SD 0.119); AAD: mean 0.090 (SD 0.090); CC: 0.992.
    Ma et al [] Model development China (Jinan) Segmentation (GA lesion area) Dataset1 and dataset2 62 NR 5-fold cross-validation SD-OCT (224*224 pixels) Weakly supervised model: VGG16 DSC, OR, AAD, CC, and AUROC
    • DSC: mean 0.847 (SD 0.087); OR: mean 0.744 (SD 0.126); AAD: mean 0.150 (SD 0.149); CC: 0.969; AUROC: 0.933.
    Royer et al [] Model development France (Issy-Les-Moulineaux) Segmentation (GA lesion area) the Clinical Imaging Center of the Quinze-Vingts Hospital 18 328 8 different random combinations of 12 series to train the model and 6 for the tests NIR (256*256 pixels) Unsupervised neural networks: W-net F1, P, and R
    • F1: mean 0.87 (SD 0.07); P: mean 0.90 (SD 0.07); R: mean 0.85 (SD 0.11).
    Xu et al [] Model development China (Jinan) Segmentation (GA lesion area) Dataset1 and dataset2 8 (test I); 56 (test II) 55 (dataset1); 56 (dataset2) NR SD-OCT (1024*512*128 pixels; 1024*200*200 pixels) Self-learning algorithm OR, AAD, and CC
    • OR: mean 84.48% (SD 11.98%); AAD: mean 11.09% (SD 13.61%); CC: 0.9948.
    Zhang et al [] Model development United Kingdom (London) Segmentation and quantification (GA) The FILLY study 200 984 NR SD-OCT (NR) CNN: U-Net DSC, ICC, ACC, SEN, SPE, and F1
    • Approach 1: ACC=0.91 (95% CI 0.89-0.93); F1=0.94 (95% CI 0.92-0.96); SEN=0.99 (95% CI 0.97-1.00); SPE=0.54 (95% CI 0.47-0.61); DSC: mean 0.92 (SD 0.14); ICC=0.94.
    • Approach 2: ACC=0.94 (95% CI 0.92-0.96); F1=0.96 (95% CI 0.94-0.98); SEN=0.98 (95% CI 0.96-1.00); SPE=0.76 (95% CI 0.70-0.82); DSC: mean 0.89 (SD 0.18); ICC: 0.91.
    Liu et al [] Model development China (Wuhan) Segmentation (GA) Wuhan Aier Eye Hospital; the public dataset OCTA500 300 2923 5-fold cross-validation SD-OCT (512*512 pixels) Self-learning algorithm (dual-branch image projection network) Jaccard index, DSC, ACC, P, and R
    • DSC: mean 7.03 (SD 2.73); Jaccard index: mean 80.96 (SD 4.29); ACC: mean 91.84 (SD 2.13); P: mean 87.12 (SD 2.34); R: mean 86.56 (SD 2.92).
    Williamson et al [] Cross-sectional study United Kingdom (London) Segmentation (GA lesion area) INSIGHT Health Data Research Hub at Moorfields Eye Hospital 9875 (OCT); 81 (FAF) NR NR 3D-OCT; FAF (512*512 pixels) Self-learning algorithm PPV
    Safai et al [] Comparative analysis United States (Wisconsin) Identification (the best AI framework for segmentation of GA) AREDS2 study; the GlaxoSmithKline (GSK) study 271 (AREDS2); 100 (GSK) 601 (AREDS2); 156 (GSK) 5-fold cross-validation FAF (512*512 pixels) CNNs: UNet, FPN, PSPNet, EfficientNet, ResNet, VGG, mViT CC and DSC
    • FPN_EfficientNet: CC=0.98, DSC=0.931.
    • FPN_ResNet: CC=0.98, DSC=0.902.
    • FPN_VGG: CC=0.98, DSC=0.934.
    • FPN_mViT: CC=0.99, DSC=0.939.
    • UNet_EfficientNet: CC=0.98, DSC=0.924.
    • UNet_ResNet: CC=0.97, DSC=0.930.
    • UNet_VGG: CC=0.97, DSC=0.896; UNet_mViT: CC=0.99, DSC=0.938.
    • PSPNet_EfficientNet: CC=0.93, DSC=0.890.
    • PSPNet_ResNet: CC=0.87, DSC=0.877.
    • PSPNet_VGG: CC=0.95, DSC=0.900.
    • PSPNet_mViT: CC=0.98, DSC=0.889.

    aSS-OCT: swept-source OCT.

    bCNN: convolutional neural network.

    cSEN: sensitivity.

    dSPE: specificity.

    eDSC: dice similarity coefficient.

    fNR: not reported.

    gSD-OCT: spectral domain OCT.

    hAUROC: area under the receiver operating characteristic curve.

    iACC: accuracy.

    jCGA: central geographic atrophy.

    kNCGA: noncentral geographic atrophy.

    lFAF: fundus autofluorescence.

    mMAE: mean absolute error.

    nR: recall.

    oP: precision.

    pAMD: age-related macular degeneration.

    qLPR: local progression rate.

    rNIR: near-infrared reflectance.

    sDL: deep learning.

    tr2: Pearson correlation coefficient.

    uOCT: optical coherence tomography.

    vIoU: intersection over union.

    wAAD: absolute area difference.

    xCC: correlation coefficient.

    yOR: overlap ratio.

    zICC: intraclass correlation coefficient.

    aaPPV: positive predictive value.

    abAREDS2: Age-Related Eye Disease Study 2.

    acFPN: Feature Pyramid Network.

    adVGG: Visual Geometry Group.

    aemViT: Mix Vision Transformer.

    Dataset configurations varied: 9 out of 20 studies used training, validation, and test sets [,,-,-]; 11 studies used training and test sets [,,-,]; 2 studies used training and validation sets [,]; 1 study comprised training, tuning, and internal validation sets []; and 2 studies did not specify [,]. Across studies, at least 14,064 participants provided image data for analysis. Fewer than half of the studies (9/20, 45%) provided demographic information, with the average age of participants ranging from 55 to 94 years. Six studies were registered with ClinicalTrials.gov (NCT01342926, NCT02503332, NCT02479386, NCT02399072, and NCT04469140 [,,,,,]). To assess the generalization ability of the DL model, cross-validation methods included 5-fold (8/20 studies [,,,-,]), 4-fold (1/20 study []), 3-fold (1/20 study []), and other approaches (1/20 study []). Nine studies did not report validation specifics.
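As a methodological aside, the k-fold cross-validation scheme used by most of these studies can be sketched in a few lines of Python. The fold-splitting logic below is a generic illustration (scikit-learn's KFold is the usual production choice), not code from any of the reviewed models:

```python
import random

def kfold_indices(n_samples, k=5, seed=42):
    """Split sample indices into k roughly equal, shuffled folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_size, rem = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # Spread the remainder over the first `rem` folds.
        end = start + fold_size + (1 if i < rem else 0)
        folds.append(idx[start:end])
        start = end
    return folds

def cross_validate(n_samples, k=5):
    """Yield (train_indices, test_indices) once per fold."""
    folds = kfold_indices(n_samples, k)
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Each scan appears in the test fold exactly once across the k iterations; in practice, studies split at the patient level rather than the scan level to avoid leakage between images of the same eye.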

    Multiple imaging modalities supported GA assessment: spectral domain optical coherence tomography (SD-OCT) was most common, followed by swept-source OCT (SS-OCT), 3D-OCT, FAF, and NIR. Preprocessing techniques were widely applied to standardize images and improve model performance. Algorithm architectures varied, with U-Net being the most frequently used. Other approaches included custom CNNs, self-learning algorithms, weakly supervised models, and multimodal networks. For example, Hu et al [] trained the DL models (ResNet-50, Xception, DenseNet169, and EfficientNetV2), evaluating them on a single fold of the validation dataset, with all F1-scores exceeding 90%. Yang [] proposed an ensemble DL architecture that integrated 4 different CNNs, including ResNet50, EfficientNetB4, MobileNetV3, and Xception, to classify dry AMD progression stages. GA lesions on FAF were automatically segmented using multimodal DL networks (U-Net and Y-Net) fed with FAF and NIR images []. In contrast to the multimodal algorithms mentioned above (ie, the examples of DL models), Safai [] investigated 3 distinct segmentation architectures along with 4 commonly used encoders, resulting in 12 different AI model combinations to determine the optimal AI framework for GA segmentation on FAF images.

    From 20 studies, 42 performance sets were collected. Common metrics included the correlation coefficient, mean absolute error, Spearman correlation coefficient, intraclass correlation coefficient, overlap ratio, Pearson correlation coefficient (r2), Kappa, specificity (SPE), sensitivity (SEN), accuracy, positive predictive value (PPV), F1-score, P, R, intersection over union, and dice similarity coefficient (DSC). Regarding the segmentation, classification, identification, and quantification of GA in SD-OCT, 12 studies demonstrated performance comparable to that of clinical experts [,,,,,-,,]. AI was also capable of efficiently detecting, segmenting, and measuring GA in SS-OCT, 3D-OCT, and FAF images, according to 4 studies [,,,]. AI also achieved good segmentation performance for GA in FAF and NIR images in studies incorporating clinical data [,,].
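For readers less familiar with the overlap metrics cited throughout this section, the dice similarity coefficient and intersection over union can be computed from binary segmentation masks as sketched below; plain Python lists of 0/1 pixels stand in for the NumPy arrays a real pipeline would use:

```python
def dice_coefficient(pred, truth):
    """DSC = 2|A∩B| / (|A| + |B|); 1.0 means perfect overlap."""
    inter = sum(p & t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 2 * inter / total if total else 1.0

def iou(pred, truth):
    """Intersection over union (Jaccard index)."""
    inter = sum(p & t for p, t in zip(pred, truth))
    union = sum(p | t for p, t in zip(pred, truth))
    return inter / union if union else 1.0
```

For example, with pred = [1, 1, 0, 0] and truth = [1, 0, 1, 0], one pixel overlaps out of four labeled, giving DSC = 0.5 and IoU = 1/3; DSC is always at least as large as IoU, which is worth remembering when comparing studies that report different metrics.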

    We performed a comprehensive assessment of the methodological quality of the 20 GA assessment and progression studies encompassing 4 domains (). Only 8 studies detailed the eligibility criteria in the “patient selection” category, while the remainder did not report them. Three of the studies [-] lacked complete datasets, and 3 others [,,] had small datasets or limited volumes of data. In addition, 3 studies [,,] failed to provide information on image formats or resolutions. Two studies [,] were ranked as high risk regarding patient selection because the participants included other types of dry AMD (drusen, nascent GA). In terms of applicability, 18 studies were classified as low risk, while 2 were deemed high risk concerning patient selection. Concerning the “index test,” only 3 algorithms underwent external validation with a different dataset [,,]. All other items were evaluated as low risk.

    Table 4. Methodological quality and applicability summary of geographic atrophy (GA) assessment and progression studies using the revised Quality Assessment of Diagnostic Accuracy Studies–Artificial Intelligence (QUADAS-AI).
    Study Risk of bias Concerns regarding applicability
    Patient selection Index test Reference standard Flow and timing Patient selection Index test Reference standard
    M Hu [] High risk High risk Low risk Low risk High risk Low risk Low risk
    JK Yang [] High risk High risk Low risk Low risk High risk Low risk Low risk
    A Safai [] Low risk Low risk Low risk Low risk Low risk Low risk Low risk
    WD Vogl [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    A Szeskin [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    ZD Chu [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    ZX Ji [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    X Ma [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    C Royer [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    T Spaide [] High risk Low risk Low risk Low risk Low risk Low risk Low risk
    T Spaide [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    DJ Williamson [] Low risk High risk Low risk Low risk Low risk Low risk Low risk
    RB Xu [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    J Arslan [] Low risk High risk Low risk Low risk Low risk Low risk Low risk
    V Pramil [] Low risk High risk Low risk Low risk Low risk Low risk Low risk
    GY Zhang [] High risk Low risk Low risk Low risk Low risk Low risk Low risk
    DA Merle [] High risk High risk Low risk Low risk Low risk Low risk Low risk
    H Al-khersan [] Low risk High risk Low risk Low risk Low risk Low risk Low risk
    S Siraz [] Low risk High risk Low risk Low risk Low risk Low risk Low risk
    XM Liu [] High risk High risk Low risk Low risk Low risk Low risk Low risk

    AI in Predicting GA Lesion Area and Progression

    Eleven studies used AI for predicting GA lesion growth and progression using noninvasive imaging (Table S3 in ). These studies were published between 2021 and 2025, with some information provided in . The study designs consisted of 6 retrospective studies [-], 2 model development studies [,], 2 post hoc analyses [,], and 1 clinical evaluation of a DL algorithm []. Participants or images came from various regions: 6 studies were based in the United States [,-,], 3 in Austria [-], 1 in Switzerland [], and another involving multiple centers in China and the United States []. Research aims focused on GA growth prediction [,,-,,], combined prediction and evaluation of lesion features [], treatment response assessment [], and integrated segmentation-prediction tasks [,].

    Table 5. Characteristics of studies evaluating artificial intelligence (AI) models for geographic atrophy (GA) prediction using noninvasive retinal imaging.
    Author Study design Region Purpose of the study Source of datasets Number of patients Number of images or scans or cubes Model evaluation method Image modality (resolution) AI algorithms Outcomes Performance of models
    Gigon et al [] Retrospective monocentric study Switzerland (Lausanne) Prediction (RORA progression) Jules Gonin Eye Hospital 119 NR NR SD-OCT (384*384 pixels) CNN: EfficientNet-b3 DSC
    • 0-6 months: 0.84
    • 6-12 months: 0.84
    • >12 months: 0.89
    Dow et al [] Retrospective cohort study United States (Atlanta, Georgia, Portland, Oregon, North Carolina; Maryland, Raleigh, Morrisville, Cary); United Kingdom (Durham, South Durham) Prediction (iAMD to GA within 1 year) 3 independent datasets from AREDS2 and a tertiary referral center and associated satellites 316; 53; 48 1085; 53; 48 5-fold cross-validation SD-OCT (512*1000 pixels) CNN: Inception v3 SEN, SPE, PPV, NPV, ACC
    • SEN: 0.91 (95% CI 0.74-0.98); SPE: 0.80 (95% CI 0.63-0.91); PPV: 0.78 (95% CI 0.70-0.85); NPV: 0.92 (95% CI 0.90-0.95); ACC: 0.85 (95% CI 0.87-0.91)
    Cluceru et al [] Retrospective clinical study; observation study United States (California) Prediction and evaluation (GA growth rate and GA features related to shape and size) The lampalizumab phase 3 clinical trials and an accompanying observational study 1041; 255 NR 5-fold cross-validation FAF (384*384 pixels) CNN: VGG16 r2
    • Full FAF images: 0.44 (95% CI 0.36-0.49)
    • Rim only: 0.37 (95% CI 0.35-0.4)
    • Lesion only: 0.34 (95% CI 0.31-0.36)
    • Background only: 0.3 (95% CI 0.27-0.33)
    • Mask only: 0.27 (95% CI 0.24-0.29)
    Anegondi et al [] Retrospective clinical study; observation study United States (California) Prediction and prognosis (GA lesion area and GA growth rate after lampalizumab treatment) The lampalizumab phase 3 clinical trials and an accompanying observational study 1279; 443; 106; 169 NR 5-fold cross-validation SD-OCT, FAF (512*512 pixels) CNN: Inception v3 r2 GA prediction:

    • FAF-only: 0.98 (95% CI 0.97‐0.99)
    • OCT-only: 0.91 (95% CI 0.87‐0.95),
    • Multimodal: 0.94 (95% CI 0.92‐0.96).

    GA growth rate:

    • FAF-only: 0.65 (95% CI 0.52‐0.75),
    • OCT-only: 0.36 (95% CI 0.29‐0.43),
    • Multimodal: 0.47 (95% CI 0.40‐0.54)
    Salvi et al [] Retrospective analysis United States (California) Prediction (the 1 year region of growth of GA lesions) The following lampalizumab clinical trials and prospective observational studies 597 NR NR FAF (768*768 pixels or 1536*1536 pixels) CNN: U-Net P, R, DSC, r2 Whole lesion:

    • P: mean 0.70 (SD 0.12); R: mean 0.73 (SD 0.12); DSC: mean 0.70 (SD 0.09); r2: 0.79
    Yoshida [] Retrospective analysis United States (California) Prediction (GA progression) Three prospective clinical trials 1219; 442 NR 5-fold cross-validation 3D OCT (496*1024*49 voxels) CNNs: (1) en-face intensity maps; (2) SLIVER-net; (3) a 3D CNN; and (4) en-face layer thickness and between-layer intensity maps from a segmentation model r2
    • GA lesion area: En-face intensity map: 0.91; SLIVER-net: 0.83; 3D DenseNet: 0.90; OCT EZ and RPE thickness map: 0.90;
    • GA growth rate: En-face intensity map: 0.33; SLIVER-net: 0.33; 3D DenseNet: 0.35; OCT EZ and RPE thickness map: 0.35.
    GS Reiter [] Post hoc analysis Austria (Vienna) Prediction (GA lesions progression) the phase II randomized controlled trial FILLY 134 268 scans 5-fold cross-validation FAF, NIR, SD-OCT (NR) CNN: PSC-UNet ACC, Kappa, concordance index
    • ACC: 0.48; Kappa: 0.23; concordance index: 0.69
    J Mai [] Post hoc analysis Austria (Vienna) Segmentation, quantification, and prediction (GA lesion and progression) The phase 2 FILLY clinical trial and the Medical University of Vienna (MUV) 113; 100 226; 967 5-fold cross-validation SD-OCT, FAF (768*768 and 1536*1536 pixels) CNN: U-Net DSC, Hausdorff distance, ICC
    • MUV: DSC: mean 0.86 (SD 0.12); Hausdorff distance: mean 0.54 (SD 0.45);
    • FILLY: DSC: mean 0.91 (SD 0.05); Hausdorff distance: mean 0.38 (SD 0.40)
    YH Zhang [] Model development China (Nanjing); United States (California) Prediction (GA growth) The Byers Eye Institute of Stanford University; the Jiangsu Provincial People’s Hospital 22; 3 86 cubes; 33 cubes Leave-one-out cross-validation SD-OCT (178*270 pixels) Recurrent neural network: bidirectional long short-term memory network; CNN: 3D-UNet DSC, CC
    • Scenario I: DSC: 0.86; CC: 0.83;
    • Scenario II: DSC: 0.89; CC: 0.84;
    • Scenario III: DSC: 0.89; CC: 0.86;
    • Scenario IV: DSC: 0.92; CC: 0.88;
    • Scenario V: DSC: 0.88; CC: 0.85;
    • Scenario VI: DSC: 0.90; CC: 0.86
    SX Wang [] Model development United States (California) Segmentation and prediction (GA lesion area and GA progression) The University of California—Los Angeles 147 NR 8-fold cross-validation SD-OCT, FAF (512*512 pixels) CNN: U-Net SEN, SPE, ACC, OR
    • ACC: 0.95; SEN: 0.60; SPE: 0.96; OR: 0.65
    J Mai [] Clinical evaluation of a DL-based algorithm Austria (Vienna) Prediction (GA lesions progression) The Medical University of Vienna 100 967 5-fold cross-validation SD-OCT, FAF (NR) CNN: PSC-UNet DSC, MAE, and r2
    • 0-1 year: DSC: mean 0.25 (SD 0.16); MAE: mean 0.13 (SD 0.11)
    • 1-2 years: DSC: mean 0.38 (SD 0.20); MAE: mean 0.25 (SD 0.24);
    • 2-3 years: DSC: mean 0.38 (SD 0.21); MAE: mean 0.35 (SD 0.34);
    • >3 years: DSC: mean 0.37 (SD 0.23); MAE: mean 0.72 (SD 0.48)

    aRORA: retinal pigment epithelial and outer retinal atrophy.

    bNR: not reported.

    cOCT: optical coherence tomography.

    dCNN: convolutional neural network.

    eDSC: dice similarity coefficient.

    fAMD: age-related macular degeneration.

    gAREDS2: Age-Related Eye Disease Study 2.

    hSEN: sensitivity.

    iSPE: specificity.

    jPPV: positive predictive value.

    kNPV: negative predictive value.

    lACC: accuracy.

    mFAF: fundus autofluorescence.

    nr2: Pearson correlation coefficient.

    oP: precision.

    pR: recall.

    qEZ: ellipsoid zone.

    rRPE: retinal pigment epithelium.

    sNIR: near-infrared reflectance.

    tICC: intraclass correlation coefficient.

    uCC: correlation coefficient.

    vOR: overlap ratio.

    wMAE: mean absolute error.

    Dataset structures varied: 3 out of 11 studies used training-validation-test splits [,,]; 2 out of 11 studies used training-test sets [,]; 3 out of 11 studies used training-validation sets [,,]; and the rest adopted development–holdout [,] or development-holdout-independent test configurations []. In total, 6706 participants were included across studies. Fewer than half of the studies (4/11, 36.4%) reported demographic information, with mean age ranges spanning from 74 to 83 years [,,,]. Six studies [-,,] were ethically approved and registered on ClinicalTrials.gov under the following identifiers: NCT02503332, NCT02247479, NCT02247531, NCT02479386, NCT01229215, and NCT02399072. The DL model’s generalizability was assessed using leave-one-out cross-validation in 1 study [], 5-fold cross-validation in 7 studies [,,,,-], and 8-fold cross-validation in 1 study []. The remaining 2 studies [,] did not specify the cross-validation methodology.

    Studies used 3D-OCT, SD-OCT, NIR, and FAF images, primarily sourced from Heidelberg, Zeiss, and Bioptigen devices. While most reported image metrics, 2 studies did not specify resolution details [,]. Commonly used DL architectures included Inception v3 [,], PSC-UNet [,], U-Net [,,], EfficientNet-b3 [], and VGG16 []. In addition, some studies introduced novel approaches, such as en-face intensity maps, SLIVER-net, 3D CNN, and a recurrent neural network, for improved GA progression forecasting.

    Across various image modalities, datasets, and follow-up durations, we gathered 31 sets of performance data from the 11 studies. The performance metrics included the Hausdorff distance, concordance index, overlap ratio, SEN, SPE, accuracy, mean absolute error, Kappa, DSC, P, PPV, R, r2, and negative predictive value. The findings for a single image modality (3D-OCT, SD-OCT, or FAF) demonstrated that DL algorithms can predict GA growth rate and progression with excellent performance characteristics comparable to trained experts [-,-]. Multimodal approaches combining FAF, NIR, and SD-OCT further showed feasibility for individualized lesion growth prediction and localization [,-].
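Unlike the overlap metrics, the Hausdorff distance reported in this section measures the largest boundary disagreement between predicted and ground-truth lesions. A minimal sketch, assuming boundaries are given as lists of (x, y) pixel coordinates (real studies compute it on segmentation-mask contours, typically via a library routine such as scipy.spatial.distance.directed_hausdorff):

```python
import math

def directed_hausdorff(a, b):
    """Max over points in a of the distance to the nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between point sets a and b."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

Because it takes a maximum, the metric is sensitive to single outlying boundary points, which is why it complements area-overlap measures such as DSC when judging segmentation quality.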

    In this systematic review, we used the PROBAST tool to rigorously evaluate the prediction models across 4 domains, addressing 20 signaling questions for each paper reviewed. Within the “participants” domain, all studies used appropriate data sources; however, only 6 studies [-,,] clearly outlined their inclusion and exclusion criteria for participants, leaving the others unclear. In terms of “predictors,” these were defined and evaluated similarly for all participants, had no connection to outcome data, and were available at baseline. All studies were rated “yes” on the questions concerning outcome measurement methods, definitions, interference factors, and measurement time intervals. Concerning “analysis,” Dow [] and Zhang [] used small datasets with insufficient numbers of participants. Zhang’s model, constructed with bidirectional long short-term memory and CNN frameworks, underwent internal validation only; the lack of external validation notably limits its generalizability. Two studies by Salvi [] and Yoshida [] lacked independent and external validation. Gigon [] did not explicitly address missing data handling, data complexities, or model overfitting. Conversely, all other items were evaluated as low risk, and the applicability of the studies was universally ranked as low risk (Table S1 in ).

    Principal Findings

    This systematic review evaluated the performance of AI, particularly DL algorithms, in detecting and managing GA secondary to dry AMD using noninvasive imaging modalities. Our findings demonstrate that AI models exhibit strong capabilities in accurately detecting, segmenting, quantifying, and predicting GA progression from OCT, FAF, CFP, and NIR imaging, achieving diagnostic accuracy comparable to that of human experts. However, this review also identified several methodological challenges, such as limited sample sizes, inconsistent annotation standards, and a general lack of external validation, which may hinder the clinical generalizability and practical application of these models. Despite these limitations, AI-based tools show significant potential for future use by both specialists and nonspecialists in primary and specialty care settings.

    AI in Detecting GA With OCT, FAF, NIR, and CFP Images

    Ten studies published between 2018 and 2025 were included, involving at least 7132 participants aged 50 to 85 years. Half of the studies were conducted in the United States, while others originated from European countries. SD-OCT was the most frequently used imaging modality (6/10 studies), followed by CFP (2/10 studies), NIR (1/10 studies), and FAF (1/10 studies). Image preprocessing techniques, such as standardization of size, orientation, and intensity, as well as noise reduction, were consistently applied to enhance model stability and training efficiency. However, 3 studies did not report critical image parameters, such as resolution, potentially limiting reproducibility. DL-based algorithms, including CNNs, were the primary methodologies used for GA detection. Cross-validation techniques, such as 5-fold and 10-fold methods, were used in half of the studies to assess model robustness, though 3 studies did not report validation strategies. AI, particularly DL algorithms, holds significant promise for the detection of GA using noninvasive imaging modalities. OCT, CFP, NIR, and FAF each demonstrated robust diagnostic potential, with performance metrics rivaling or exceeding human expertise.
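The intensity-standardization step mentioned above is often as simple as min-max scaling each grayscale scan to [0, 1] before training. The sketch below illustrates the idea on nested Python lists standing in for an image array; actual pipelines operate on NumPy arrays and may add percentile clipping and denoising:

```python
def minmax_normalize(image):
    """Rescale pixel intensities of a 2D grayscale image to [0, 1]."""
    flat = [px for row in image for px in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant image: map everything to 0
        return [[0.0 for _ in row] for row in image]
    return [[(px - lo) / (hi - lo) for px in row] for row in image]
```

Normalizing every scan to a common intensity range reduces device-to-device brightness variation, one reason preprocessing was reported to improve model stability and training efficiency.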

    AI for GA Management With OCT, FAF, and NIR Images

    A total of 20 studies (14,064 participants) were published between 2019 and 2025, focusing on themes such as GA segmentation, classification, quantification, and progression prediction. The research designs and geographic regions are diverse. The studies included retrospective analysis (9/20), model development (7/20), and prospective, comparative, or cross-sectional studies (4/20). Significant contributions came from China (6/20) and the United States (7/20), with additional studies from the United Kingdom (2/20), Australia (2/20), France (1/20), Israel (1/20), and Austria (1/20). The studies used a variety of imaging modalities to assess GA, including SD-OCT, FAF, NIR, SS-OCT, and 3D-OCT. DL algorithms demonstrated remarkable performance in GA management tasks. U-Net was the most commonly used architecture. Multimodal approaches combined FAF and NIR images with DL networks to improve segmentation accuracy. Performance metrics, such as DSC, Kappa, SEN, SPE, and accuracy, consistently showed strong diagnostic accuracy, with several studies achieving performance comparable to clinical experts.

    Eleven studies with 6706 participants, published between 2021 and 2025, concentrated on the application of AI for predicting and segmenting GA lesions, as well as their growth and progression. The methodologies were diverse, including retrospective studies, model development studies, post hoc analyses, and the clinical evaluation of a DL algorithm. Participants or images were gathered from regions such as the United States, Austria, Switzerland, and multiple centers in China and the United States, ensuring broad geographic representation. Demographic information was reported in fewer than half of the studies, with mean ages ranging from 74 to 83 years. Imaging modalities, such as 3D-OCT, SD-OCT, NIR, and FAF, were obtained from devices including Bioptigen, Heidelberg Spectralis HRA+OCT, and Cirrus OCT. While image preprocessing parameters were consistent across most studies, some did not specify image resolution. Multiview CNN architectures and advanced frameworks, such as bidirectional long short-term memory networks, were used. DL algorithms exhibited excellent predictive capabilities, with multimodal approaches enabling individualized GA lesion growth prediction.

    Noninvasive Image Analysis Techniques for GA

    GA, a late-stage form of dry AMD, is marked by the irreversible loss of photoreceptors, RPE, and choriocapillaris [,]. The application of noninvasive imaging modalities has revolutionized the detection and management of GA. A comparative summary of AI performance across these modalities is provided in Table S2 in . CFP serves as a standard initial assessment tool, useful for screening and early detection. It identifies GA lesions as well-defined regions of RPE hypopigmentation with visible underlying choroidal vessels []. FAF imaging, which uses a blue excitation wavelength (488 nm), visualizes metabolic changes at the level of the photoreceptor and RPE complex and is practical for assessing GA lesion size and progression, with lesions appearing hypoautofluorescent []. In contrast to nonatrophic areas, GA lesions typically appear brighter on NIR imaging (787-820 nm), whose longer wavelength is also less harmful to the eye than that of FAF []. In addition, NIR can help delineate the boundaries of foveal lesions, where image contrast is lower on FAF []. Recently, the Classification of Atrophy Meetings group recommended that atrophy in patients with and without neovascular AMD be defined based on specific drusen characteristics and other anatomical features, noting that it is most easily characterized by OCT [,]. OCT stands out as the gold standard for GA detection and classification, providing high-resolution, cross-sectional, and en face images of the retina and choroid. SD-OCT is widely used in research and clinical trials, offering precise measurement of GA area and growth rates, while SS-OCT and 3D-OCT offer superior structural insights and potential for AI-driven automation [,,]. Despite the higher cost and technical complexity of advanced OCT technologies, their detailed GA assessment capabilities make them indispensable tools in both clinical practice and research. Furthermore, OCT provides volumetric (3D) structural data, unlike the 2D en face projections of FAF, CFP, and NIR. This allows AI to learn not only the surface appearance of atrophy but also the cross-sectional structural alterations that define and precede GA []. As technology advances, the integration of AI and further developments in imaging techniques are expected to enhance the utility of these modalities, overcoming current limitations and expanding their applications in ophthalmology.

    Advantages and Challenges of AI Architectures in Clinical Workflow

    AI addresses critical limitations of traditional GA monitoring, such as labor-intensive manual grading and intergrader variability []. Automated algorithms enable rapid, standardized analysis of large fundus image datasets, reducing clinician workload and enhancing reproducibility []. Furthermore, our review revealed a clear trend in the choice of model architectures tailored to specific clinical tasks. A critical analysis of these architectures is provided in Table S3 in . Notably, with the advancement of AI algorithm architectures, numerous studies have emerged that use these technologies to identify atrophy caused by various retinal diseases and to evaluate treatment outcomes through image analysis. Miere et al [] trained a DL-based classifier to automatically distinguish GA from atrophy secondary to inherited retinal diseases on FAF according to etiology, using 2 approaches (a trained-and-validated method and a 10-fold cross-validation method), achieving good accuracy and excellent area under the receiver operating characteristic curve (AUROC) values. In addition, a study examined the association between treatment and changes in photoreceptor lamina thickness in patients with GA secondary to AMD: this post hoc analysis supported an effect of pegcetacoplan on photoreceptors on OCT, demonstrating that treatment with the drug was associated with reduced outer retinal thinning []. Similarly, DL-based OCT image analysis assessed the therapeutic effectiveness of complement component 3 inhibition in delaying GA progression, with findings indicating decreased photoreceptor thinning and loss []. Recent studies demonstrating the application of AI algorithms in imaging further validate their potential as reliable supplements to human expertise in the diagnosis and management of GA.

    Technical Challenges and Limitations

    Despite the promising advancements in AI for GA detection and management, several technical challenges and limitations persist. A significant limitation of OCT-based AI models is their difficulty in distinguishing GA secondary to AMD from other forms of retinal atrophy; thus, the findings may not generalize to broader AMD cases or other retinal diseases, which limits their clinical applicability. In addition, images from different OCT devices show significant variability and imprecision, complicating consistent, high-quality data acquisition []. Another major challenge is the variability in algorithm performance caused by differences in training data, image acquisition protocols, and disease definitions. These differences reduce reproducibility and limit practical deployment. For instance, the absence of standardized reporting in AI studies can result in discrepancies when interpreting results and hinder comparisons between different models. Moreover, despite the high performance metrics (eg, SEN, SPE, DSC>0.85, and AUROC>0.95) reported by many studies, methodological limitations remain. All diagnostic studies included in this review were assessed as high risk in at least 1 domain (10/10), only 1 GA assessment study (1/20) was evaluated as low risk across all domains, and several prediction studies (7/11) were ranked as high or unclear risk in at least 1 domain, primarily due to small or nonrepresentative datasets and a lack of detailed reporting on image preprocessing and external validation. These methodological shortcomings may lead to an overestimation of AI model performance and reduced overall robustness, thereby decreasing the generalizability of the findings and limiting confidence in their real-world applicability.
Future studies should prioritize the use of larger, more diverse datasets and implement rigorous validation frameworks to enhance performance metrics (including detection, segmentation, quantification, and prediction accuracy) and conduct prospective, multicenter validation studies to improve clinical applicability and generalizability. Furthermore, adherence to established reporting guidelines for AI studies (such as the Standards for Reporting Diagnostic Accuracy-AI and Checklist for Artificial Intelligence in Medical Imaging [,]) would improve comprehension and transparency, allow for more meaningful comparisons between systems, and facilitate meta-analyses.

    Real-World Implications and Research Contributions

    Overall, despite some limitations, AI is constantly evolving and holds great potential to transform the health care sector []. AI can accelerate existing forms of medical analysis; however, its algorithms require further testing to be fully trusted. Clinically, AI-based automated tools show strong potential to facilitate early detection, precise quantification, and progression prediction of GA, thereby reducing the burden on retinal specialists and improving diagnostic consistency. Furthermore, DL algorithms have demonstrated effectiveness in identifying retinal image features associated with cognitive decline, dementia, Parkinson disease, and cardiovascular risk factors []. These findings indicate that AI-based retinal image analysis holds promise for transforming primary care and systemic disease management. Although most AI applications remain in the validation phase, the integration of AI with multimodal imaging, novel biomarkers, and emerging therapeutics holds promise for transforming clinical management paradigms in GA and advancing personalized medicine. Future efforts should focus on developing standardized datasets, improving algorithmic generalizability, and conducting real-world validation studies to fully integrate AI into routine ophthalmic practice.

    Conclusion

    AI, especially DL-based algorithms, holds considerable promise for the detection and management of GA secondary to dry AMD, with performance comparable to that of trained experts. This systematic review synthesizes and critically appraises the current evidence, highlighting that AI’s capabilities extend across GA management—from initial detection and precise segmentation to the forecasting of lesion progression, which informs future research directions. Meanwhile, with the development of complement inhibitors, AI-based noninvasive fundus image analysis is expected to detect, identify, and monitor GA at an early stage, thereby widening the window of opportunity for treatment. AI has strong potential to augment and streamline clinical workflows by offering automated, reproducible analysis that can assist clinicians in managing large volumes of imaging data; however, more studies are needed to further validate its effectiveness, repeatability, and accuracy.

    The authors declared that artificial intelligence (AI) or AI-assisted technologies were not used in the writing process of this manuscript.

    This research was funded by the Central High-Level Traditional Chinese Medicine Hospital Project of the Eye Hospital, China Academy of Chinese Medical Sciences (grant no GSP5-82); the National Natural Science Foundation of China (grant no 82274589); the Science and Technology Innovation Project, China Academy of Chinese Medical Sciences (grant no CI2023C008YG); the Institute-level Research Launch Fund of the Eye Hospital, China Academy of Chinese Medical Sciences (grant no kxy-202402); and the Special Project for the Director of the Business Research Office (grant no 2020YJSZX-2).

    All data generated or analyzed during this study are included in this published article and its multimedia appendix files.

    None declared.

    Edited by Amaryllis Mavragani, Stefano Brini; submitted 26.Jul.2025; peer-reviewed by Jiale Zhang, Xiaolong Liang; final revised version received 11.Oct.2025; accepted 11.Oct.2025; published 21.Nov.2025.

    © Nannan Shi, Jiaxian Li, Mengqiu Shang, Weidao Zhang, Kai Xu, Yamin Li, Lina Liang. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 21.Nov.2025.

    This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.


  • Journal of Medical Internet Research


    Digital Interventions in Youth Mental Health

    Despite having the greatest level of need, young adults have the worst access to timely and quality mental health care []. Evidence from both before and after the COVID-19 pandemic robustly shows that demand for youth mental health support significantly outstrips availability in both the health care and education systems []. In the context of a sharp rise in help-seeking, digital health interventions (mental health supports delivered via web-based or mobile-based platforms) offer enormous potential to improve outcomes, widen access, and meet the increasing demand on mental health services.

    Several meta-analyses have been conducted on digital interventions (mostly focused on cognitive behavioral therapy [CBT] or “third-wave” cognitive interventions) that address depression and anxiety in young adults. An umbrella review by Harith and colleagues [] found evidence to support the use of digital interventions, but noted that effectiveness was greatly dependent on the delivery format and the mental health problem targeted. Furthermore, Harith et al [] noted that despite young people (as “digital natives”) frequently expressing a preference for the internet as a source of health-related information, engagement with and adherence to digital health interventions is often suboptimal.

    Moderated Online Social Therapy

    One approach to improving mental health recovery in young adults is Moderated Online Social Therapy (MOST) []. MOST was initially developed as a digital mental health platform to provide a low-intensity, cost-effective, and engaging approach to prolonging the benefits of specialized Early Intervention for Psychosis (EIP) services []. MOST has shown benefits in terms of return to education and employment among participants and decreased need for emergency care, and has been shown to be cost-effective from both health care sector and societal perspectives []. MOST has since been trialed in young adults in single-arm studies of help-seeking young people aged between 16 and 25 years in Australia [] and the Netherlands [], and in young people with depression [], social anxiety [], at high risk of psychosis [], at increased risk of suicide [], with borderline personality disorders [], and in a large-scale national study of young adults in Australia [], with small to large benefits observed for social function and symptom severity. As a digital intervention, MOST consists of both evidence-based online therapy content supported by therapist contact and a Facebook-style community supported and moderated by peer support workers. Evidence of the acceptability of MOST for young people has been reported in a number of studies, including in young people with social anxiety [], emerging mental health issues [], and psychosis [].

    The design and therapeutic content of MOST are strongly influenced by self-determination theory [], an empirically supported theory of motivation, which focuses on the processes and social environments that facilitate or hamper social functioning. In terms of engagement and adherence, MOST differs from other “self-help” style digital interventions by providing access to therapeutic content online that is supported by access to a therapist. It is further supplemented by social supports, including peer support and an online community. Providing these human supports is likely to improve engagement; poor engagement is identified as a major barrier to the use of digital interventions [,,].

    Student Mental Health

    In many European countries, young adults remain in some form of education until their early twenties. Based on the figures published in 2022 by the Organisation for Economic Co-operation and Development, 54% of 18‐ to 24-year-old young adults are in some form of third-level education, rising to 59% in Europe, and up to 63% in Ireland []. According to a 2018 World Health Organization study of ~14,000 students from across 8 countries, approximately one in three screened positive for at least one common DSM-IV (Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition) anxiety, mood, or substance disorder []. Similarly, Sheldon et al [], in a meta-analysis of third-level students, reported a pooled prevalence of depression at 25% and suicide-related outcomes at 14%. Taken together, these data suggest that education settings such as universities and colleges represent a key location for the development and delivery of mental health interventions. As noted above, however, access to young adult mental health services is often limited, particularly so in university mental health services, where a student may at best have access to a short course (1-4 sessions) of counseling. In this context, MOST may provide a means to supplement existing 1:1 therapy in a scalable and cost-effective manner.

    Objectives

    The purpose of this study was to provide information about the feasibility of conducting a randomized trial of MOST in young people who recently attended a university counseling service, along with preliminary data regarding the efficacy of MOST for the purpose of a definitive randomized controlled trial. This study was carried out as part of a funded program of research entitled “Improving Psychosocial Supports in Youth Mental Health” (the PSYcHE program).

    Ethical Considerations

    This study was approved by the Galway Clinical Research Ethics Committee, Merlin Park Hospital, Galway, Ireland (reference CA2468). All participants provided informed written consent, and protocols were put in place for the proposed management of vulnerable individuals in the study. Participants were reimbursed €25 (approximately US $29) for each assessment (see below). The ethics application also detailed General Data Protection Regulation (GDPR) considerations including the pseudonymization of data and data management practices to ensure the privacy of participants. The trial was registered with ISRCTN (number 15520701).

    Setting and Inclusion Criteria

    We conducted this prospective, assessor-blind, randomized controlled pilot study at the student counseling service of the University of Galway, Ireland, which serves just over 18,000 students. We aimed to include students who attended the service with persistent mental health difficulties of at least 12 months in duration. The rationale for this was to focus on a more homogeneous sample of young people seeking help for mental health difficulties rather than more transient difficulties causing distress. Inclusion criteria were being aged between 18 and 35 years, self-reporting mental health difficulties of longer than 1 year in duration, being clinically stable, and having the ability to give consent. Clinical stability in this context was determined by the referring counselor. A participant was considered eligible for MOST based on the criteria that the participant was finishing their attendance at the counseling service and no longer required the 1:1 counseling support they had been receiving. Exclusion criteria were a history of organic impairment (including IQ<70), a history of head injury with loss of consciousness >5-minute duration, and substance abuse in the preceding month.

    Referrals to the study were made electronically by the student counselor toward the end of the participant’s short-term 1:1 counseling sessions (~4 sessions were provided). Once the referral was received, a research assistant completed the informed consent procedure, after which an initial screening was conducted. Following screening against inclusion and exclusion criteria, eligible participants were enrolled into the study and completed a baseline assessment. See for a breakdown of recruitment across the study period.

    Randomization and Masking

    Procedure

    Following baseline assessment, participants were randomly allocated to their treatment group. Randomization was implemented using Sealed Envelope [] at a 2:1 ratio of intervention to control. A block design approach was taken to account for gender, with 6 participants per block. Research assessors (master’s level psychology research assistants) were masked to group allocation. Participants were informed about their trial allocation by a research assistant who was independent from the research assessors and whose role was to “onboard” participants onto the MOST intervention platform. To check that assessors remained blind to treatment status, research assessors were asked to guess the treatment arm of the participant following each assessment. Assessors correctly guessed intervention status only 39% of the time (12/31 guesses made; χ2=2.62, P=.11), suggesting that masking was maintained.
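Randomization itself was handled by the Sealed Envelope service. Purely as an illustration of the scheme described above, a 2:1 blocked allocation with a block size of 6 (4 intervention, 2 control) might be sketched as follows; the function name is ours, and in the trial a separate block sequence would be maintained within each gender stratum:

```python
import random

def blocked_allocation(n_participants, seed=None):
    # 2:1 intervention:control in permuted blocks of 6 (4 + 2).
    # A single stratum is shown for brevity; gender stratification
    # would use one such sequence per stratum.
    rng = random.Random(seed)
    allocations = []
    while len(allocations) < n_participants:
        block = ["intervention"] * 4 + ["control"] * 2
        rng.shuffle(block)  # permute order within the block
        allocations.extend(block)
    return allocations[:n_participants]

# Every complete block preserves the 2:1 ratio exactly.
alloc = blocked_allocation(12, seed=1)
```

Blocking in this way guarantees that the allocation ratio never drifts far from 2:1 at any point during recruitment.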

    Intervention Group
    Overview

    For those randomized to the MOST arm, an onboarding process was completed, during which participants were registered with the platform and given a guided tour. MOST has been described elsewhere [,,]. In brief, participation in the platform consisted of (1) engaging with a therapy “journey,” psychotherapeutic content automatically tailored for each participant based on their responses to a questionnaire completed during onboarding (discussed further below); (2) support with the therapy journey from a therapist in the form of fortnightly ~15-minute video or telephone calls; (3) a community wall (see below); and (4) peer support for engagement with the community wall in the form of fortnightly ~30-minute video or telephone calls. Clinical and peer support workers followed established protocols provided by the MOST platform developers. Clinicians and peer support workers also met with the study principal investigator, as a group, on a monthly basis. During these meetings, engagement with each participant was reviewed and assessed against the therapist and peer support manuals.

    Therapy Journey

    The Therapy Journey took the form of interactive online therapy modules (focusing on anxiety, social anxiety, depression, and social interaction) based on third-wave CBT and primarily targeting social functioning by, for example, fostering self-efficacy (identifying personal strengths based on the strengths-based framework), positive emotions and subjective well-being (eg, practicing mindfulness and self-compassion), or positive connections with others (eg, focusing on empathy skills). Participants’ engagement with and application of this content in daily life are supported by 5 activity types: comics, reflective actions, actions, talking points, and pages. Comics are illustrated multipaneled narratives that bring therapeutic concepts to life via recurring characters, reflective actions are prompts for reflection, actions suggest a practical step (eg, behavioral experiment), talking points prompt young people and peer workers to post their thoughts and reactions to the content, and pages summarize each track and provide psychoeducation. Users have the option to save activities to a “toolkit,” so they have an accessible, personalized, and labeled bank of strategies when needed.

    Community Interaction

    The MOST community took the form of an online social network designed to foster social support. Participants are encouraged to communicate with one another and with peer and expert moderators in order to foster a sense of connection and to combat loneliness and self-stigma []. The community includes a “feed” page that allows participants to post text, images, and links to be read and responded to by other members of the community. This feed is only available to others on the MOST platform; it is moderated by clinicians, led by peer-support workers with lived experience, and informed by the evidence-based problem-solving framework []. A further feature of MOST is an online group function, referred to as “Talk it Out,” which enables users to nominate issues (eg, “how to break through shyness and make new friends?”), which are discussed in moderated groups through structured problem-solving phases (eg, brainstorming, pros and cons, and wrap-up).

    Control Group

    Those randomized to the control arm of the study continued to receive care as usual. As participants entered the trial after attending the student counseling service, they were free to seek help from usual supports both internal and external to the university (student medical services, etc). However, control participants were not provided additional supports through the trial, and in a majority of cases, were not receiving other therapeutic intervention during the period assessed. Control participants could be onboarded to MOST following their 26-week assessment.

    Outcomes Assessed

    As a pilot study, our main outcome metrics related to the feasibility of the trial included the number of participants recruited, their engagement with the treatment, and their retention at the follow-up assessment period. In addition, we also aimed to establish the feasibility of our primary and secondary outcome measures, along with some indication of treatment effect estimates that might be expected for these measures, to inform power calculations for a full trial.

    Outcome measures were administered to participants in both groups by assessors blinded to intervention allocation at baseline, 12 weeks, and 26 weeks. Participants were reimbursed €25 (approximately US $29) for their time for each assessment. The overarching aim of the PSYcHE research program under which this study was carried out was to improve psychosocial function. As such, our main outcome variable was social and occupational function. As described in detail previously [], identifying suitable measures of social and occupational function is complex, consistent with the adage that simple measures are not accurate and accurate measures are not simple. As a result, for the purposes of this study, we included two separate measures of social and occupational function. The first was the Social and Occupational Functioning Assessment Scale (SOFAS) [], an interviewer-rated global assessment of social and occupational function. The second measure was the Time Use Survey (TUS) [], an interviewer-rated assessment of constructive economic activity and structured activity.

    Given that improved social and functional outcomes may relate to changes in cognitive and social cognitive function (a hypothesis of the broader PSYcHE program), two measures of cognitive and social cognitive function were included—the Wechsler Logical Memory task [], which is a brief measure of verbal episodic memory, and the Reading the Mind in the Eyes Test (RMET) [], which is a brief measure of social cognition, measuring theory of mind.

    In terms of clinical variables, we initially intended to assess the feasibility of loneliness, as measured by the UCLA (University of California, Los Angeles) Loneliness Scale []. However, studies published soon after the start of the trial, both of MOST and of other digital interventions, indicated that some of the largest effects might be observed on measures of anxiety, mood, and distress (eg, [,,]). Consequently, the trial protocol was amended after 6 months to include additional clinical measures: a measure of anxiety (Generalized Anxiety Disorder-7; GAD-7) [] and a measure of depression (Patient Health Questionnaire-9; PHQ-9) []. These were therefore available for 51 of the 74 participants: 32 in the intervention arm and 19 in the control arm.

    Intervention Engagement Data

    Engagement data were extracted from the MOST platform once participants had completed 26 weeks on the platform. Data extracted included time spent, in minutes, working through therapeutic content as well as the number of therapeutic content activities completed, community activity (posts, comments, and reactions on the community wall), and texts and calls with therapists and peer support workers. The number of weeks spent engaging with the intervention was also calculated for each participant. Engagement was estimated as the combined total number of weeks that each participant logged into the MOST platform and moved beyond the homepage, or engaged in a call with their assigned therapist or peer support worker.

    Evidence around engagement with digital mental health interventions, including definitions of engagement and thresholds for engagement, is inconsistent and varies depending on the type of digital intervention, the participants involved, and the context in which the intervention is implemented [-]. As previous studies have measured engagement in different ways, leading to uncertainty about what engagement could be expected, we did not set an a priori indicator of engagement. In the absence of an agreed-upon measurement of engagement, we adopted a pragmatic approach informed by the adherence literature as well as rates of engagement in previous digital interventions [-]. As such, we defined engagement as active use of one or more aspects of the MOST intervention and applied the following thresholds: minimal engagement as ≥20%, partial engagement as ≥50%, and full engagement as ≥80% of the intervention period. For a 26-week trial, 20% approximates to engagement for 5 or more weeks, and 50% to engagement for 12 or more weeks.
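Using the week-based cut points reported in the Results (minimal ≥5, partial ≥12, and full ≥21 of 26 weeks), the categorization above can be sketched as follows; the function name is ours:

```python
def classify_engagement(weeks_active, minimal=5, partial=12, full=21):
    # Week thresholds correspond to roughly 20%, 50%, and 80%
    # of the 26-week intervention period.
    if weeks_active >= full:
        return "full"
    if weeks_active >= partial:
        return "partial"
    if weeks_active >= minimal:
        return "minimal"
    return "below minimal"
```

For example, a participant active for 12 of the 26 weeks would be categorized as partially engaged.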

    Statistical Analysis

    Formal sample size calculations were not performed for this pilot study. Instead, the target sample size was based on recruitment of an adequate sample for a pilot study conducted for the purposes of establishing parameters for a definitive study. Previous guidance on sample size estimation for pilot studies [,] has suggested a minimum of 60 participants as an adequate sample. A period of 36 months was initially proposed as the timeframe to recruit participants into the trial. This was based on the funding period and allowed for the fact that the majority of students were on campus for ~7 months of the year. As such, we sought to recruit a minimum average of 3 students per university term month, or approximately 20 students per year of the trial.

    To assess differences between groups, analyses of covariance (ANCOVAs) were carried out to obtain adjusted mean differences between groups with a 95% CI, while accounting for baseline variables. Effect sizes were calculated by taking the β coefficients of the treatment arm from the ANCOVAs and dividing them by the pooled baseline SD for each measure. The analysis plan did not include reporting of P values; this was based on recommendations that, as pilot studies are not fully powered, results should be interpreted with caution and the analysis should be designed to inform future trials rather than to test hypotheses []. Effect sizes were reported using the standard cutoffs of small (d=0.2), medium (d=0.5), and large (d=0.8) []. All analyses were completed at the end of the last follow-up assessments and were based on the intention-to-treat population. Analyses were carried out between the intervention and control groups as well as between participants who engaged for 5 or more weeks (minimum threshold for engagement) and the control participants. Analyses were conducted using SPSS v29 (IBM Corp).
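The effect size computation described above (the ANCOVA treatment coefficient divided by the pooled baseline SD) can be sketched as follows; the function names are ours, and the usage line takes the 12-week SOFAS numbers from Table 2 (adjusted mean difference −2.36, baseline SDs 8.26 and 8.86 for n=47 and n=27):

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    # Pooled baseline standard deviation across the two arms.
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def effect_size(beta, sd1, n1, sd2, n2):
    # Cohen's d: ANCOVA treatment coefficient scaled by pooled baseline SD.
    return beta / pooled_sd(sd1, n1, sd2, n2)

# SOFAS at 12 weeks: adjusted mean difference -2.36,
# baseline SDs 8.26 (n=47) and 8.86 (n=27) -> d of about -0.28
d = effect_size(-2.36, 8.26, 47, 8.86, 27)
```

Scaling by the baseline (rather than follow-up) SD keeps the denominator unaffected by any treatment-related change in variance.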

    Recruitment and Sample Description

    Recruitment

    Our initial target was to recruit 60 participants over a 36-month period, representing a recruitment rate of 20 participants per year. We expected that most of the recruitment would occur during the 7 months of term time, such that we would be required to recruit ~3.3 participants per month of term (semester) time (~1.67 participants per calendar month) to achieve this target. The actual number of participants recruited was 74 over the 44 months (extended by 8 months due to the COVID pandemic) between April 2021 and December 2024 (see for the CONSORT [Consolidated Standards of Reporting Trials] participant flowchart []). In terms of the university semester, this equates to ~3.1 participants recruited for each month of term time or ~1.68 participants per calendar month (see for recruitment numbers by month).

    Figure 1. Participant recruitment and retention. MOST: Moderated Online Social Therapy.
    Sample Description

    A demographic and clinical description of the sample is provided in . The sample had a mean age of 22.69 years (SD=5.34) and 72% (53/74) identified as female. Nearly 68% (50/74) of the sample lived in shared accommodation, while 24% (18/74) continued to live at home. Predictably, for a sample recruited from a student cohort, all but 2 participants were in full- or part-time education at the time of baseline assessment, with those 2 participants having just left education since initial contact.

    Table 1. Demographic and clinical description of the sample at baseline assessment.
    Intervention group (n=47) Control group (n=27)
    Age (years), mean (SD) 22.77 (6.12) 22.56 (3.71)
    Gender (male:female:genderflux:other) 12:32:1:2 5:21:0:1
    Current education status, n (%)
     Not in education 1 (2) 1 (4)
     Part time 2 (4) 3 (11)
     Full time 44 (94) 23 (85)
    Accommodation, n (%)
     Lives with parents 13 (28) 5 (19)
     Lives with others 30 (64) 20 (74)
     Lives alone 1 (2) 1 (4)
    Presenting problem (self-report), n (%)
     Anxiety 20 (43) 13 (48)
     Mood 10 (21) 7 (26)
     Academic 6 (13) 5 (19)
     Relational 4 (9) 1 (4)
     Behavioral 3 (6) 0 (0)
     Other 4 (9) 1 (4)
    Duration of difficulties (months), mean (SD) 44.45 (40.66) 52.33 (44.83)
    Clinical measures, mean (SD)
     GAD-7 11.12 (4.77) 11.63 (4.93)
     UCLA Loneliness Scale 47.96 (11.3) 48.04 (10.33)
     PHQ-9 12.32 (5.06) 13.63 (5.11)
    Risk of alcohol and drug dependency, mean (SD)
     AUDIT 7.29 (6.38) 7 (4.53)
     DUDIT 2.26 (4.62) 2.67 (4.53)
    Predicted general cognitive ability, mean (SD)
     TOPF 104.16 (8.72) 104.93 (6.63)

    aGAD-7: Generalized Anxiety Disorder-7.

    bUCLA: University of California, Los Angeles.

    cPHQ-9: Patient Health Questionnaire-9.

    dAUDIT: Alcohol Use Disorders Identification Test.

    en=45.

    fDUDIT: Drug Use Disorders Identification Test.

    gn=46.

    hTOPF: Test of Premorbid Functioning.

    In terms of duration of mental health difficulties, reflecting the inclusion criterion of experiencing difficulties for at least 12 months, participants subjectively reported durations ranging from 12 to 165 months (13.8 years) in the intervention arm and 12‐192 months (16 years) in the control arm. The most commonly self-reported difficulties were anxiety (33/74, 45%), mood (17/74, 23%), academic difficulties (11/74, 15%), and relationship difficulties (5/74, 7%). In terms of alcohol use—measured by the Alcohol Use Disorders Identification Test (AUDIT) []—59% (44/74) were classified as “low risk,” 30% (22/74) as “increasing risk,” and 7% (5/74) as “possibly dependent” on alcohol. In terms of drug use—measured by the Drug Use Disorders Identification Test (DUDIT) []—all participants were estimated as “probably not” substance dependent and none as “probably heavily dependent.”

    In terms of randomization, which was set at 2:1 intervention to control, 47 (64%) of the 74 participants were assigned to the intervention arm of the study. No randomization errors were identified during the trial period.

    Rates of Retention and Engagement

    Retention

    Retention was defined as the percentage of participants who completed outcome measures at 12 and 26 weeks. Our initial target for retention was 75% at 12 and 26 weeks, with a threshold target of 70% required to proceed to a full randomized controlled trial without amendment of the study. In terms of actual retention rates in the study, 12-week assessments were collected for 52 (70%) of the 74 participants, representing 30 (64%) of the 47 intervention participants and 22 (81%) of the 27 participants in the control arm. Twenty-six-week assessments were available for 49 (66%) of the 74 participants, representing 28 (60%) of the intervention group and 21 (78%) of the control group. In summary, this reflects an overall dropout rate of 30% at 12-week follow-up, increasing marginally to 34% at 26 weeks.
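The retention and dropout figures above follow directly from simple proportions; a minimal check (function name is ours):

```python
def retention_pct(completed, randomized):
    # Percentage retained, rounded to the nearest whole percent.
    return round(100 * completed / randomized)

# 12-week follow-up: 52 of 74 retained -> 70% (30% dropout)
# 26-week follow-up: 49 of 74 retained -> 66% (34% dropout)
twelve_week = retention_pct(52, 74)
twenty_six_week = retention_pct(49, 74)
```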

    Rates of Engagement

    As described above, engagement was captured in terms of discrete activities on MOST (see ) and then categorized in terms of number of weeks of engagement in the intervention. Participants were categorized according to their level of engagement using the following thresholds: minimal engagement ≥5 weeks (20% of the intervention period); partial engagement ≥12 weeks (50% of the intervention period); and full engagement ≥21 weeks (80% of the intervention period). Of the 47 participants in the intervention arm, we observed that 38 (81%) engaged with the program for 5 or more weeks, 39 (83%) were engaged in the first 12 weeks, and 35 (74%) continued to engage beyond 12 weeks (see ). Finally, to allow future comparison with other studies of MOST and other digital interventions, engagement was also recorded and summarized in terms of minimum, maximum, median, and mean values for total activity time; number of therapy journey activities completed; community posts made, commented on, or reacted to; and number of calls with the therapist and peer support worker. These are provided in .

    Outcome Measures

    shows the group means and SDs, adjusted mean differences, and the effect sizes for those differences between the intervention group and the control group for all outcome measures. shows the same descriptive values for participants from the intervention group who had a minimum of 20% engagement on MOST, compared to the control group.

    Table 2. Comparison of outcome data for full sample.
    Intervention group Control group Adjusted mean difference (95% CI) Effect size (d)
       Participants, n Mean (SD) Participants, n Mean (SD)
    Primary outcome measures
     Social and occupational function
      SOFAS
       0 weeks 47 79.38 (8.26) 27 77.89 (8.86)
       12 weeks 29 81.48 (9.83) 22 78.45 (10.82) –2.36 (–8 to 3.27) –0.28
       26 weeks 24 82.79 (10.26) 21 83.52 (7.7) 1.08 (–4.09 to 6.25) 0.13
      TUS constructive economic activity
       0 weeks 47 43.77 (19.13) 27 45.3 (29.52)
       12 weeks 30 36.42 (15.26) 22 48.32 (32.04) 10.86 (–1.66 to 23.37) 0.47
       26 weeks 28 38.18 (19.52) 21 42.85 (14.91) 4.36 (–6.06 to 14.78) 0.19
      TUS structured activity
       0 weeks 47 53.62 (19.33) 27 53.67 (28.23)
       12 weeks 30 46.23 (16.08) 22 59.63 (31.38) 12.74 (0.66 to 24.83) 0.56
       26 weeks 28 47.91 (21.43) 21 52.49 (16.09) 4.13 (–7.08 to 15.34) 0.18
    Secondary outcome measures
     Cognitive and social cognitive function
      RMET
       0 weeks 47 27.19 (4.3) 27 27.11 (3.48)
       12 weeks 30 27.1 (4.36) 22 25 (4.16) –0.93 (–2.67 to 0.81) –0.23
       26 weeks 28 28.21 (4.3) 21 27.48 (3.49) –0.11 (–1.98 to 1.76) –0.03
      Logical memory
       0 weeks 47 9.38 (2.89) 27 8.78 (3.06)
       12 weeks 17 8.82 (2.7) 16 8.81 (2.4) 0.02 (–1.34 to 1.38) 0.01
       26 weeks 28 9.54 (2.91) 21 8.86 (3.29) –0.69 (–2.12 to 0.75) –0.23
     Clinical measures
      UCLA Loneliness Scale
       0 weeks 46 47.96 (11.3) 26 48.04 (10.33)
       12 weeks 30 45.23 (13) 22 44.14 (12.73) –3.43 (–9.55 to 2.69) –0.31
       26 weeks 28 40.14 (13.23) 21 42.48 (12.38) –1.44 (–7.64 to 4.77) –0.13
      PHQ-9
       0 weeks 31 12.32 (5.06) 19 13.63 (5.11)
       12 weeks 20 10 (6.1) 19 10.05 (4.89) –1.72 (–5.46 to 2.02) –0.34
       26 weeks 21 8.9 (5.8) 18 9.83 (3.7) 0.54 (–2.92 to 4) 0.11
      GAD-7
       0 weeks 32 11.12 (4.77) 19 11.63 (4.93)
       12 weeks 20 9.6 (6.2) 19 9.32 (4.77) –2.36 (–5.58 to 0.85) –0.50
       26 weeks 21 8.1 (5.32) 18 8.28 (5.07) 1.08 (–2.64 to 4.81) 0.23

    aSOFAS: Social and Occupational Functioning Assessment Scale.

    bNot applicable.

    cTUS: Time Use Survey.

    dRMET: Reading the Mind in the Eyes Test.

    eWechsler Logical Memory task.

    fUCLA: University of California, Los Angeles.

    gPHQ-9: Patient Health Questionnaire-9.

    hGAD-7: Generalized Anxiety Disorder-7.

    Table 3. Comparison of outcome data between participants who engaged for a minimum of 5 weeks and control participants.
    Intervention group (>5 weeks engagement) Control group Adjusted mean difference (95% CI) Effect size (d)
       Participants, n Mean (SD) Participants, n Mean (SD)
    Primary outcome measures
     Social and occupational function
      SOFAS
       0 weeks 38 79.5 (8.3) 27 77.89 (8.86)
       12 weeks 27 81.22 (10.05) 22 78.45 (10.82) –2.16 (–7.96 to 3.64) –0.25
       26 weeks 22 81.82 (10.15) 21 83.52 (7.7) 1.91 (–3.34 to 7.16) 0.22
      TUS constructive economic activity
       0 weeks 38 46.27 (19.91) 27 45.3 (29.52)
       12 weeks 28 35.6 (15.41) 22 48.32 (32.04) 11.76 (–1.12 to 24.64) 0.49
       26 weeks 26 37.36 (20.04) 21 42.85 (14.91) 5.18 (–5.56 to 15.92) 0.21
      TUS structured activity
       0 weeks 38 55.74 (19.82) 27 53.67 (28.23)
       12 weeks 28 45.7 (16.36) 22 59.63 (31.38) 13.24 (0.73 to 25.75) 0.56
       26 weeks 26 46.58 (21.41) 21 52.49 (16.09) 5.41 (–5.89 to 16.72) 0.23
    Secondary outcome measures
     Cognitive and social cognitive function
      RMET
       0 weeks 38 27.39 (4.48) 27 27.11 (3.48)
       12 weeks 28 27.21 (4.48) 22 25 (4.16) –0.87 (–2.66 to 0.93) –0.21
       26 weeks 26 28.08 (4.35) 21 27.48 (3.49) 0.13 (–1.79 to 2.04) 0.03
      Logical memory
       0 weeks 38 9.39 (3.12) 27 8.78 (3.06)
       12 weeks 16 8.75 (2.77) 16 8.81 (2.4) 0.1 (–1.29 to 1.5) 0.03
       26 weeks 26 9.58 (2.97) 21 8.86 (3.29) –0.7 (–2.18 to 0.78) –0.23
     Clinical measures
      UCLA Loneliness Scale
       0 weeks 37 48.41 (11.72) 26 48.04 (10.33)
       12 weeks 28 46.32 (12.67) 22 44.14 (12.73) –4.4 (–10.51 to 1.72) –0.40
       26 weeks 26 40.54 (13.47) 21 42.48 (12.38) –1.23 (–7.58 to 5.11) –0.11
      PHQ-9
       0 weeks 25 12.88 (5.15) 19 13.63 (5.11)
       12 weeks 19 10.32 (6.09) 19 10.05 (4.89) –2.23 (–5.94 to 1.47) –0.44
       26 weeks 19 9.05 (6.07) 18 9.83 (3.7) 0.38 (–3.29 to 4.05) 0.07
      GAD-7
       0 weeks 26 11.58 (4.47) 19 11.63 (4.93)
       12 weeks 19 9.53 (6.38) 19 9.32 (4.77) –2.13 (–5.32 to 1.06) –0.47
       26 weeks 19 8.37 (5.47) 18 8.28 (5.07) 1.1 (–2.83 to 5.03) 0.24

    aSOFAS: Social and Occupational Functioning Assessment Scale.

    bNot applicable.

    cTUS: Time Use Survey.

    dRMET: Reading the Mind in the Eyes Test.

    eWechsler logical memory task.

    fUCLA: University of California, Los Angeles.

    gPHQ-9: Patient Health Questionnaire-9.

    hGAD-7: Generalized Anxiety Disorder-7.

    Primary Outcome Measures: Social and Occupational Functioning

    No difficulties were encountered in administering the SOFAS. In terms of group comparisons, there was a small effect on SOFAS scores at 12 weeks for those allocated to MOST (d=−0.28), with an equivalent effect size (d=−0.25) observed when only those who engaged for more than 5 of the 26 weeks (ie, >20%) were included in the intervention arm. Of note, there was no evidence of an effect of MOST on the SOFAS when measured at 26 weeks for either the full intervention group or for those who were at least minimally engaged.

    Two issues emerged in the administration of the TUS. The first was COVID-19 related: across the first 18 months of the recruitment period, time use was significantly altered by restrictions imposed due to the pandemic. The second was the difficulty of tracking activity over 6 months, given that students’ activity differed substantially depending on whether they were in college during the semester or on holiday between semesters. Consequently, changes in TUS scores were problematic to interpret in terms of the size of the intervention effect.

    Secondary Outcome Measures
    Cognitive and Social Cognitive Assessment

    As both are widely used cognitive measures, no difficulties were observed in administering either the logical memory task or the RMET. A small effect was observed on the RMET at 12 weeks for both the full intervention group (d=−0.23) and when only those who had at least minimally engaged were assessed (d=−0.21). No effect was observed on the declarative memory scale at 12 weeks. While a small effect at 26 weeks was observed in the full group (d=−0.23), this was not consistent between the full and the >20% engaged groups.

    Clinical Functioning

    As already noted, the trial protocol was amended after 6 months to also include the GAD-7 and PHQ-9, which were therefore available for 51 of the 74 participants, 32 in the intervention arm and 19 in the control arm. Across the clinical measures, small to medium effects in favor of the intervention arm were observed at 12 weeks, with effect sizes of d=−0.5 for the GAD-7, d=−0.34 for the PHQ-9, and d=−0.31 for the Loneliness Scale. Comparable effect sizes were observed in the full intervention arm and when only those with >20% engagement were used in the comparison. Again, as with other effect sizes favoring the intervention arm, these effects were not observed at 26 weeks.

    Overview of Findings

    This trial investigated the feasibility of conducting a randomized controlled trial of a moderated online intervention in a university setting. The intervention included online tailored mental health content with support from a therapist and peer-to-peer social mentoring and networking with the aim of improving mental health and social functioning. Based on recruitment at a single site, we achieved our recruitment target of 1.67 participants per month (~3.1 participants per semester month). Retention in the trial was 70% (52/74) at 12 weeks, reducing to 66% (49/74) at 26 weeks. For the intervention group, when engagement was measured in terms of participation in at least one component of the intervention (therapy journey, therapist contact, community wall participation, or peer support contact), 81% (38/47) of the intervention group engaged for 5 or more weeks of the trial (equivalent to at least 20% of the maximum 26 weeks for which the intervention was available to participants).

    While the study was not adequately powered to test for differences between intervention and control groups, calculation of effect sizes associated with mean differences between the intervention group and the control arm indicated a small benefit favoring the intervention group (d=0.28) for the SOFAS measure of social and occupational functioning (but not the TUS). Similar small effects were observed for the secondary cognitive variables of memory function and social cognitive function. Finally, slightly larger (positive) effects were observed on the clinical measures available (d=−0.5 for GAD-7, d=−0.34 for PHQ-9, and d=−0.31 for the Loneliness Scale). Across each of these primary and secondary measures, the effect sizes observed at 12 weeks were similar when either the full intervention group or only those with at least minimal engagement (5 or more weeks) were compared to the control group. However, these effect sizes reduced to less than d=0.1 at 26 weeks.
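As a rough illustration of how the reported between-group effect sizes relate to group means, a Cohen's d can be computed as a mean difference divided by a pooled standard deviation. Note that the trial's values were derived from adjusted mean differences in a statistical model, so this simple pooled-SD sketch (shown with hypothetical numbers, not the trial's data) will not exactly reproduce the tabulated figures.

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d using the pooled standard deviation of two groups."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

# Hypothetical 12-week symptom scores: intervention mean 10 (SD 6, n=20),
# control mean 12 (SD 5, n=19); a negative d favors the intervention here,
# because lower scores indicate fewer symptoms.
d = cohens_d(10, 6, 20, 12, 5, 19)
print(round(d, 2))  # -0.36
```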

    Progression to a Full Trial

    As noted above, rates of recruitment were as originally planned, and no difficulties were encountered with randomization procedures or completion of outcome measures (except for the TUS, discussed further below). The retention rate of 70% (52/74) at 12 weeks was marginally lower than the criterion of 75%, meaning that 4 fewer participants were retained than expected. Our criterion of 75% was based on Alvarez-Jimenez et al’s [] original study in patients with early psychosis. As such, it may have been unrealistic for a student population, given that previous studies of MOST in a similar sample reported a retention rate of 59% [].

    Outcome Measures Used

    From the point of view of measurement of outcomes, our primary outcome was social and occupational functioning. Social and occupational function is difficult to measure accurately at the best of times []. Additionally, our study coincided with the COVID-19 pandemic, which significantly impacted social and occupational function and time use for many participants during the study. Furthermore, given the student experience of moving between the routine of term time and the social and occupational upheaval associated with the winter and summer breaks, intervention-related changes in functioning were difficult to track accurately. While observer-rated measurement using the SOFAS was able to detect the same level of effects as observed on cognitive and clinical measures, time use did not appear to be a sensitive or reliable measure of change in function. On this basis, other measures of social function might be considered to index change in this domain in a future trial. Qualitative feedback from participants would also be useful to give insight into how social function could best be captured in this sample.

    Among the measures recorded in the study, the largest effect observed was on the GAD-7, a measure of generalized anxiety (d=−0.5). While this was not a primary outcome measure in the trial, measures of anxiety (and mood) are the measures most closely related to the therapy content of MOST given that much of the MOST therapy journeys focus on anxiety, social anxiety, or mood []. It might be expected, therefore, that the largest effects might be observed on these more proximal outcome measures.

    A noticeable difference in the effect sizes can be observed across time points, with some benefits associated with participating in the intervention arm at 12 weeks and not appearing at 26 weeks. In the Alvarez-Jimenez et al [] psychosis sample, no difference in social functioning was apparent between groups at follow-up, whereas a difference in the odds of enrolling in education or finding employment was observed. Given the near full college enrollment in our sample (as a study of students), we were obviously unlikely to see the same educational or occupational benefits. In the Van Doorn et al [] study, the significant difference in social and occupational function, as measured by the SOFAS, did persist at 26 weeks. However, that study represented a single-group design investigating changes within the intervention group over time, as opposed to between-group differences. In this study, it is unlikely that these differences between 12 and 26 weeks are explained by engagement attrition, as most of the attrition occurred in the initial couple of weeks, with 83% (39/47) engaged for at least 5 weeks and 74% (35/47) engaged beyond 12 weeks. Instead, the reduced effect at 26 weeks appears to have reflected the improved scores of the control group at 26 weeks, as indexed by the SOFAS, the social cognitive, and clinical measures. Finally, implementing MOST after participants have attended counseling sessions represents a step-down approach designed to maintain positive treatment gains begun in the initial treatment received [,]. The findings at 12 weeks suggest that this maintenance of gains was achieved in this study. However, the length of the intervention and the level of engagement needed to continue improvement warrant further investigation. This is discussed below.

    Strengths, Limitations, and Future Directions

    The purpose of a pilot study is to identify potential difficulties and avoid these prior to commencing a full trial. In terms of strengths, several aspects of the methodology for evaluating the use of MOST in a young adult student sample were supported in this study, including the feasibility of recruitment and retention, randomization, and a majority of the assessments used at each time point. In terms of the weaknesses of our methodology, we have already noted the difficulty with measuring social and occupational function. Specifically, the TUS might not be a suitable measure for use in a student population. We have also noted that the largest effect sizes for MOST are likely to be observed on clinical outcome measures that relate more directly to the intervention. As previously outlined, the addition of some of the clinical measures came after the trial had begun, and thus, data were missing for some participants. While the rationale for the inclusion of these variables is sound (see outcomes used in other MOST trials as well as digital interventions in this context [,,]), and the results are promising, further evidence is needed to examine the impact of MOST on these outcomes.

    We also note that the dropout rate was slightly higher in the intervention arm compared to the control arm. This is likely to reflect, in part, the time demands of participating in the various components of MOST. One unanswered question following this study is how long students should be expected to participate. As noted in , there were clear differences between participants in their interest or willingness to remain involved across the 26 weeks that MOST was offered. While, as noted, the vast majority remained active for more than 5 weeks, the median duration of involvement was 5 months (mean 4.32, SD 1.95). This information should inform expectations for involvement in the full trial.

    In terms of the implementation of the intervention, fidelity checklists were not used in this trial. A future evaluation of MOST in this context would benefit from adopting a fidelity procedure to ensure consistency in the delivery of the intervention. Such a checklist is being adopted by Mangelsdorf et al [] in their recently begun trial of MOST for young people with depressive symptoms.

    In line with best practice in randomized trials, this study examined the feasibility of comparing participants using MOST with care-as-usual participants. While this approach in a future full trial will be vital for investigating the effectiveness of MOST, it would also be useful to compare MOST to other existing digital mental health interventions. Given the rise in interest in digital interventions both generally and in the university context (see [,]), such a comparison would give insight into the comparative effectiveness of MOST and would allow for further exploration of attrition rates in MOST, as well as barriers and facilitators to engagement.

    As reported above, 72% (53/74) of the sample identified as female. Literature around the prevalence of mental health difficulties and around help-seeking in young people indicates that more females present with and seek help for mental health difficulties at university than males [,]. However, this overrepresentation of females in the sample limits the generalizability of the results and should be addressed in future studies of MOST.

    Other limitations in this study include potential self-selection bias in terms of participation in the intervention and completion of assessments. As noted in the participant flowchart (), 13 young people declined to participate in MOST, and attrition in the assessments was higher than in the intervention itself, with 39 participants continuing to engage with MOST at 12 weeks, but only 30 opting to complete assessments at this time. It was also outside the scope of this study to report on findings beyond 26 weeks. Further follow-up with participants would not only inform a future trial but would also give insight into the long-term efficacy of MOST. Finally, this trial was impacted by the COVID-19 pandemic. As mentioned, this had an impact on the social and occupational functioning of participants. Furthermore, some assessments took place online due to restrictions, and the pandemic may have had an impact on engagement with MOST, with participants potentially engaging differently than they might have otherwise. Further examination of engagement with MOST is thus warranted.

    Conclusions

    Based on the recruitment, retention, and engagement rates observed, this pilot feasibility study provides evidence for the feasibility of a full randomized controlled trial of MOST with a young adult population. Moreover, the effect sizes favoring the intervention arm are consistent with previous studies, suggesting that MOST may be a potentially beneficial support for youth mental health in the context of further education. This study also highlights important factors to be addressed in a full study, such as including measures of anxiety and depression as potential primary outcome variables, selecting sensitive measures of social function, and ensuring sustained engagement in the intervention.

    The authors would like to thank Dillon O’Reilly, Niamh O’Brien, Matthew Toher, Kyra Renaud, Lorcan O’Connor, and Jack Brody for their contributions to the completion of the trial, and to Talissa Walsh, Cathal Ó’Curraoin, and Sophie Mahon for their assistance with trial assessments. Our thanks to Prof Molly Byrne for ongoing advice on trial methodologies. Our thanks to study participants, clinicians, and the Youth Advisory Panel (YAP) for their participation and input. We are grateful to the Moderated Online Social Therapy (MOST) program developers and to the funders of this study. Generative artificial intelligence was not used in this study or in the generation of this manuscript.

    This work was funded by the Irish Health Research Board as part of the Research Leader Award entitled the PSYcHE program to GD (RL-20-07). MA-J was supported by an Investigator Grant (APP1177235) from the National Health and Medical Research Council and a Dame Kate Campbell Fellowship from the University of Melbourne. The funder of the study (the Health Research Board, Ireland) had no role in study design, data collection, data analysis, data interpretation, or writing of the report. GD had full access to all the data in the study and had final responsibility for the decision to submit for publication.

    The full protocol, in addition to datasets and statistical code generated during the study, will be available from the corresponding author on reasonable request.

    GD and JMC originated the conception and design of the study. GD, MDOR, SMH, EF, TB, and CH led the trial, and GD, MDOR, and SMH completed the analysis and interpretation of data. All authors reviewed and approved the manuscript.

    MA-J was involved with the development of the Moderated Online Social Therapy (MOST) program but not involved with supervising any of the assessment procedures or data analysis.

    Edited by Javad Sarvestan; submitted 10.Mar.2025; peer-reviewed by Adeleke Adekola, Ali Al-Asadi, Diana Gyimah; final revised version received 13.Jun.2025; accepted 16.Jun.2025; published 21.Nov.2025.

    © Maeve Dwan-O’Reilly, Sophie Mae Harrington, Conor Gavin, Emmet Godfrey, Megan Cowman, Christina Gleeson, Anna O’Mahony-Sinnott, James McCormack, Emma Frawley, Tom Burke, Karen O’Connor, Max Birchwood, Caroline Heary, Mario Alvarez-Jimenez, Gary Donohoe. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 21.Nov.2025.

    This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.


  • Researchers build a better AI model memory probe • The Register


    If you’ve ever wondered whether that chatbot you’re using knows the entire text of a particular book, answers are on the way. Computer scientists have developed a more effective way to coax memorized content from large language models, a…


  • Why trouble for the biggest foreign buyer of U.S. debt could ripple through America’s bond market


    By Vivien Lou Chen

    Developments in Japan are creating a risk that investors in the U.S. Treasury market may one day pull the rug out by keeping more of their savings at home

    Why turmoil around Japan’s new government could wash up in U.S. financial markets.

    Recent developments overseas have the potential to complicate the White House’s agenda to bring down borrowing costs, while heightening competition for investors in the U.S. and Japanese bond markets.

    Aggressive fiscal-stimulus efforts by the cabinet of Japan’s first female prime minister, Sanae Takaichi, have created a spike in long-dated yields of Japanese government bonds and further weakness in the yen (USDJPY) in the past few weeks. The situation is being likened to the September-October 2022 crisis in the U.K., which stemmed from a collapse in confidence over a package of unfunded tax cuts proposed by then-Prime Minister Liz Truss’s government.

    Read: Liz Truss redux? Simultaneous drop for Japanese currency and bonds draws eerie parallels

    The U.S. needs to manage the cost of interest payments given a more than $38 trillion national debt, and this is a primary motivation for why the Trump administration wants to bring down long-term Treasury yields. Last week, Treasury Secretary Scott Bessent said in a speech in New York that the U.S. is making substantial progress in keeping most market-based rates down. He also said the 10-year “term premium,” or additional compensation demanded by investors to hold the long-dated maturity, is basically unchanged. Longer-duration yields matter because they provide a peg for borrowing rates used by U.S. households, businesses and the government.

    Developments in Japan are now creating the risk that U.S. yields could rise alongside Japan’s yields. This week, Japanese government-bond yields hit their highest levels in almost two decades, with the country’s 10-year rate BX:TMBMKJP-10Y spiking above 1.78% to its highest level in more than 17 years. The 40-year yield BX:TMBMKJP-40Y climbed to an all-time high just above 3.7%.

    In the U.S., 2- BX:TMUBMUSD02Y and 10-year yields BX:TMUBMUSD10Y finished Friday’s session at their lowest levels of the past three weeks, at 3.51% and almost 4.06%, respectively. The 30-year U.S. yield BX:TMUBMUSD30Y fell to 4.71%, its lowest level since Nov. 13.

    There’s a risk now that U.S. yields may not fall as much as they otherwise might after factoring in market-implied expectations for a series of interest-rate cuts by the Federal Reserve into 2026.

    Japan’s large U.S. footprint

    Treasury yields are not going to necessarily follow rates on Japanese government bonds higher “on a one-for-one basis,” but there might be a limit on how low they can go, said Adam Turnquist, chief technical strategist at LPL Financial. He added that the impact of Japanese developments on the U.S. bond market could take years to play out, but “we care now because of the direction Japan’s policy is going in” and the possibility that this impact might occur even sooner.

    Some of the catalysts that usually tend to push Treasury yields lower, such as any commentary from U.S. monetary policymakers that suggests the Fed might be inclined to cut rates, “might be muted because of the increased value of foreign debt,” Turnquist added.

    U.S. government debt rallied for a second day on Friday, pushing yields lower, after New York Fed President John Williams said there is room to cut interest rates in the near term.

    All three major U.S. stock indexes DJIA SPX COMP closed higher Friday, but notched sharp weekly losses, as investors attempted to calm doubts over the artificial-intelligence trade.

    The troubling spike in yields on Japanese government bonds hasn’t fully spilled over into the U.S. bond market yet, but it remains a risk. “A repeat of the Truss episode is what people are afraid of,” said Marc Chandler, chief market strategist and managing director at Bannockburn Capital Markets.

    Concerns about Japan gained added significance on Friday, when Takaichi’s cabinet approved a 21.3 trillion yen (or roughly $140 billion) economic stimulus package, which Reuters described as lavish. The amount of new spending being injected into the country’s economy from a supplementary budget, much of which is not repurposed from existing funds, is 17.7 trillion yen ($112 billion).

    Anxiety over Takaichi’s stimulus efforts has resulted in a Japanese yen that has weakened against its major peers and fallen to a 10-month low ahead of Friday’s session, and in a spike in the country’s long-dated yields. Yields on 30-year BX:TMBMKJP-30Y Japanese government debt have risen this month to 3.33%.

    Japan is the biggest foreign holder of Treasurys, with a roughly 13% share, according to the most recent data from the U.S. Treasury Department, and the concern is that the country’s investors might one day pull the rug by keeping more of their savings at home.

    Bond-auction anxiety

    Earlier in the week, a weak 20-year auction in Japan was cited as one reason why U.S. Treasury yields were a touch lower in early New York trading, an indication that demand for U.S. government paper remained in place. Global investors are often incentivized to move their money based on which country offers the highest yields and best overall value.

    “The conventional wisdom is that as yields rise in Japan, the Japanese are more likely to keep their savings at home rather than export it,” Chandler said. “The Japanese have been buyers of Treasurys and U.S. stocks, and if they decide to keep their money at home, those U.S. markets could lose a bid.”

    For now, Japanese investors, which include insurers and pension funds, appear to be continuing to export their savings by buying more foreign government debt like Treasurys. Data from the U.S. Treasury Department shows that as of September, Japanese investors held just under $1.19 trillion in Treasurys, a number which has been climbing every month this year and is up from about $1.06 trillion last December.

    One reason for this is the exchange rate. The yen has depreciated against almost every major currency this year. Japanese investors have been buying U.S. Treasurys because they can diversify against the yen, which is the weakest of the G-10 currencies on an unhedged basis, according to Chandler.

    If concerns about the Takaichi government’s stimulus efforts translate into even higher yields in Japan, this could incentivize local investors to keep more of their savings at home, but might also mean rising yields for countries like the U.S.

    -Vivien Lou Chen

    This content was created by MarketWatch, which is operated by Dow Jones & Co. MarketWatch is published independently from Dow Jones Newswires and The Wall Street Journal.

    (END) Dow Jones Newswires

    11-21-25 1609ET

    Copyright (c) 2025 Dow Jones & Company, Inc.


  • Pre-Conception Hypertension Linked to Adverse Pregnancy Outcomes in IgA Nephropathy


    Recent findings on pregnancy outcomes in women with IgA nephropathy (IgAN) suggest pre-conception use of non-renin-angiotensin-aldosterone system inhibitor (RASi) antihypertensive medications correlates with increased risk of severe hypertensive…


  • Zoomer: Powering AI Performance at Meta’s Scale Through Intelligent Debugging and Optimization


    • We’re introducing Zoomer, Meta’s comprehensive, automated debugging and optimization platform for AI. 
    • Zoomer works across all of our training and inference workloads at Meta and provides deep performance insights that enable energy savings, workflow acceleration, and efficiency gains in our AI infrastructure. 
    • Zoomer has delivered training time reductions and significant QPS improvements, making it the de facto tool for AI performance optimization across Meta’s entire AI infrastructure.

    At the scale that Meta’s AI infrastructure operates, poor performance debugging can lead to massive energy inefficiency, increased operational costs, and suboptimal hardware utilization across hundreds of thousands of GPUs. The fundamental challenge is achieving maximum computational efficiency while minimizing waste. Every percentage point of utilization improvement translates to significant capacity gains that can be redirected to innovation and growth.

    Zoomer is Meta’s automated, one-stop-shop platform for performance profiling, debugging, analysis, and optimization of AI training and inference workloads. Since its inception, Zoomer has become the de facto tool across Meta for GPU workload optimization, generating tens of thousands of profiling reports daily for teams across all of our apps.

    Why Debugging Performance Matters

    Our AI infrastructure supports large-scale and advanced workloads across a global fleet of GPU clusters, continually evolving to meet the growing scale and complexity of generative AI.

    At the training level, it supports a diverse range of workloads, including powering models for ads ranking, content recommendations, and GenAI features.

    At the inference level, we serve hundreds of trillions of AI model executions per day.

    Operating at this scale means putting a high priority on eliminating GPU underutilization. Training inefficiencies delay model iterations and product launches, while inference bottlenecks limit our ability to serve user requests at scale. Removing resource waste and accelerating workflows helps us train larger models more efficiently, serve more users, and reduce our environmental footprint.

    AI Performance Optimization Using Zoomer

    Zoomer is an automated debugging and optimization platform that works across all of our AI model types (ads recommendations, GenAI, computer vision, etc.) and both training and inference paradigms, providing deep performance insights that enable energy savings, workflow acceleration, and efficiency gains.  

    Zoomer’s architecture consists of three essential layers that work together to deliver comprehensive AI performance insights: 

    Infrastructure and Platform Layer

    The foundation provides the enterprise-grade scalability and reliability needed to profile workloads across Meta’s massive infrastructure. This includes distributed storage systems using Manifold (Meta’s blob storage platform) for trace data, fault-tolerant processing pipelines that handle huge trace files, and low-latency data collection with automatic profiling triggers across thousands of hosts simultaneously. The platform maintains high availability and scale through redundant processing workers and can handle huge numbers of profiling requests during peak usage periods.

    Analytics and Insights Engine

    The core intelligence layer delivers deep analytical capabilities through multiple specialized analyzers. This includes: GPU trace analysis via Kineto integration and NVIDIA DCGM, CPU profiling through StrobeLight integration, host-level metrics analysis via dyno telemetry, communication pattern analysis for distributed training, straggler detection across distributed ranks, memory allocation profiling (including GPU memory snooping), request/response profiling for inference workloads, and much more. The engine automatically detects performance anti-patterns and also provides actionable recommendations.
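A rule-based anti-pattern detector of the kind described above can be sketched as a small classifier over aggregated profiling metrics. The metric names and thresholds below are illustrative assumptions, not Zoomer's actual rules:

```python
def classify_bottleneck(metrics):
    """Toy rule-based bottleneck classifier over aggregated profiling metrics.

    `metrics` maps illustrative keys (fractions in [0, 1]) to values; the
    keys and thresholds are arbitrary examples, not Zoomer's internal rules.
    """
    # Heavy time in collectives suggests a communication-bound workload.
    if metrics.get("comm_fraction", 0.0) > 0.4:
        return "communication-bound"
    # A starved GPU alongside a saturated host suggests a CPU-side bottleneck
    # (e.g., data loading or preprocessing).
    if metrics.get("gpu_busy", 1.0) < 0.5 and metrics.get("cpu_util", 0.0) > 0.8:
        return "cpu-bound"
    # A consistently busy GPU points at kernel-level optimization instead.
    if metrics.get("gpu_busy", 0.0) > 0.9:
        return "gpu-bound"
    return "inconclusive"

print(classify_bottleneck({"gpu_busy": 0.35, "cpu_util": 0.92, "comm_fraction": 0.1}))
# cpu-bound
```

In a real system, each rule would also carry an actionable recommendation (e.g., "increase data-loader workers" for the CPU-bound case), which is the pattern the text describes.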

    Visualization and User Interface Layer

    The presentation layer transforms complex performance data into intuitive, actionable insights. This includes interactive timeline visualizations showing GPU activity across thousands of ranks, multi-iteration analysis for long-running training workloads, drill-down dashboards with percentile analysis across devices, trace data visualization integrated with Perfetto for kernel-level inspection, heat map visualizations for identifying outliers across GPU deployments, and automated insight summaries that highlight critical bottlenecks and optimization opportunities.

    The three essential layers of Zoomer’s architecture.

    How Zoomer Profiling Works: From Trigger to Insights

    Understanding how Zoomer conducts a complete performance analysis provides insight into its sophisticated approach to AI workload optimization.

    Profiling Trigger Mechanisms

    Zoomer operates through both automatic and on-demand profiling strategies tailored to different workload types. For training workloads, which involve multiple iterations and can run for days or weeks, Zoomer automatically triggers profiling around iteration 550-555 to capture stable-state performance while avoiding startup noise. For inference workloads, profiling can be triggered on-demand for immediate debugging or through integration with automated load testing and benchmarking systems for continuous monitoring.
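The fixed stable-state capture window described above can be expressed as a simple step-based trigger. This is an illustrative sketch, not Zoomer code; the public `torch.profiler.schedule` API (with its `wait`, `warmup`, and `active` parameters) expresses the same idea for PyTorch workloads:

```python
def profiling_window(step, start=550, num_steps=5):
    """Return True for steps inside a fixed stable-state capture window.

    Defaults mirror the iteration 550-555 window mentioned in the text;
    waiting this long skips startup noise (compilation, caching, warmup).
    """
    return start <= step < start + num_steps

# Which of steps 540-559 would be profiled:
captured = [s for s in range(540, 560) if profiling_window(s)]
print(captured)  # [550, 551, 552, 553, 554]
```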

    Comprehensive Data Capture

    During each profiling session, Zoomer simultaneously collects multiple data streams to build a holistic performance picture: 

    • GPU Performance Metrics: SM utilization, GPU memory utilization, GPU busy time, memory bandwidth, Tensor Core utilization, power consumption, and clock frequencies via DCGM integration.
    • Detailed Execution Traces: Kernel-level GPU operations, memory transfers, CUDA API calls, and communication collectives via PyTorch Profiler and Kineto.
    • Host-Level Performance Data: CPU utilization, memory usage, network I/O, storage access patterns, and system-level bottlenecks via dyno telemetry.
    • Application-Level Annotations: Training iterations, forward/backward passes, optimizer steps, data loading phases, and custom user annotations.
    • Inference-Specific Data: Rate of inference requests, server latency, active requests, GPU memory allocation patterns, request latency breakdowns via Strobelight’s Crochet profiler, serving parameter analysis, and thrift request-level profiling.
    • Communication Analysis: NCCL collective operations, inter-node communication patterns, and network utilization for distributed workloads.
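    As a rough illustration of how such independent streams might be combined into one holistic picture, here is a minimal, hypothetical sketch (not Zoomer's real schema) that time-aligns samples from several sources:

```python
# Illustrative sketch: merge independent data streams (GPU metrics,
# host metrics, annotations) into one time-aligned record per sample.
from collections import defaultdict

def merge_streams(*streams):
    """Each stream is a list of (timestamp, {metric: value}) samples.
    Returns {timestamp: merged metric dict} across all streams."""
    merged = defaultdict(dict)
    for stream in streams:
        for ts, metrics in stream:
            merged[ts].update(metrics)
    return dict(merged)

gpu = [(0, {"sm_util": 0.81}), (1, {"sm_util": 0.42})]
host = [(0, {"cpu_util": 0.30}), (1, {"cpu_util": 0.95})]
profile = merge_streams(gpu, host)
# At t=1, low SM utilization paired with high CPU utilization suggests
# a CPU-bound (e.g., data loading) bottleneck.
```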

    Distributed Analysis Pipeline

    Raw profiling data flows through sophisticated processing systems that deliver multiple types of automated analysis including:

    • Straggler Detection: Identifies slow ranks in distributed training through comparative analysis of execution timelines and communication patterns.
    • Bottleneck Analysis: Automatically detects CPU-bound, GPU-bound, memory-bound, or communication-bound performance issues.
    • Critical Path Analysis: Systematically identifies the longest execution paths to focus optimization efforts on highest-impact opportunities.
    • Anti-Pattern Detection: Rule-based systems that identify common efficiency issues and generate specific recommendations.
    • Parallelism Analysis: Deep understanding of tensor, pipeline, data, and expert parallelism interactions for large-scale distributed training.
    • Memory Analysis: Comprehensive analysis of GPU memory usage patterns, allocation tracking, and leak detection.
    • Load Imbalance Analysis: Detects workload distribution issues across distributed ranks and recommends optimizations.
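    Straggler detection of the kind listed above can be approximated by comparing per-rank iteration times against the fleet median. This is a simplified sketch with an assumed 15% threshold, not Zoomer's actual algorithm:

```python
# Toy straggler detector: in synchronous distributed training every
# rank waits for the slowest one, so any rank well above the median
# iteration time delays the whole job.
from statistics import median

def find_stragglers(rank_times, threshold=1.15):
    """rank_times: {rank: mean iteration time in seconds}.
    Flags ranks slower than `threshold` x the median time."""
    med = median(rank_times.values())
    return sorted(r for r, t in rank_times.items() if t > threshold * med)

times = {0: 1.02, 1: 0.99, 2: 1.31, 3: 1.01}
stragglers = find_stragglers(times)  # rank 2 is ~30% above the median
```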

    Multi-Format Output Generation

    Results are presented through multiple interfaces tailored to different user needs: interactive timeline visualizations showing activity across all ranks and hosts, comprehensive metrics dashboards with drill-down capabilities and percentile analysis, trace viewers integrated with Perfetto for detailed kernel inspection, automated insights summaries highlighting key bottlenecks and recommendations, and actionable notebooks that users can clone to rerun jobs with suggested optimizations.

    Specialized Workload Support

    For massive distributed training of specialized workloads such as GenAI, Zoomer includes a purpose-built platform for LLM workloads that offers specialized capabilities, including GPU efficiency heat maps and N-dimensional parallelism visualization. For inference, specialized analysis currently covers single-GPU models and will soon expand to massive distributed inference across thousands of servers.

    A Glimpse Into Advanced Zoomer Capabilities

    Zoomer offers an extensive suite of advanced capabilities designed for different AI workload types and scales. While a comprehensive overview of all features would require multiple blog posts, here’s a glimpse at some of the most compelling capabilities that demonstrate Zoomer’s depth:

    Training Powerhouse Features:

    • Straggler Analysis: Helps identify ranks in distributed training jobs that are significantly slower than others, causing overall job delays due to synchronization bottlenecks. Zoomer provides information that helps diagnose root causes like sharding imbalance or hardware issues.
    • Critical Path Analysis: Identification of the longest execution paths in PyTorch applications, enabling accurate performance improvement projections.
    • Advanced Trace Manipulation: Sophisticated tools for compressing, filtering, combining, and segmenting massive trace files (2GB+ per rank), enabling analysis of large-scale training jobs that were previously impossible to process.
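    Critical path analysis reduces to a longest-path computation over the dependency DAG of operations: only speeding up ops on that path shortens the job. The sketch below uses hypothetical op names and durations to show the core idea; Zoomer's real analysis operates on full PyTorch traces:

```python
# Longest-path sketch over a tiny dependency DAG of training ops.
# Op names and durations are made up for illustration.
from functools import lru_cache

durations = {"load": 5, "h2d_copy": 2, "forward": 8,
             "backward": 12, "allreduce": 6, "step": 1}
deps = {"h2d_copy": ["load"], "forward": ["h2d_copy"],
        "backward": ["forward"], "allreduce": ["backward"],
        "step": ["allreduce"]}

@lru_cache(maxsize=None)
def finish(op):
    """Earliest finish time of `op`: its duration plus the latest
    finish among its dependencies."""
    return durations[op] + max((finish(d) for d in deps.get(op, [])), default=0)

critical_path_length = max(finish(op) for op in durations)
# Here backward (12) dominates: shaving it directly shortens the path,
# while speeding up ops off the critical path would not.
```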

    Inference Excellence Features:

    • Single-Click QPS Optimization: A workflow that identifies bottlenecks and triggers automated load tests with one click, reducing optimization time and delivering QPS improvements of 2% to 50% depending on model characteristics.
    • Request-Level Deep Dive: Integration with Crochet profiler provides Thrift request-level analysis, enabling identification of queue time bottlenecks and serving inefficiencies that traditional metrics miss.
    • Realtime Memory Profiling: GPU memory allocation tracking, providing live insights into memory leaks, allocation patterns, and optimization opportunities.
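    Allocation tracking and leak detection can be illustrated with a toy ledger that pairs allocations with frees; anything still outstanding at the end of a window is a leak candidate. This is a conceptual sketch, not Zoomer's memory profiler:

```python
# Toy allocation ledger: record allocs and frees, then report what is
# still live at the end of a profiling window as leak candidates.
class AllocTracker:
    def __init__(self):
        self.live = {}  # ptr -> size of outstanding allocations

    def alloc(self, ptr, size):
        self.live[ptr] = size

    def free(self, ptr):
        self.live.pop(ptr, None)

    def leak_candidates(self):
        """Allocations never freed by the end of the window."""
        return dict(self.live)

t = AllocTracker()
t.alloc(0x10, 1024)
t.alloc(0x20, 4096)
t.free(0x10)
t.leak_candidates()  # {0x20: 4096} -- 4 KiB still outstanding
```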

    GenAI Specialized Features:

    • LLM Zoomer for Scale: A purpose-built platform supporting 100k+ GPU workloads with N-dimensional parallelism visualization, GPU efficiency heat maps across thousands of devices, and specialized analysis for tensor, pipeline, data, and expert parallelism interactions.
    • Post-Training Workflow Support: Enhanced capabilities for GenAI post-training tasks including SFT, DPO, and ARPG workflows with generator and trainer profiling separation.

    Universal Intelligence Features:

    • Holistic Trace Analysis (HTA): Advanced framework for diagnosing distributed training bottlenecks across communication overhead, workload imbalance, and kernel inefficiencies, with automatic load balancing recommendations.
    • Zoomer Actionable Recommendations Engine (Zoomer AR): Automated detection of efficiency anti-patterns with machine learning-driven recommendation systems that generate auto-fix diffs, optimization notebooks, and one-click job re-launches with suggested improvements.
    • Multi-Hardware Profiling: Native support across NVIDIA GPUs, AMD MI300X, MTIA, and CPU-only workloads with consistent analysis and optimization recommendations regardless of hardware platform.

    Zoomer’s Optimization Impact: From Debugging to Energy Efficiency

    Performance debugging with Zoomer creates a cascading effect that transforms low-level optimizations into massive efficiency gains. 

    The optimization pathway flows from: identifying bottlenecks → improving key metrics → accelerating workflows → reducing resource consumption → saving energy and costs.

    Zoomer’s Training Optimization Pipeline

    Zoomer’s training analysis identifies bottlenecks in GPU utilization, memory bandwidth, and communication patterns. 

    Example of Training Efficiency Wins: 

    • Algorithmic Optimizations: We delivered power savings through systematic efficiency improvements across the training fleet by fixing reliability issues in low-efficiency jobs.
    • Training Time Reduction Success: In 2024, we observed a 75% training time reduction for Ads relevance models, leading to a 78% reduction in power consumption.
    • Memory Optimizations: One-line code changes fixing inefficient memory copies identified by Zoomer delivered 20% QPS improvements with minimal engineering effort.
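    As a generic illustration of the kind of one-line memory-copy fix described above (the actual change from the post is not shown), replacing a per-element copy with a single bulk copy:

```python
# Hypothetical example of an inefficient per-element copy versus a
# single contiguous bulk copy -- the classic one-line fix a profiler
# surfaces. Uses stdlib `array` as a stand-in for a tensor buffer.
import array

src = array.array("f", range(1_000))

# Inefficient: one tiny copy per element.
dst_slow = array.array("f", (x for x in src))

# One-line fix: a single contiguous copy of the whole buffer.
dst_fast = array.array("f")
dst_fast.frombytes(src.tobytes())
```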

    Zoomer’s Inference Optimization Pipeline

    Inference debugging focuses on latency reduction, throughput optimization, and serving efficiency. Zoomer identifies opportunities in kernel execution, memory access patterns, and serving parameter tuning to maximize requests per GPU.

    Inference Efficiency Wins:

    • GPU and CPU Serving Parameter Improvements: Automated GPU and CPU bottleneck identification and parameter tuning, leading to a 10% to 45% reduction in power consumption.
    • QPS Optimization: GPU trace analysis used to boost serving QPS and optimize serving capacity.

    Zoomer’s GenAI and Large-Scale Impact

    For massive distributed workloads, even small optimizations compound dramatically. Optimizations on a 32k-GPU benchmark achieved 30% speedups by resolving a broadcast issue, while 64k-GPU configurations delivered 25% speedups after just one day of optimization.

    The Future of AI Performance Debugging

    As AI workloads expand in size and complexity, Zoomer is advancing to meet new challenges on several innovation fronts: broadening unified performance insights across heterogeneous hardware (including MTIA and next-gen accelerators), building advanced analyzers for proactive optimization, enabling inference performance tuning through serving parameter optimization, and democratizing optimization with automated, intuitive tools for all engineers. As Meta’s AI infrastructure continues its rapid growth, Zoomer plays an important role in helping us innovate efficiently and sustainably.

