Introduction
The diagnosis and classification of esophageal motility disorders have evolved substantially since the introduction of high-resolution esophageal manometry (HRM) in the early 2000s []. This technology, characterized by closely spaced pressure sensors that generate spatiotemporal pressure topography displays, has reshaped our understanding of esophageal physiology and pathophysiology [,]. The subsequent development and iterative refinement of the Chicago Classification, now in its fourth version, has established a standardized framework for HRM interpretation that has become the global standard for esophageal motility assessment [,]. Despite these advances, significant challenges persist in clinical practice, including substantial interobserver variability even among expert interpreters, time-intensive analysis requirements, and the extensive training needed to achieve competency in HRM interpretation [,].
In recent years, interest in applying artificial intelligence (AI) to medical data has surged [,]. AI in medicine encompasses methods ranging from classical statistical models to advanced deep learning and even generative models. These approaches can rapidly analyze large datasets and automatically extract complex features, making them well-suited to assist in health care data interpretation []. Gastroenterology has seen rapid exploration of AI for endoscopic image analysis, pathology slide interpretation, and other tasks []. Recent comprehensive reviews have demonstrated AI’s expanding role across gastroenterological applications, from polyp detection to diagnostic decision support systems, with particular promise in image-based diagnostics []. Large language models have also emerged as potential tools for clinical documentation and patient education in gastroenterology, though their role in technical interpretation remains under investigation []. Within the field of neurogastroenterology and motility, AI technologies offer particularly compelling advantages given the pattern-based nature of HRM interpretation and the quantitative parameters inherent to manometric analysis. Machine learning algorithms excel at pattern recognition tasks, potentially surpassing human capabilities in identifying subtle abnormalities and maintaining consistent diagnostic criteria application [,]. Furthermore, AI systems can process vast quantities of data instantaneously, enabling real-time interpretation that could transform clinical workflow efficiency [,]. Recent reviews have examined AI applications in general gastroenterology [-]. However, a focused analysis of HRM-specific applications remains lacking.
The evolution of AI methodologies in medical imaging and signal processing has particular relevance to HRM analysis []. Early applications relied on traditional machine learning approaches such as support vector machines and random forests, which required manual feature extraction and engineering [,]. These methods, while showing promise, were limited by their dependence on predefined features and inability to capture complex spatiotemporal patterns inherent to esophageal pressure topography. The advent of deep learning, particularly convolutional neural networks (CNNs), has revolutionized medical image analysis by enabling automatic feature learning directly from raw data [,]. For HRM, this capability allows AI systems to identify novel patterns and relationships that may not be apparent to human observers or captured by traditional metrics. Recent systematic assessments of AI tools in esophageal dysmotility diagnosis have documented the progression from basic automation of landmark identification to sophisticated deep learning models capable of comprehensive Chicago Classification diagnosis []. Contemporary applications now encompass not only HRM but also impedance-pH monitoring, demonstrating the broadening scope of AI in esophageal diagnostics [].
Recent technological advances have further expanded the potential applications of AI in esophageal motility assessment. The integration of complementary diagnostic modalities, such as Functional Luminal Imaging Probe (FLIP) technology and high-resolution impedance manometry, provides multidimensional data that can enhance diagnostic accuracy []. AI platforms have demonstrated 89% accuracy in automated interpretation of FLIP Panometry studies, validating the feasibility of automated esophageal motility classification during endoscopy []. AI systems are uniquely positioned to synthesize these complex, multimodal datasets, potentially revealing pathophysiological insights that single-modality assessment cannot provide []. Moreover, the development of cloud-based computing infrastructure and edge computing capabilities enables the deployment of sophisticated AI models in diverse clinical settings, from tertiary referral centers to community practices [,]. The emergence of generative artificial intelligence and large language model–assisted development has further accelerated model creation, with recent studies demonstrating the successful implementation of Gemini-assisted (Google LLC) deep learning for automated HRM diagnosis, achieving high diagnostic precision across multiple motility disorder categories [].
Despite these promising developments, no comprehensive systematic review has evaluated the full spectrum of AI applications in HRM interpretation or assessed their methodological quality. Therefore, this systematic review aims to (1) systematically evaluate current AI applications in HRM interpretation, (2) assess diagnostic accuracy across different AI methodologies, (3) evaluate methodological quality, and (4) identify barriers to clinical implementation and future research priorities.
Methods
Study Design
The protocol was registered in PROSPERO (International Prospective Register of Systematic Reviews; CRD420251154237) before initiating the search. This systematic review followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 reporting guidelines [], the PRISMA-Diagnostic Test Accuracy checklist [], and the PRISMA-S (an extension to the PRISMA statement for reporting literature searches in systematic reviews) checklist [].
Database and Searching Strategy
We searched PubMed/MEDLINE, Embase, the Cochrane Library, and Web of Science through September 2025 for studies using AI or machine learning to interpret esophageal HRM. Search strategies incorporated keywords and indexed terms, including (“artificial intelligence” OR “machine learning” OR “deep learning” OR “neural network” OR “computer-aided diagnosis”) AND (“high-resolution manometry” OR “HRM” OR “esophageal manometry” OR “esophageal motility” OR “Chicago Classification”). Gray literature sources were also searched to reduce publication bias.
Database: MEDLINE (through PubMed)
#1 “artificial intelligence”[tiab] OR “machine learning”[tiab] OR “deep learning”[tiab] OR “neural network”[tiab] OR “computer-aided diagnosis”[tiab]: 345034
#2 “high-resolution manometry”[tiab] OR “HRM”[tiab] OR “esophageal manometry”[tiab] OR “esophageal motility”[tiab] OR “Chicago Classification”[tiab] OR “Gastrointestinal motility”[tiab]: 15092
#3 #1 AND #2: 116
#4 #3 AND English[Lang]: 114
Database: Embase-OVID
#1 ‘artificial intelligence’:ab,ti,kw OR ‘machine learning’:ab,ti,kw OR ‘deep learning’:ab,ti,kw OR ‘neural network’:ab,ti,kw OR ‘computer-aided diagnosis’:ab,ti,kw: 173049
#2 ‘high-resolution manometry’:ab,ti,kw OR ‘HRM’:ab,ti,kw OR ‘esophageal manometry’:ab,ti,kw OR ‘esophageal motility’:ab,ti,kw OR ‘Chicago Classification’:ab,ti,kw OR ‘Gastrointestinal motility’:ab,ti,kw: 38254
#3 #1 AND #2: 73
#4 #3 AND ([article]/lim OR [article in press]/lim OR [review]/lim) AND [English]/lim: 39
Database: Cochrane Library (Through Wiley)
#1 ‘artificial intelligence’:ab,ti,kw OR ‘machine learning’:ab,ti,kw OR ‘deep learning’:ab,ti,kw OR ‘neural network’:ab,ti,kw OR ‘computer-aided diagnosis’:ab,ti,kw: 11482
#2 ‘high-resolution manometry’:ab,ti,kw OR ‘HRM’:ab,ti,kw OR ‘esophageal manometry’:ab,ti,kw OR ‘esophageal motility’:ab,ti,kw OR ‘Chicago Classification’:ab,ti,kw OR ‘Gastrointestinal motility’:ab,ti,kw: 4636
#3 #1 AND #2: 36
Database: Web of Science
#1 ab=(“artificial intelligence” OR “machine learning” OR “deep learning” OR “neural network” OR “computer-aided diagnosis”): 645285
#2 ab=(“high-resolution manometry” OR “HRM” OR “esophageal manometry” OR “esophageal motility” OR “Chicago Classification” OR “Gastrointestinal motility”): 9769
#3 #1 AND #2: 138
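For reproducibility, the MEDLINE strategy above can be rerun programmatically against the public NCBI E-utilities esearch endpoint. The following Python sketch assembles the same term blocks and English-language limit; it assumes network access, and hit counts will drift from the September 2025 snapshot reported above.

```python
# Minimal sketch: rerunning the MEDLINE (PubMed) strategy via NCBI E-utilities.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

ai_terms = ['"artificial intelligence"[tiab]', '"machine learning"[tiab]',
            '"deep learning"[tiab]', '"neural network"[tiab]',
            '"computer-aided diagnosis"[tiab]']
hrm_terms = ['"high-resolution manometry"[tiab]', '"HRM"[tiab]',
             '"esophageal manometry"[tiab]', '"esophageal motility"[tiab]',
             '"Chicago Classification"[tiab]', '"Gastrointestinal motility"[tiab]']

# (#1 AND #2) AND English[Lang], mirroring steps #3 and #4 of the strategy
query = f'({" OR ".join(ai_terms)}) AND ({" OR ".join(hrm_terms)}) AND English[Lang]'

resp = requests.get(ESEARCH, params={"db": "pubmed", "term": query,
                                     "retmax": 200, "retmode": "json"})
result = resp.json()["esearchresult"]
print("Hits:", result["count"])          # 114 at the September 2025 snapshot
print("First PMIDs:", result["idlist"][:10])
```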
Additional information sources were systematically searched to identify gray literature and unpublished studies. We searched the medRxiv preprint server [] using the same search terms (advanced search tab) to identify studies not yet formally published. ClinicalTrials.gov [] was searched to identify ongoing or completed trials that may not have been published. Reference lists of all included studies and relevant systematic reviews were manually screened to identify additional eligible studies. No forward citation searches were performed in citation databases.
The search strategy was peer reviewed by information scientists who have extensive expertise in systematic review methodology and database search strategies.
The results from all database searches were exported and deduplicated using EndNote 20 (Clarivate Analytics, 2020). Automated deduplication was performed using EndNote’s duplicate identification algorithm, followed by manual review to identify and remove any remaining duplicates based on title, author, year, and journal. Two reviewers (CSB and EJG) independently screened studies, and discrepancies were resolved by discussion.
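EndNote’s duplicate identification algorithm is proprietary, so as an illustration only, the following Python sketch applies the same matching fields (title, author, and year) with a simple normalization rule; the record structure is hypothetical, and flagged pairs would still undergo the manual review described above.

```python
# Illustrative sketch of key-based deduplication on exported records.
import re

def dedup_key(record: dict) -> tuple:
    """Normalize title, first author surname, and year into a comparison key."""
    title = re.sub(r"[^a-z0-9 ]", "", record["title"].lower()).strip()
    first_author = record["authors"][0].split(",")[0].strip().lower()
    return (title, first_author, record["year"])

def deduplicate(records: list[dict]) -> list[dict]:
    seen, unique = set(), []
    for rec in records:
        key = dedup_key(rec)
        if key not in seen:  # keep the first occurrence of each key
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"title": "AI for HRM interpretation.", "authors": ["Kou, W"], "year": 2021},
    {"title": "AI for HRM Interpretation", "authors": ["Kou, W"], "year": 2021},
]
print(len(deduplicate(records)))  # -> 1; near-duplicates still need manual review
```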
Inclusion and Exclusion Criteria
We included both prospective and retrospective studies that applied an AI-based algorithm to HRM measurements for diagnosing or classifying esophageal motility disorders (eg, achalasia subtypes, esophagogastric junction outflow obstruction, distal esophageal spasm, hypercontractile esophagus, and ineffective esophageal motility). We excluded nonhuman studies, conference abstracts without full text, studies focusing on anorectal manometry, and studies of other modalities (such as FLIP or pH-impedance monitoring) unless they directly involved HRM data integration.
The detailed inclusion criteria were as follows: (1) original research applying AI, machine learning, or deep learning techniques to HRM data; (2) evaluation of diagnostic accuracy, classification performance, or clinical outcomes; (3) inclusion of human participants or HRM studies; and (4) provision of quantitative performance metrics. The exclusion criteria were as follows: (1) review papers, editorials, or case reports without original data; (2) studies using only conventional manometry without high-resolution capabilities; (3) studies applying AI exclusively to other esophageal diagnostic modalities without HRM integration; and (4) studies lacking sufficient methodological detail for quality assessment.
Data Extraction
Two independent reviewers (CSB and EJG) systematically extracted data using a standardized, prepiloted form. Extracted variables included study characteristics (authors, year, country, and design), patient demographics (sample size, age, and sex distribution), HRM technical specifications (equipment, protocol, and Chicago Classification version), AI methodology (algorithm type, architecture, and training approach), dataset characteristics (size, split ratios, and validation method), performance metrics (sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve [AUROC]), clinical outcomes when available, and implementation considerations. Discrepancies were resolved through consensus or arbitration by a third reviewer (GHB). Authors were contacted for missing or unclear data, with a maximum of 3 contact attempts over 4 weeks.
Study Outcomes
Primary outcome measures included diagnostic accuracy metrics for AI systems compared to expert interpretation as the reference standard. Sensitivity, specificity, positive and negative predictive values, and accuracy were calculated when raw data were available. For studies reporting only AUROC values, these were extracted directly. Meta-analysis was planned if sufficient homogeneity existed across studies; however, due to significant heterogeneity in AI approaches, patient populations, and outcome definitions, a narrative synthesis was performed.
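Where raw 2x2 counts were available, the primary metrics were derived as below; this Python sketch uses illustrative counts that do not correspond to any included study.

```python
# Illustrative 2x2 calculation against the expert reference standard.
tp, fp, fn, tn = 45, 5, 8, 42  # hypothetical true/false positives and negatives

sensitivity = tp / (tp + fn)                # proportion of disease detected
specificity = tn / (tn + fp)                # proportion of nondisease cleared
ppv = tp / (tp + fp)                        # positive predictive value
npv = tn / (tn + fn)                        # negative predictive value
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"Sens {sensitivity:.2f}, Spec {specificity:.2f}, "
      f"PPV {ppv:.2f}, NPV {npv:.2f}, Acc {accuracy:.2f}")
```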
Secondary outcomes included external validation performance compared to internal validation, processing time for automated interpretation, comparison with trainee interpretation, interrater reliability metrics, and clinical outcomes when reported. Subgroup analyses examined performance differences by AI methodology (traditional machine learning vs deep learning), disorder category according to the Chicago Classification, validation approach (internal vs external), and year of publication to assess temporal trends.
Quality Assessment
We assessed the methodological quality and risk of bias of each included study using the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies-2) tool. This tool evaluates risk of bias in 4 domains: patient selection, index test, reference standard, and flow and timing. For each domain, we judged the risk of bias as low, high, or unclear based on the information reported in the study, and we also noted any concerns regarding applicability to the review question []. Two reviewers (CSB and EJG) performed the QUADAS-2 assessments independently, with disagreements resolved through discussion.
Results
Study Selection and Inclusion
The literature search yielded 411 records from databases and 1 additional record from manual screening. After removal of duplicates, 175 records remained. Following title and abstract screening, 100 full-text papers were assessed for eligibility; of these, 83 were excluded. Ultimately, 17 studies met the inclusion criteria (Figure 1).
Figure 1 presents the PRISMA flow diagram for this systematic review of AI applications in HRM (2013-2025). The literature search across PubMed/MEDLINE, Embase, the Cochrane Library, and Web of Science (database inception through September 2025) identified studies applying AI, machine learning, or deep learning techniques to interpret HRM for the diagnosis of esophageal motility disorders. The diagram illustrates the screening process.
Study Characteristics
Studies were published between 2013 and 2025, with 82% (14/17) published in 2020 or later, reflecting the recent emergence of this field. Studies with clearly documented patient numbers included Hoffman et al [] (30 participants with dysphagia), Rohof et al [] (50 patients with gastroesophageal reflux disease), Jungheim et al [] (15 healthy volunteers), Kou et al [] (2161 HRM cases), Kou et al [] (1741 HRM cases), Wang et al [] (229 esophageal motility cases from 229 individuals), Surdea-Blaga et al [] (192 HRM studies), Rafieivand et al [] (67 patients), Zifan et al [] (60 patients), and Lankarani et al [] (43 patients). The total confirmed count from studies with explicit numbers was at least 4588 patients, although several studies did not report exact patient numbers. Study designs were predominantly retrospective cohort studies (n=15, 88%), with the remaining 2 (12%) being methodological development studies (Rohof et al [] and Kou et al []). No prospective validation studies were identified. All studies used the Chicago Classification as the reference standard, with versions varying across studies.
| Study and year | Country | Sample size | AIb method | Study aims | Performance | Validation | Chicago classification |
| Hoffman et al, 2013 [] | United States | 30 participants with dysphagia (335 swallows) | Artificial neural network | Pharyngeal HRMj swallow classification | 86.5%-94% accuracy | Internal validation only | Unspecified |
| Rohof et al, 2014 [] | Australia | 50 patients with GERDe | AIMplotf automated analysis | Automated impedance manometry (AIMg) analysis | ICCh 0.94-0.95 | Inter- and intrarater | v2.0 |
| Jungheim et al, 2016 [] | Germany | 15 healthy volunteers | Machine learning | UESi restitution time calculation | | Expert comparison | v2.0 |
| Jell et al, 2020 [] | Germany | | Supervised machine learning | Automated swallow detection | 97.7% accuracy | Internal validation only | Unspecified |
| Czako et al, 2021 [] | Romania | 2437 images | InceptionV3 (Google LLC) CNNk | IRPl classification | 97% accuracy | Internal validation only | v2.0 |
| Kou et al, 2021 [] | United States | 2161 HRMj cases (32,415 swallows) | Variational autoencoder and supervised LSTMm | Swallow pattern analysis and classification | 83% accuracy (LSTMm) | Internal validation only | v2.0 |
| Kou et al, 2022 [] | United States | | | | | Internal validation only | v3.0 |
| Wang et al, 2021 [] | China | 229 cases from 229 individuals | Bidirectional convolutional LSTMm | Esophageal motility classification | 91.32% overall accuracy | Internal validation only | v3.0 |
| Kou et al, 2022 [] | United States | | | | | Internal validation only | v3.0 |
| Surdea-Blaga et al, 2022 [] | Romania | 192 HRMj studies | CNNk | Chicago Classification automation | 86% accuracy | Internal validation only | v3.0 |
| Popa et al, 2022 [] | Romania | | CNNk | Chicago Classification automation | 94% accuracy | Internal validation only | v3.0 |
| Rafieivand et al, 2023 [] | Iran | 67 patients | Fuzzy framework with graph neural network | Swallow- and patient-level classification | 78% swallow-level; 92.54% patient-level accuracy | Internal validation only | v3.0 |
| Zifan et al, 2023 [] | United States | 60 patients | Logistic regression, random forest, and k-nearest neighbors | Distension-contraction analysis in functional dysphagia | 91.7% (proximal) and 90.5% (distal) accuracy | Internal validation only | v4.0 |
| Zifan et al, 2024 [] | United States | 60 patients | Support vector machine | Distension-contraction plot analysis | AUROCd 0.95 | Internal validation only | v4.0 |
| Lankarani et al, 2024 [] | Iran | 43 patients | Artificial neural network with acoustic input | Noninvasive IRPl prediction | 97% accuracy | Internal validation only | v4.0 |
| Popa et al, 2024 [] | Romania | | Gemini (Google LLC)-assisted deep learning (LLMn) | Chicago Classification diagnosis | | Internal validation only | v3.0 |
| Wu et al, 2025 [] | China | | Mixed attention ensemble | | | Internal validation only | v4.0 |
aCharacteristics and outcomes of 17 included studies evaluating artificial intelligence for high-resolution manometry interpretation (2013-2025). Studies encompassed at least 4588 patients from 6 countries (United States, Australia, Germany, Romania, Iran, and China), with sample sizes ranging from 15 to 2161 participants. The table presents study design (retrospective, prospective, or validation studies), patient population characteristics, artificial intelligence methodology used (traditional machine learning vs deep learning approaches), specific diagnostic tasks (eg, Chicago Classification diagnosis, integrated relaxation pressure classification, and swallow type identification), reference standards used for model training or validation, diagnostic performance metrics (accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve), and key findings.
bAI: artificial intelligence.
cMBSImP: Modified Barium Swallow Impairment Profile.
dAUROC: area under the receiver operating characteristic curve.
eGERD: gastroesophageal reflux disease.
fAIMplot: automated impedance manometry analysis.
gAIM: automated impedance manometry.
hICC: intraclass correlation coefficient.
iUES: upper esophageal sphincter.
jHRM: high-resolution manometry.
kCNN: convolutional neural network.
lIRP: integrated relaxation pressure.
mLSTM: long short-term memory.
nLLM: large language model.
Time Trend of AI Application in HRM Interpretation
The application of AI to HRM interpretation has evolved continuously since 2013. Early pioneers such as Hoffman et al (2013) [] applied artificial neural networks to pharyngeal HRM classification, achieving 86.5%-94% accuracy with 335 swallows. During this initial period (2013-2016), researchers focused primarily on automating specific parameter measurements. Rohof et al (2014) [] created the AIMplot automated impedance manometry analysis system with excellent reproducibility (intraclass correlation coefficient: 0.94-0.95), and Jungheim et al (2016) [] applied machine learning to calculate upper esophageal sphincter restitution times.
A methodological shift occurred around 2018, when researchers began adopting deep learning approaches. Jell et al (2020) [] achieved 97.7% accuracy in automated swallow detection using supervised machine learning. The period from 2020 to 2022 saw widespread adoption of CNNs. Czako et al (2021) [] achieved 97% accuracy for integrated relaxation pressure (IRP) classification using an InceptionV3 (Google LLC) CNN trained on 2437 images. Kou et al (2021) [] developed both an unsupervised variational autoencoder analyzing 32,415 swallows from 2161 patients and a supervised long short-term memory (LSTM) network achieving 83% accuracy []. Wang et al (2021) [] implemented temporal modeling with bidirectional convolutional LSTM networks, reaching 91.32% overall accuracy. Romanian researchers, including Surdea-Blaga et al (2022) [] and Popa et al (2022) [], achieved 86% and 94% accuracy, respectively, for Chicago Classification automation.
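To make this class of models concrete, the following PyTorch sketch shows the general shape of a CNN classifier over an HRM study rendered as a 1-channel pressure topography image (sensor rows by time columns). The layer sizes, input dimensions, and 5-way output are illustrative assumptions, not the configuration of any published model.

```python
# Minimal sketch of a CNN over pressure topography, assuming hypothetical dims.
import torch
import torch.nn as nn

class HRMClassifier(nn.Module):
    def __init__(self, n_classes: int = 5):  # 5 coarse categories, illustrative
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size features for any input size
        )
        self.head = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).flatten(1))

# Hypothetical batch: 8 studies, 36 pressure channels x 600 time samples each.
x = torch.randn(8, 1, 36, 600)
print(HRMClassifier()(x).shape)  # torch.Size([8, 5]) class logits
```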
Recent studies from 2023 onward have explored increasingly sophisticated and diverse approaches. Zifan et al (2023) [] used shallow machine learning approaches, including logistic regression, random forests, and k-nearest neighbors, to analyze distension-contraction patterns in 60 patients with functional dysphagia, achieving 91.7% accuracy with logistic regression for proximal segments and 90.5% with random forests for distal segments. Rafieivand et al (2023) [] developed a fuzzy framework with graph neural network interpretation, achieving 78% single-swallow accuracy but 92.54% patient-level accuracy in 67 patients. Zifan et al (2024) [] further refined their approach using support vector machines to analyze distension-contraction plots, achieving an AUROC of 0.95 in 60 patients. Lankarani et al (2024) [] pioneered noninvasive acoustic analysis combined with AI, achieving 97% accuracy for IRP prediction in 43 patients. Most recently, studies have incorporated large language models: Popa et al (2024) [] integrated Gemini with deep learning, while Wu et al (2025) [] developed mixed attention ensemble approaches.
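As a sketch of the shallow-learning pipelines described above, the following scikit-learn example fits the four model families on synthetic per-patient features with 5-fold cross-validation; the feature matrix and labels are simulated stand-ins for engineered distension-contraction metrics.

```python
# Illustrative shallow-learning comparison on synthetic features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))    # 60 patients x 12 engineered features
y = rng.integers(0, 2, size=60)  # binary label, eg, dysphagia vs control

models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```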
Diagnostic Accuracy Across Studies
Overall diagnostic accuracies ranged from 78% to 97% across the 17 included studies. The highest accuracies were achieved for specific applications: IRP classification (97%) [], acoustic IRP prediction (97%) [], and swallow detection (97.7%) []. For Chicago Classification automation, accuracy varied from 86% to more than 93% [,]. Functional dysphagia studies demonstrated segment-specific performance differences, and Rafieivand et al [] highlighted the distinction between patient-level and swallow-level accuracy (92.54% vs 78%).
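The swallow-level versus patient-level gap reflects aggregation: combining several per-swallow predictions for one patient cancels some independent errors. The following Python sketch uses majority voting, a common aggregation rule, although individual studies may aggregate differently.

```python
# Illustrative majority-vote aggregation of per-swallow predictions.
from collections import Counter

def patient_prediction(swallow_preds: list[str]) -> str:
    """Return the most frequent swallow-level label as the patient-level call."""
    return Counter(swallow_preds).most_common(1)[0][0]

# Hypothetical patient: 10 swallows, only 7 classified correctly.
preds = ["achalasia"] * 7 + ["normal"] * 3
print(patient_prediction(preds))  # -> "achalasia" despite 30% swallow-level error
```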
Notably, none of the studies provided detailed performance metrics for individual Chicago Classification categories, such as achalasia subtypes or specific motility disorders. This absence of disorder-specific sensitivity and specificity data limits understanding of AI performance across the full spectrum of esophageal pathology and represents a critical gap for clinical implementation ().
Methodological Quality
QUADAS-2 assessment revealed variable methodological quality across the 17 included studies (). For the patient selection domain, no studies demonstrated low risk of bias, with 14 (82%) studies showing unclear risk primarily due to unreported sampling methods, and 3 (18%) studies showing high risk: Hoffman et al [] included only disordered cohorts without healthy controls, Jungheim et al [] tested only healthy volunteers limiting representativeness, and Lankarani et al [] had a small specialized cohort.
| Study and year | Patient selection | Index test | Reference standard | Flow and timing |
| Hoffman et al, 2013 [] | Hc: no healthy controls | Ld: clear prespecified threshold | L: expert manual standard method | L: complete data, no losses |
| Rohof et al, 2014 [] | Ue: convenience sample; representativeness unknown | U: calibrated on the same dataset, raising overfitting concerns | U: reproducibility focus, not diagnostic | L: complete data, no losses |
| Jungheim et al, 2016 [] | H: healthy only; not representative | U: small n=15, overfit concern | L: reference standard measurements (eg, UESf metrics) and experienced assessors | L: all volunteer data used |
| Jell et al, 2020 [] | U: sampling method not reported | L: supervised machine learning clear model | L: expert annotation | L: all data included |
| Czako et al, 2021 [] | U: sampling method not reported | L: InceptionV3 (Google LLC) with held-out test | L: expert Chicago‑consistent labels | U: 8 patients excluded, and completeness uncertain |
| Kou et al, 2021 [] | U: unclear enrollment method | L: variational autoencoder | H: no validated reference standard | L: all data included |
| Kou et al, 2022 [] | U: unclear enrollment method | L: separate test set; blinded automated inference | L: expert Chicago‑consistent labels | L: all data included |
| Wang et al, 2021 [] | U: unclear enrollment method | L: train, validation, or test separation | L: expert Chicago‑consistent labels | L: all data included |
| Kou et al, 2022 [] | U: unclear enrollment method | L: independent test cohort; rule-based aggregation of swallow‑level models | L: expert Chicago‑consistent labels | L: all data included |
| Surdea-Blaga et al, 2022 [] | U: no explicit enrollment stated | L: CNNsg with hold‑out evaluation | L: expert Chicago‑consistent labels | L: all data included |
| Popa et al, 2022 [] | U: spectrum bias | L: CNN with internal split | L: expert Chicago‑consistent labels | H: excluded indeterminate cases |
| Rafieivand et al, 2023 [] | U: single‑center, small n; sampling not described | L: composite (graph + fuzzy) model | L: expert Chicago‑consistent labels | L: all data included |
| Zifan et al, 2023 [] | U: unclear enrollment method | L: multiple machine learning models with cross-validation | U: details of reference adjudication limited | L: all data included |
| Zifan et al, 2024 [] | U: unclear enrollment method | L: multiple machine learning models with cross-validation | U: details of reference adjudication limited | L: all data included |
| Lankarani et al, 2024 [] | H: small, specialized cohort | L: artificial neural network model | L: expert Chicago‑consistent labels | L: all data included |
| Popa et al, 2024 [] | U: unclear enrollment method | L: LLMh‑assisted pipeline | L: expert Chicago‑consistent labels | L: all data included |
| Wu et al, 2025 [] | U: unclear enrollment method | L: ensemble with cross-validation or hold-out | L: expert Chicago‑consistent labels | L: all data included |
aQUADAS-2: Quality Assessment of Diagnostic Accuracy Studies-2.
bQuality Assessment of Diagnostic Accuracy Studies-2 evaluation of methodological quality and risk of bias for 17 included artificial intelligence studies in high-resolution manometry (2013-2025). Assessment evaluated 4 domains: (1) patient selection—risk of bias from inappropriate patient selection, exclusions, or case-control design; (2) index test—risk of bias from artificial intelligence model training or validation procedures and threshold determination; (3) reference standard—risk of bias from expert interpretation methods and blinding; and (4) flow and timing—risk of bias from incomplete data or variable intervals between index test and reference standard. Each domain was rated as low risk (L), high risk (H), or unclear risk (U) of bias. Applicability concerns assessed whether study design, patient population, artificial intelligence methodology, or reference standards differed from the review question. The table demonstrates predominant unclear risk in patient selection (14/17, 82% of studies) due to inadequate reporting of recruitment methods, while the index test domain showed the strongest methodological rigor (15/17, 88% low risk).
cH: high risk.
dL: low risk.
eU: unclear risk.
fUES: upper esophageal sphincter.
gCNN: convolutional neural network.
hLLM: large language model.
The index test domain showed the strongest methodological rigor, with 15 (88%) studies demonstrating low risk of bias through appropriate model training and validation separation. Only 2 (12%) studies showed unclear risk: Rohof et al [] due to calibration on the same dataset raising overfitting concerns, and Jungheim et al [] due to the small sample size (n=15), creating uncertainty in algorithm performance.
For the reference standard domain, 13 (76%) studies had a low risk of bias, using expert-determined Chicago Classification labels. Further, 3 (18%) studies showed unclear risk: Rohof et al [] focused on automated metric agreement rather than diagnostic ground truth, and both studies by Zifan et al [,] had limited details on reference adjudication. One study, by Kou et al [], showed high risk because it lacked a validated reference standard for unsupervised clusters.
Flow and timing assessment revealed low risk in 15 (88%) studies, with all patient data included in analyses. One study (Czako et al []) showed unclear risk due to the exclusion of 8 patients with probe-placement failure, and another (Popa et al []) demonstrated high risk by excluding indeterminate cases from analysis, introducing potential spectrum bias.
The predominance of unclear risk in patient selection highlights a systematic reporting deficiency across the literature, with most studies failing to document recruitment and enrollment methods adequately. This pattern, combined with the complete absence of external validation noted elsewhere, raises concerns about the generalizability and real-world applicability of these AI systems.
Secondary Findings
None of the 17 included studies performed external validation using datasets from different institutions or periods. All studies relied on internal validation methods, including train-test splits, k-fold cross-validation, or other internal validation approaches. This complete absence of external validation represents a critical limitation in assessing the generalizability of AI models for HRM interpretation. Studies using k-fold cross-validation [,,,,] reported more conservative performance estimates compared to simple train-test splits, suggesting potential overfitting in single-split validation approaches.
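The following Python sketch illustrates the mechanism on synthetic pure-noise data, where true accuracy is approximately 50%: a single 20% hold-out yields a noisy estimate that can land well above chance, whereas the 5-fold average typically sits closer to it. The data and model are illustrative only.

```python
# Single split vs k-fold cross-validation on label-free noise.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.normal(size=(80, 10))
y = rng.integers(0, 2, size=80)  # labels independent of X: true accuracy ~0.5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)
single = SVC().fit(X_tr, y_tr).score(X_te, y_te)   # one (possibly lucky) split
kfold = cross_val_score(SVC(), X, y, cv=5)         # averaged over 5 folds

print(f"single split: {single:.2f}")               # may land well above 0.5
print(f"5-fold CV: {kfold.mean():.2f} +/- {kfold.std():.2f}")
```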
Discussion
Principal Findings
The systematic synthesis of current evidence reveals that AI applications in HRM have demonstrated strong technical performance, with diagnostic accuracies ranging from 78% to 97%, while facing substantial translational challenges. The evolution from traditional machine learning algorithms (86.5%-94% accuracy) to deep learning architectures capable of 97% accuracy for specific tasks represents significant technological progress [,,]. These advances occur within the broader context of AI transformation in gastroenterology, where similar trajectories have been observed in colonoscopy, capsule endoscopy, and inflammatory bowel disease assessment, suggesting that the integration of AI into clinical gastroenterology practice is inevitable rather than speculative [,].
The innovation of AI in HRM extends beyond mere automation. These systems represent a major change in how we approach esophageal motility diagnostics [-], offering solutions to important clinical needs: the global shortage of motility experts, the need for rapid and consistent interpretation [], and the potential for telemedicine integration to serve underserved areas [,].
The diagnostic accuracy achieved by current AI systems, particularly for IRP classification and automated Chicago Classification, addresses a fundamental limitation of HRM interpretation: interobserver variability. AI systems maintain consistent diagnostic criteria application while human experts demonstrate significant intraobserver variability on repeated assessments. This consistency could enable more reliable phenotyping of esophageal motility disorders, facilitating precision medicine approaches that move beyond categorical diagnoses to individualized pathophysiological assessment. The superior performance of AI in quantitative parameter calculation eliminates measurement variability that has plagued HRM interpretation since its inception [].
These accuracy levels have important implications for clinical practice. With health care systems facing increasing pressure to reduce costs while improving outcomes, AI-enabled HRM interpretation could decrease repeat procedures and reduce unnecessary testing costs [,]. Moreover, the consistent application of diagnostic criteria could reduce misdiagnosis-related treatment failures that currently affect a considerable number of patients with esophageal motility disorders [,].
However, the apparent success of AI systems must be contextualized within significant methodological limitations identified through quality assessment. Most critically, no studies demonstrated low risk of bias in patient selection, with 82% (14/17) showing unclear risk due to unreported sampling methods and 18% (n=3) showing high risk due to biased cohort selection [,,]. This systematic deficiency in documenting recruitment and enrollment methods raises fundamental questions about the representativeness of training datasets. The complete absence of external validation across all 17 studies compounds these concerns about generalizability. Internal validation consistently overestimates model performance, and the lack of testing on datasets from different institutions, HRM systems, or patient populations means we have no evidence of real-world performance [].
The complete absence of prospective clinical trials represents the most critical barrier to clinical translation. While retrospective studies demonstrate technical feasibility with accuracies of 78%-97%, these controlled environments fail to capture the complexities of real-world clinical practice. Prospective trials are essential to evaluate (1) how AI systems perform with real-time data acquisition variability, (2) whether AI recommendations alter clinical decision-making, (3) patient outcomes following AI-guided treatment, and (4) integration challenges within existing clinical workflows. Without such evidence, even the most accurate AI models remain research tools rather than clinical instruments [-].
The evolution through distinct phases of AI development in HRM mirrors broader trends in medical AI but also reveals unique challenges specific to esophageal motility assessment. The transition from traditional machine learning to deep learning approaches yielded substantial performance improvements, yet the “black box” nature of deep learning models poses particular challenges in a field where pathophysiological understanding drives therapeutic decision-making []. Clinicians require not just diagnostic labels but mechanistic insights that inform treatment selection between medical therapy, endoscopic intervention, or surgical management. The development of explainable AI models that provide interpretable features and confidence metrics represents a critical priority for clinical acceptance []. Recent advances in attention mechanisms and gradient-based visualization techniques, as demonstrated in the Popa et al [] study using LIME (Local Interpretable Model-Agnostic Explanations), offer promising approaches for making AI decision-making transparent and clinically meaningful.
The integration of multiple diagnostic modalities through AI platforms addresses a longstanding limitation of isolated HRM interpretation. The combination of manometric, impedance, and complementary data provides a more comprehensive assessment of esophageal function than any single modality alone []. AI systems excel at synthesizing these complex, multidimensional datasets, potentially revealing pathophysiological patterns invisible to conventional analysis. The Zifan et al (2023 [] and 2024 []) work on distension-contraction plots illustrates how AI can extract diagnostic value from data presentations that challenge human interpretation. This capability becomes particularly relevant with the Chicago Classification version 4.0 emphasis on provocative testing and positional changes, which generate substantially more data requiring integration and interpretation [].
The absence of disorder-specific performance metrics across all 17 studies severely limits clinical applicability. While overall accuracy appears promising (86%-97%), clinicians need to know how AI performs for specific conditions: distinguishing achalasia subtypes (critical for treatment selection), detecting subtle ineffective esophageal motility (often missed by novices), or identifying rare disorders such as jackhammer esophagus. A system with 95% overall accuracy but poor performance in type II achalasia, for instance, could lead to inappropriate treatment recommendations. Future studies must report sensitivity and specificity for each Chicago Classification category to enable informed clinical decision-making.
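Such disorder-specific reporting is straightforward to derive once a multiclass confusion matrix is available. The following Python sketch computes one-vs-rest sensitivity and specificity per category from an entirely hypothetical matrix; the category names and counts are illustrative.

```python
# Per-category sensitivity and specificity from a hypothetical confusion matrix.
import numpy as np

classes = ["normal", "type II achalasia", "IEM", "distal spasm"]  # illustrative
cm = np.array([[50, 2, 3, 0],    # rows: expert label; columns: AI prediction
               [1, 18, 0, 1],
               [6, 0, 24, 0],
               [0, 2, 1, 7]])

for i, name in enumerate(classes):
    tp = cm[i, i]
    fn = cm[i].sum() - tp        # missed cases of this disorder
    fp = cm[:, i].sum() - tp     # other disorders mislabeled as this one
    tn = cm.sum() - tp - fn - fp
    print(f"{name}: sensitivity {tp/(tp+fn):.2f}, specificity {tn/(tn+fp):.2f}")
```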
Implementation barriers identified across studies reveal a complex interplay of technical, regulatory, clinical, and economic factors. The incompatibility with existing HRM systems reflects the proprietary nature of medical device software and the lack of interoperability standards. The regulatory uncertainty surrounding AI medical devices requires proactive engagement between developers, clinicians, and regulatory agencies to establish appropriate evaluation frameworks [,]. Despite these barriers, the economic rationale for AI implementation is strong. High-volume centers could achieve cost-effectiveness through improved workflow efficiency and reduced need for expert consultation [,,], though specific economic analyses are needed to quantify these benefits. The lack of specific reimbursement codes for AI-assisted interpretation creates financial uncertainty that discourages adoption []. The potential for AI to enable task-shifting from specialists to general gastroenterologists could address workforce shortages and improve access to motility assessment, particularly in underserved areas.
The ethical implications of AI implementation in HRM diagnostic practice deserve careful consideration []. The potential for algorithmic bias, particularly affecting populations underrepresented in training datasets, could exacerbate existing health care disparities. The predominance of studies from North American, European, and select Asian centers raises concerns about applicability to African, Latin American, and other underrepresented populations with different disease phenotypes and genetic backgrounds []. Development of quality assurance programs that monitor AI performance and identify edge cases requiring human review will be essential for maintaining patient safety.
Moving from laboratory validation to clinical implementation requires addressing multiple translational gaps simultaneously. First, prospective multicenter trials must demonstrate that AI systems maintain performance across diverse patient populations, HRM equipment, and clinical settings. Second, health economic analyses must quantify whether efficiency gains justify implementation costs—a critical requirement for hospital administrator buy-in and insurance coverage. Third, regulatory pathways need clarification: Should AI-HRM systems be classified as clinical decision support tools or diagnostic devices? Each classification carries different validation requirements and liability considerations. Finally, implementation science research must address workflow integration, user training requirements, and change management strategies to ensure successful adoption [].
Future priorities must focus on multicenter validation studies, development of explainable AI models, integration with evolving diagnostic frameworks, and systematic addressing of regulatory and economic barriers. The ultimate success of AI in HRM will depend not on technological sophistication alone but on thoughtful integration that preserves clinical judgment while enhancing diagnostic accuracy and efficiency. To achieve clinical translation, the field must transition from technical validation to clinical validation through (1) prospective trials comparing AI-assisted versus standard interpretation on patient outcomes, (2) disorder-specific performance benchmarking across all Chicago Classification categories, (3) cost-effectiveness analyses demonstrating economic value, (4) regulatory sandbox programs allowing controlled real-world testing, and (5) implementation science studies optimizing integration strategies. Until these translational requirements are met, AI in HRM will remain a promising technology awaiting clinical realization.
Study Limitations
This systematic review has several limitations that should be considered when interpreting the findings. First, the heterogeneity in AI methodologies, patient populations, and outcome definitions precluded meta-analysis, limiting our ability to provide pooled estimates of diagnostic accuracy. Second, we excluded non-English language publications, potentially missing relevant studies from non–English speaking countries. Third, the absence of standardized reporting guidelines for AI studies in HRM made quality assessment challenging, particularly regarding technical aspects of model development. Fourth, publication bias could not be formally assessed due to the diversity of study designs. Fifth, the lack of clinical outcome data across all studies prevented assessment of the real-world impact of AI implementation on patient care, treatment decisions, and health care costs. Finally, critical limitations include the complete absence of low-risk patient selection across all studies, the lack of disorder-specific performance metrics for individual Chicago Classification categories, the absence of prospective clinical trials, no cost-effectiveness analyses, and insufficient direct comparisons between AI and human interpreters using standardized metrics. These gaps collectively limit our ability to assess the true clinical utility and implementation readiness of AI systems in HRM interpretation.
Conclusions
This systematic review provides comprehensive evidence that AI applications in HRM have achieved remarkable technical capabilities while facing substantial challenges in clinical translation. The diagnostic accuracies of 78%-97% demonstrate the potential for AI to standardize and enhance HRM interpretation. However, the complete absence of external validation, systematic deficiencies in patient selection documentation, and lack of clinical outcome studies highlight the critical gap between technological capability and clinical utility. Additionally, the limited reporting of patient demographics across included studies—reflecting the methodological focus of AI development papers—represents an ongoing challenge for assessing generalizability across diverse populations. Future AI validation studies should systematically report demographic characteristics, including age, sex, race or ethnicity, and geographic location, to enable evaluation of algorithmic performance across patient subgroups and identify potential disparities in diagnostic accuracy that could affect equitable clinical implementation.
All data are available upon reasonable request to the corresponding author.
This research was supported by the Bio&Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT; No. RS-2023-00223501).
None declared.
Edited by S Brini; submitted 03.Oct.2025; peer-reviewed by X Liang, PJ Kahrilas, S Ho Choi; comments to author 23.Oct.2025; revised version received 06.Nov.2025; accepted 06.Nov.2025; published 27.Nov.2025.
©Eun Jeong Gong, Chang Seok Bang, Jae Jun Lee, Gwang Ho Baik. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 27.Nov.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
