To accelerate the decarbonization of China’s road transport sector by supporting the faster adoption of electric vehicles (EVs) and enhancing the associated infrastructure.
DESCRIPTION
The Project entails AIIB providing an A loan of up to United States Dollar (USD) 125 million equivalent in Chinese Yuan (CNY), complemented by a C loan of up to USD 125 million equivalent in CNY to be mobilized by AIIB on a best-effort basis, to Ping An International Financial Leasing Co., Ltd. (PAIFL) to support its financial leasing services for urban transport electrification in China.
The loan proceeds will support eligible subprojects through lease financing, targeting underserved segments of China’s EV ecosystem. Approximately 80 percent of the proceeds will be allocated to electric light-duty and heavy-duty trucks, as well as electric passenger vehicles in tier three and tier four cities. The remaining 20 percent will be dedicated to charging infrastructure, with a focus on charging stations for electric heavy-duty trucks, charging networks along highways and major roads, and public fast chargers across China. AIIB financing will follow PAIFL’s Sustainable Development Financing Framework, which is aligned with the Green Loan Principles and Social Loan Principles of the Loan Market Association.
ENVIRONMENTAL AND SOCIAL INFORMATION
Applicable Policy and Categorization: AIIB’s Environmental and Social Framework (ESF), including the Environmental and Social Standards (ESS) and the Environmental and Social Exclusion List is applicable to this Project. The Project is placed in Category FI and is expected to have limited adverse environmental and social (ES) impacts. Subprojects classified as Category A or Higher Risk Activities as per AIIB’s ESF will be excluded from this Project.
Environment and Social Instruments: To manage ES impacts in accordance with applicable national laws and regulations and AIIB’s ESF, PAIFL has established an Environmental and Social Management System (ESMS), which will be enhanced to align with AIIB’s ESF. PAIFL’s enhanced ESMS will exclude all Higher-Risk Activities, consistent with the ESF. Further, under the ESMS, clients or subprojects with significant non-compliance in environmental, labor, health, and safety performance will not be eligible for lease financing from PAIFL.
Environmental and Social Aspects: The operation of EVs and charging infrastructure is considered clean from an environmental perspective. However, a key environmental concern is the disposal of batteries and e-waste. In the context of lease financing by PAIFL to retail and commercial customers, responsibility for appropriate disposal of discarded batteries lies with EV manufacturers, as required under national regulations: EV manufacturers are legally obligated to take responsibility for end-of-life vehicle batteries and to set up systems for collection, storage, and transfer to recycling firms. In addition, the disposal of EV chargers or charging piles must be carried out by licensed enterprises that meet specific technical, environmental protection, and occupational health and safety (OHS) standards. During decommissioning, the charging station operator engages licensed enterprises to undertake dismantling and material recovery. Social risks are expected to be limited to consumer protection in the retail portfolio, and to labor and working conditions, gender, and health and safety risks in the business portfolio. Land ownership and/or land lease agreements in the business portfolio are verified by the business department. To align with AIIB’s strategic focus on inclusive and sustainable development, the Project has supported PAIFL in developing a gender action plan (GAP) aimed at enhancing gender equality at both the operational and institutional levels.
Occupational Health and Safety (OHS), Labor and Employment Conditions: OHS risks are expected to be limited to vehicle safety and fire hazards resulting from manufacturing defects and poor maintenance. PAIFL’s customers (individual users and commercial entities) are responsible for ensuring timely maintenance of their EVs. Customers receive warranty documentation and guidance directly from auto dealerships, which outline safe usage practices and maintenance expectations. For charging infrastructure, the PAIFL team visits locations to review suitability. However, maintaining safe working conditions and providing fire extinguishers are the responsibility of PAIFL’s customers. PAIFL urges customers to maintain safe working conditions by implementing necessary safety measures.
Stakeholder Engagement, Consultation and Information Disclosure: PAIFL identifies investors, regulators, and customers as its stakeholders and regularly engages and consults with them to improve its ES risk management practices. The enhanced ESMS will also address relevant stakeholder engagement activities. PAIFL has agreed to disclose an overview of the enhanced ESMS on its website in a timely manner.
Project Grievance Redress Mechanism (GRM) and Monitoring and Reporting Arrangements: PAIFL has an established external communications mechanism that serves as the project-level GRM to address ES concerns of individuals, enterprises, and other stakeholders. Ping An Insurance (Group) Company of China, Ltd. (the Group) has set up a whistleblowing hotline and email address to receive non-consumer customer service-related complaints from internal and external parties. In addition to these channels, affected persons may lodge a grievance with the local government hotline, 12345. PAIFL and the Group provide an online platform for employees to lodge grievances and provide feedback. Information on the established GRMs and AIIB’s Project-affected People’s Mechanism (PPM) will be disclosed in a timely and appropriate manner. PAIFL will monitor and report material incidents, accidents, negative public opinion, and lawsuits, and will submit to AIIB annual ESMS performance reports using an agreed-upon template.
The diagnosis and classification of esophageal motility disorders have undergone evolution since the introduction of high-resolution esophageal manometry (HRM) in the early 2000s []. This technological advancement, characterized by closely spaced pressure sensors providing spatiotemporal pressure topography displays, has altered our understanding of esophageal physiology and pathophysiology [,]. The subsequent development and iterative refinement of the Chicago Classification, now in its fourth version, has established a standardized framework for HRM interpretation that has become the global standard for esophageal motility assessment [,]. Despite these advances, significant challenges persist in clinical practice, including substantial interobserver variability even among expert interpreters, time-intensive analysis requirements, and the need for extensive training to achieve competency in HRM interpretation [,].
In recent years, interest in applying artificial intelligence (AI) to medical data has surged [,]. AI in medicine encompasses methods ranging from classical statistical models to advanced deep learning and even generative models. These approaches can rapidly analyze large datasets and automatically extract complex features, making them well-suited to assist in health care data interpretation []. Gastroenterology has seen rapid exploration of AI for endoscopic image analysis, pathology slide interpretation, and other tasks []. Recent comprehensive reviews have demonstrated AI’s expanding role across gastroenterological applications, from polyp detection to diagnostic decision support systems, with particular promise in image-based diagnostics []. Large language models have also emerged as potential tools for clinical documentation and patient education in gastroenterology, though their role in technical interpretation remains under investigation []. Within the field of neurogastroenterology and motility, AI technologies offer particularly compelling advantages given the pattern-based nature of HRM interpretation and the quantitative parameters inherent to manometric analysis. Machine learning algorithms excel at pattern recognition tasks, potentially surpassing human capabilities in identifying subtle abnormalities and maintaining consistent diagnostic criteria application [,]. Furthermore, AI systems can process vast quantities of data instantaneously, enabling real-time interpretation that could transform clinical workflow efficiency [,]. Recent reviews have examined AI applications in general gastroenterology [-]. However, a focused analysis of HRM-specific applications remains lacking.
The evolution of AI methodologies in medical imaging and signal processing has particular relevance to HRM analysis []. Early applications relied on traditional machine learning approaches such as support vector machines and random forests, which required manual feature extraction and engineering [,]. These methods, while showing promise, were limited by their dependence on predefined features and inability to capture complex spatiotemporal patterns inherent to esophageal pressure topography. The advent of deep learning, particularly convolutional neural networks (CNNs), has revolutionized medical image analysis by enabling automatic feature learning directly from raw data [,]. For HRM, this capability allows AI systems to identify novel patterns and relationships that may not be apparent to human observers or captured by traditional metrics. Recent systematic assessments of AI tools in esophageal dysmotility diagnosis have documented the progression from basic automation of landmark identification to sophisticated deep learning models capable of comprehensive Chicago Classification diagnosis []. Contemporary applications now encompass not only HRM but also impedance-pH monitoring, demonstrating the broadening scope of AI in esophageal diagnostics [].
Recent technological advances have further expanded the potential applications of AI in esophageal motility assessment. The integration of complementary diagnostic modalities, such as Functional Luminal Imaging Probe (FLIP) technology and high-resolution impedance manometry, provides multidimensional data that can enhance diagnostic accuracy []. AI platforms have demonstrated 89% accuracy in automated interpretation of FLIP Panometry studies, validating the feasibility of automated esophageal motility classification during endoscopy []. AI systems are uniquely positioned to synthesize these complex, multimodal datasets, potentially revealing pathophysiological insights that single-modality assessment cannot provide []. Moreover, the development of cloud-based computing infrastructure and edge computing capabilities enables the deployment of sophisticated AI models in diverse clinical settings, from tertiary referral centers to community practices [,]. The emergence of generative artificial intelligence and large language model–assisted development has further accelerated model creation, with recent studies demonstrating the successful implementation of Gemini-assisted (Google LLC) deep learning for automated HRM diagnosis, achieving high diagnostic precision across multiple motility disorder categories [].
Despite these promising developments, no comprehensive systematic review has evaluated the full spectrum of AI applications in HRM interpretation or assessed their methodological quality. Therefore, this systematic review aims to (1) systematically evaluate current AI applications in HRM interpretation, (2) assess diagnostic accuracy across different AI methodologies, (3) evaluate methodological quality, and (4) identify barriers to clinical implementation and future research priorities.
Methods
Study Design
The protocol was registered in PROSPERO (International Prospective Register of Systematic Reviews; CRD420251154237) before initiating the search. This systematic review followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 reporting guidelines [] (), the PRISMA-Diagnostic Test Accuracy () checklist [], and the PRISMA-S (an extension to the PRISMA statement for reporting literature searches in systematic reviews; ) checklist [].
Database and Searching Strategy
We searched PubMed/MEDLINE, Embase, Cochrane Library, and Web of Science through September 2025 for studies using AI or machine learning to interpret esophageal HRM. Search strategies incorporated keywords and indexed terms, including (“artificial intelligence” OR “machine learning” OR “deep learning” OR “neural network” OR “computer-aided diagnosis”) AND (“high-resolution manometry” OR “HRM” OR “esophageal manometry” OR “esophageal motility” OR “Chicago Classification”; ). Gray literature sources were searched to reduce publication bias.
Textbox 1. Searching strategy to find the relevant papers. Comprehensive search strategies were used to identify studies on artificial intelligence (AI) applications in HRM across 4 databases. Search strategies used MeSH (Medical Subject Headings) and Emtree keywords searched as free-text terms in titles and abstracts covering: (1) AI/machine learning concepts, (2) esophageal motility disorders and gastrointestinal motility, and (3) HRM/esophageal physiologic testing. Optimizing search sensitivity: we empirically tested both approaches (eg, “Gastrointestinal motility”[tiab] vs “Gastrointestinal motility”[Mesh]) and found that searching MeSH keywords as free-text in (title and abstract [tiab]) yielded more comprehensive results. This captures papers using these established terms that may not yet be formally indexed with the corresponding MeSH headings, or where these concepts appear in titles or abstracts but are not assigned as subject headings. Searches were conducted from database inception through September 24, 2025 (initial search) and updated October 27, 2025, and verified for reproducibility on November 6, 2025, with no language restrictions. The table displays exact search syntax for MEDLINE via PubMed, Embase via OVID, Cochrane Library via Wiley, and Web of Science Core Collection, along with the number of records retrieved from each source (lang: language; ab.ti.kw: abstract, title, and keyword; and ab: abstract).
Database: MEDLINE (through PubMed)
#1 “artificial intelligence”[tiab] OR “machine learning”[tiab] OR “deep learning”[tiab] OR “neural network”[tiab] OR “computer-aided diagnosis”[tiab]: 345034
#2 “high-resolution manometry”[tiab] OR “HRM”[tiab] OR “esophageal manometry”[tiab] OR “esophageal motility”[tiab] OR “Chicago Classification”[tiab] OR “Gastrointestinal motility”[tiab]: 15092
#3 #1 AND #2: 116
#4 #3 AND English[Lang]: 114
Database: Embase-OVID
#1 ‘artificial intelligence’:ab,ti,kw OR ‘machine learning’:ab,ti,kw OR ‘deep learning’:ab,ti,kw OR ‘neural network’:ab,ti,kw OR ‘computer-aided diagnosis’:ab,ti,kw: 173049
#2 ‘high-resolution manometry’:ab,ti,kw OR ‘HRM’:ab,ti,kw OR ‘esophageal manometry’:ab,ti,kw OR ‘esophageal motility’:ab,ti,kw OR ‘Chicago Classification’:ab,ti,kw OR ‘Gastrointestinal motility’:ab,ti,kw: 38254
#3 #1 AND #2: 73
#4 #3 AND ([article]/lim OR [article in press]/lim OR [review]/lim) AND [English]/lim: 39
Database: Cochrane Library (through Wiley)
#1 ‘artificial intelligence’:ab,ti,kw OR ‘machine learning’:ab,ti,kw OR ‘deep learning’:ab,ti,kw OR ‘neural network’:ab,ti,kw OR ‘computer-aided diagnosis’:ab,ti,kw: 11482
#2 ‘high-resolution manometry’:ab,ti,kw OR ‘HRM’:ab,ti,kw OR ‘esophageal manometry’:ab,ti,kw OR ‘esophageal motility’:ab,ti,kw OR ‘Chicago Classification’:ab,ti,kw OR ‘Gastrointestinal motility’:ab,ti,kw: 4636
#3 #1 AND #2: 36
Database: Web of Science
#1 ab=(“artificial intelligence” OR “machine learning” OR “deep learning” OR “neural network” OR “computer-aided diagnosis”): 645285
#2 ab=(“high-resolution manometry” OR “HRM” OR “esophageal manometry” OR “esophageal motility” OR “Chicago Classification” OR “Gastrointestinal motility”): 9769
#3 #1 AND #2: 138
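The Boolean structure shared by all four database strategies above (an OR-joined AI concept group ANDed with an OR-joined HRM/motility concept group) can be sketched programmatically. This is an illustrative sketch, not part of the review’s methods: the helper name `tiab_block` is hypothetical, and only the PubMed `[tiab]` field-tag syntax from Textbox 1 is reproduced.

```python
# Hypothetical sketch: assembling the two-concept Boolean PubMed query from
# Textbox 1. The term lists mirror the reported strategy; tiab_block is an
# illustrative helper, not a tool used by the authors.

AI_TERMS = [
    "artificial intelligence", "machine learning", "deep learning",
    "neural network", "computer-aided diagnosis",
]
HRM_TERMS = [
    "high-resolution manometry", "HRM", "esophageal manometry",
    "esophageal motility", "Chicago Classification", "Gastrointestinal motility",
]

def tiab_block(terms):
    """OR-join one concept group, restricting each phrase to title/abstract."""
    return "(" + " OR ".join(f'"{t}"[tiab]' for t in terms) + ")"

# Combine the two concept groups with AND, as in search lines #1-#3 above.
query = tiab_block(AI_TERMS) + " AND " + tiab_block(HRM_TERMS)
print(query)
```

The same two term groups are then re-expressed in each database’s native field syntax (`:ab,ti,kw` for Embase/Cochrane, `ab=` for Web of Science).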
Additional information sources were systematically searched to identify gray literature and unpublished studies. We searched the medRxiv preprint server [] using the same search terms (via the advanced search tab) to identify studies not yet formally published. ClinicalTrials.gov [] was searched to identify ongoing or completed trials that may not have been published. Reference lists of all included studies and relevant systematic reviews were manually screened to identify additional eligible studies. No citation searches were performed using citation databases.
The search strategy was peer reviewed by information scientists who have extensive expertise in systematic review methodology and database search strategies.
The results from all database searches were exported and deduplicated using EndNote X20 (Clarivate Analytics, 2020). Automated deduplication was performed using EndNote’s duplicate identification algorithm, followed by manual review to identify and remove any remaining duplicates based on title, author, year, and journal. Two reviewers (CSB and EJG) independently screened studies, and discrepancies were resolved by discussion ().
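The deduplication step described above (automated matching followed by manual review of title, author, year, and journal) can be illustrated with a minimal sketch. This is not EndNote’s actual algorithm — it simply shows the idea of keying records on normalized title/author/year and keeping the first occurrence; the record fields and `norm` helper are assumptions for illustration.

```python
# Illustrative deduplication sketch (not EndNote's algorithm): drop records
# whose normalized (title, first author, year) key has already been seen.
import re

def norm(text):
    """Lowercase and strip punctuation so near-identical strings match."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def deduplicate(records):
    seen, unique = set(), []
    for rec in records:
        key = (norm(rec["title"]), norm(rec["author"]), rec["year"])
        if key not in seen:          # keep only the first occurrence
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"title": "AI for HRM interpretation.", "author": "Kou W", "year": 2021},
    {"title": "AI for HRM Interpretation", "author": "Kou W", "year": 2021},  # duplicate
    {"title": "Deep learning in manometry", "author": "Wang Z", "year": 2021},
]
print(len(deduplicate(records)))  # → 2
```

Survivors of the automated pass would then go to the manual review described above.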
Inclusion and Exclusion Criteria
We included both prospective and retrospective studies that applied an AI-based algorithm to HRM measurements for diagnosing or classifying esophageal motility disorders (eg, achalasia subtypes, esophagogastric junction outflow obstruction, distal esophageal spasm, hypercontractile esophagus, ineffective motility, etc). We excluded nonhuman studies, conference abstracts without full text, studies focusing on anorectal manometry, and studies on other modalities (such as FLIP or pH-impedance) unless they directly involved HRM data integration.
The detailed inclusion criteria are as follows: (1) original research applying AI, machine learning, or deep learning techniques to HRM data; (2) evaluation of diagnostic accuracy, classification performance, or clinical outcomes; (3) inclusion of human participants or HRM studies; and (4) provision of quantitative performance metrics. The exclusion criteria are as follows: (1) review papers, editorials, or case reports without original data; (2) studies that used only conventional manometry without high-resolution capabilities; (3) studies that applied AI exclusively to other esophageal diagnostic modalities without HRM integration; and (4) studies lacking sufficient methodological detail for quality assessment.
Data Extraction
Two independent reviewers (CSB and EJG) systematically extracted data using a standardized, prepiloted form. Extracted variables included: study characteristics (authors, year, country, and design), patient demographics (sample size, age, and sex distribution), HRM technical specifications (equipment, protocol, and Chicago Classification version), AI methodology (algorithm type, architecture, and training approach), dataset characteristics (size, split ratios, and validation method), performance metrics (sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve [AUROC]), clinical outcomes when available, and implementation considerations. Discrepancies were resolved through consensus or third reviewer (GHB) arbitration. Authors were contacted for missing or unclear data, with a maximum of 3 contact attempts over 4 weeks.
Study Outcomes
Primary outcome measures included diagnostic accuracy metrics for AI systems compared to expert interpretation as the reference standard. Sensitivity, specificity, positive and negative predictive values, and accuracy were calculated when raw data were available. For studies reporting only AUROC values, these were extracted directly. Meta-analysis was planned if sufficient homogeneity existed across studies; however, due to significant heterogeneity in AI approaches, patient populations, and outcome definitions, a narrative synthesis was performed.
Secondary outcomes included: external validation performance compared to internal validation, processing time for automated interpretation, comparison with trainee interpretation, interrater reliability metrics, and clinical outcomes when reported. Subgroup analyses examined performance differences by: AI methodology (traditional machine learning vs deep learning), disorder category according to the Chicago Classification, validation approach (internal vs external), and year of publication to assess temporal trends.
Quality Assessment
We assessed the methodological quality and risk of bias of each included study using the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies-2) tool. This tool evaluates risk of bias in 4 domains: patient selection, index test, reference standard, and flow and timing. For each domain, we judged the risk of bias as low, high, or unclear based on the information reported in the study, and we also noted any concerns regarding applicability to the review question []. Two reviewers (CSB and EJG) performed the QUADAS-2 assessments independently, with disagreements resolved through discussion.
Results
Study Selection and Inclusion
The literature search yielded 411 records from databases and 1 additional record from manual screening. After duplicates were removed, 175 studies remained. Following title and abstract screening, 100 full-text papers were assessed for eligibility; of these, 83 were excluded. Ultimately, 17 studies met the inclusion criteria (Figure 1).
Figure 1 is the PRISMA flow diagram for the systematic review of AI applications in HRM (2013-2025). The literature search across PubMed/MEDLINE, Embase, Cochrane Library, and Web of Science (database inception through November 2025) identified studies applying AI, machine learning, or deep learning techniques to interpret HRM for the diagnosis of esophageal motility disorders. The diagram illustrates the screening process.
Figure 1. Study selection flow.
Study Characteristics
Studies were published between 2013 and 2025, with 82% (14/17) published in 2020 or later, reflecting the recent emergence of this field. The studies with clearly documented patient numbers included: Hoffman et al [] with 30 participants with dysphagia, Rohof et al [] with 50 patients with gastroesophageal reflux disease, Jungheim et al [] with 15 healthy volunteers, Kou et al [] with 2161 HRM cases, Kou et al [] with 1741 HRM cases, Wang et al [] with 229 esophageal motility cases from 229 individuals, Surdea-Blaga et al [] with 192 HRM studies (patients), Rafieivand et al [] with 67 patients, Zifan et al [] with 60 patients, and Lankarani et al [] with 43 patients. The total confirmed patient count from studies with explicit numbers was at least 4588, though several studies did not report exact patient numbers. Study designs were predominantly retrospective cohort studies (n=15, 88%), with 2 methodological development studies (n=2, 12%; Rohof et al [] and Kou et al []). No prospective validation studies were identified. All studies used the Chicago Classification as the reference standard, with varying versions used across studies ().
Table 1. Summary of the included studiesa.
Study and year
Country
Sample size
AIb method
Study aims
Performance
Validation
Chicago classification
Hoffman et al, 2013 []
United States
30 participants
335 swallows
Dysphagia
19 men and 11 women
mean age: 68.0 (SD 11.8) years
Multilayer perceptron artificial neural network
Pharyngeal analysis
7 MBSImPc components
Accuracy: 91%
AUROCd: 0.90-0.98
Internal validation only
Unspecified
Rohof et al, 2014 []
Australia
50 patients
GERDe
33 men and 17 women
Mean age 52 (SD 1.9) years
Linear regression
AIMplotf algorithm
ICCsh 0.95 and 0.94 (intrarater and interrater, respectively)
Inter- and intrarater
v2.0
Jungheim et al, 2016 []
Germany
15 healthy volunteers
8 men and 7 women
Mean 34.9 years
Logistic regression and sequence labeling
Automated calculation of UESi contraction restitution time
Expert-comparable values (experts: restitution times of 11.16 (SD 5.7) s and 10.04 (SD 5.74) s; model-generated values: 8.91 (SD 3.71) s to 10.87 (SD 4.68) s)
Expert comparison
v2.0
Jell et al, 2020 []
Germany
15 HRMj for training
25 HRM for validation
Supervised machine learning for automated swallow detection and classification
Automated swallow detection or classification
Accuracy: 97.7%
Sensitivity: 89.7%
Specificity: 83.2%
Internal validation only
Unspecified
Czako et al, 2021 []
Romania
InceptionV3 (Google LLC) CNNk for transfer learning
For probe positioning
IRPl classification
Accuracy: 97%
F1-score >84%
Internal validation only
v2.0
Kou et al, 2021 []
United States
2161 HRM studies
32,415 swallows
Variational autoencoder (unsupervised)
Pattern clustering
Motility phenotypes
3 distinct clusters in HRM amenable to machine learning classification (linear discriminant)
Internal validation only
v2.0
Kou et al, 2022 []
United States
1741 HRM studies
26,115 swallows
Swallow type classification
Peristalsis classification
Swallow type accuracy: 83%
Classification of peristalsis accuracy: 88%
Internal validation only
v3.0
Wang et al, 2021 []
China
229 esophageal motility cases
229 individuals
3D CNN (Conv3D; Google LLC)
Bidirectional convolutional LSTM (BiConvLSTM; Google LLC)
Multiple models (support vector machines, random forest, k-nearest neighbors, and logistic regression)
Automatic classification of functional dysphagia
Accuracy: 91.7%
Precision: 92.86%
Logistic regression produced the best results
Internal validation only
v4.0
Zifan et al, 2024 []
United States
30 healthy participants
30 patients with functional dysphagia
Ensemble methods (gradient boost, support vector machines, and logit boost)
Functional dysphagia versus controls classification
Internal validation only
v4.0
Lankarani et al, 2024 []
Iran
43 dysphagia patients (suspicious achalasia)
Artificial neural network
To compare the findings on HRM and swallowing sounds
Internal validation only
v4.0
Popa et al, 2024 []
Romania
CNN ensemble (LLMn‑assisted)
Esophageal motility disorder diagnosis
Precision: 89%
Accuracy: 88%
Recall: 88%
F1-score: 88.5%
Internal validation only
v3.0
Wu et al, 2025 []
China
Multi-model CNN attention ensemble
Esophageal motility disorder diagnosis
Internal validation only
v4.0
aCharacteristics and outcomes of 17 included studies evaluating artificial intelligence for high-resolution manometry interpretation (2013-2025). Studies encompassed 4588 patients from 6 countries (United States, Romania, Germany, Iran, China, and multicenter European studies) with sample sizes ranging from 15 to 2161 participants. The table presents: study design (retrospective, prospective, or validation studies), patient population characteristics, artificial intelligence methodology used (traditional machine learning vs deep learning approaches), specific diagnostic tasks (eg, Chicago Classification diagnosis, integrated relaxation pressure classification, and swallow type identification), reference standards used for model training or validation, diagnostic performance metrics (accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve), and key findings.
bAI: artificial intelligence.
cMBSImP: Modified Barium Swallow Impairment Profile.
dAUROC: area under the receiver operating characteristic curve.
eGERD: gastroesophageal reflux disease.
fAIMplot: automated impedance manometry analysis.
gAIM: automated impedance manometry.
hICC: intraclass correlation coefficient.
iUES: upper esophageal sphincter.
jHRM: high-resolution manometry.
kCNN: convolutional neural network.
lIRP: integrated relaxation pressure.
mLSTM: long short-term memory.
nLLM: large language model.
Time Trend of AI Application in HRM Interpretation
The application of AI to HRM interpretation has shown continuous evolution since 2013. Early pioneers such as Hoffman et al (2013) [] applied artificial neural networks to pharyngeal HRM classification, achieving 86.5%-94% accuracy with 335 swallows. During this initial period (2013-2016), researchers focused primarily on automating specific parameter measurements. Rohof et al (2014) [] created the automated impedance manometry analysis automated analysis system with excellent reproducibility (intraclass correlation coefficient: 0.94-0.95), and Jungheim et al (2016) [] applied machine learning to calculate upper esophageal sphincter restitution times.
A methodological shift occurred around 2018 when researchers began adopting deep learning approaches. Jell et al (2020) [] achieved 97.7% accuracy in automated swallow detection using supervised machine learning. The period from 2020-2022 saw widespread adoption of CNNs. Czako et al (2021) [] achieved 97% accuracy for integrated relaxation pressure (IRP) classification using InceptionV3 (Google LLC) CNN with 2437 images. Kou et al (2021) [] developed both an unsupervised variational autoencoder analyzing 32,415 swallows from 2161 patients and a supervised long short-term memory network achieving 83% accuracy []. Wang et al (2021) [] implemented temporal modeling with Bidirectional Convolutional long short-term memory networks, reaching 91.32% overall accuracy. Romanian researchers, including Surdea-Blaga et al (2022) [] and Popa et al (2022) [], achieved 86% and 94% accuracy, respectively, for Chicago Classification automation.
Recent studies from 2023 onwards have explored increasingly sophisticated and diverse approaches. Zifan et al (2023) [] used shallow machine learning approaches, including logistic regression, random forests, and k-nearest neighbors, to analyze distension-contraction patterns in 60 patients with functional dysphagia, achieving 91.7% accuracy with logistic regression for proximal segments and 90.5% with random forests for distal segments. Rafieivand et al (2023) [] developed a fuzzy framework with graphical neural network interpretation, achieving 78% single-swallow accuracy but 92.54% patient-level accuracy in 67 patients. Zifan et al (2024) [] further refined their approach using support vector machines to analyze distension-contraction plots, achieving an AUROC of 0.95 in 60 patients. Lankarani et al (2024) [] pioneered noninvasive acoustic analysis combined with AI, achieving 97% accuracy for IRP prediction in 43 patients. Most recently, studies have incorporated large language models, with Popa et al (2024) [] integrating Gemini with deep learning, while Wu et al [] (2025) developed mixed attention ensemble approaches ().
Diagnostic Accuracy Across Studies
Overall diagnostic accuracies ranged from 78% to 97% across the 17 included studies. The highest accuracies were achieved for specific applications: IRP classification (97%) [], acoustic IRP prediction (97%) [], and swallow detection (97.7%) []. For Chicago Classification automation, accuracy varied from 86% to >93% [,]. Functional dysphagia studies demonstrated segment-specific performance differences, with Rafieivand et al [] highlighting the importance of patient-level versus swallow-level accuracy (92.54% vs 78%).
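The gap between patient-level and swallow-level accuracy noted above arises because per-swallow predictions are typically aggregated into a single diagnosis per patient, often by majority vote, so a patient can be classified correctly even when a sizable fraction of individual swallows are mislabeled. A minimal sketch of that aggregation (the labels and counts are illustrative, not the classification scheme of any included study):

```python
# Sketch: aggregating swallow-level predictions into one patient-level
# diagnosis by majority vote. Labels and counts are illustrative.
from collections import Counter

def patient_level_prediction(swallow_preds):
    """Return the most frequent label across a patient's swallows."""
    return Counter(swallow_preds).most_common(1)[0][0]

# 6 of 10 swallows labeled correctly (60% swallow-level accuracy for this
# patient), yet the majority vote still yields the correct diagnosis.
swallows = ["achalasia"] * 6 + ["normal"] * 4
print(patient_level_prediction(swallows))  # → achalasia
```

Averaged over a cohort, this aggregation is why patient-level accuracy (92.54% in Rafieivand et al) can substantially exceed swallow-level accuracy (78%).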
Notably, none of the studies provided detailed performance metrics for individual Chicago Classification categories, such as achalasia subtypes or specific motility disorders. This absence of disorder-specific sensitivity and specificity data limits understanding of AI performance across the full spectrum of esophageal pathology and represents a critical gap for clinical implementation.
Methodological Quality
QUADAS-2 assessment revealed variable methodological quality across the 17 included studies. For the patient selection domain, no studies demonstrated low risk of bias; 14 (82%) studies showed unclear risk, primarily due to unreported sampling methods, and 3 (18%) showed high risk. Hoffman et al [] included only disordered cohorts without healthy controls; Jungheim et al [] tested only healthy volunteers, limiting representativeness; and Lankarani et al [] had a small, specialized cohort.
Table 2. QUADAS-2a methodology quality assessment for included studiesb.

Study | Patient selection | Index test | Reference standard | Flow and timing
Rohof et al, 2013 [] | U: sampling method not reported | U: calibrated on the same dataset, raising overfitting concerns | U: reproducibility focus, not diagnostic | L: complete data, no losses
Hoffman et al [] | H: disordered cohort only, without healthy controls | L | L | L
Jungheim et al, 2016 [] | H: healthy only; not representative | U: small n=15, overfit concern | L: reference standard measurements (eg, UESf metrics) and experienced assessors | L: all volunteer data used
Jell et al, 2020 [] | U: sampling method not reported | L: supervised machine learning, clear model | L: expert annotation | L: all data included
Czako et al, 2021 [] | U: sampling method not reported | L: InceptionV3 (Google LLC) with held-out test | L: expert Chicago-consistent labels | U: 8 patients excluded; completeness uncertain
Kou et al, 2021 [] | U: unclear enrollment method | L: variational autoencoder | H: no validated reference standard | L: all data included
Kou et al, 2022 [] | U: unclear enrollment method | L: separate test set; blinded automated inference | L: expert Chicago-consistent labels | L: all data included
Wang et al, 2021 [] | U: unclear enrollment method | L: train, validation, and test separation | L: expert Chicago-consistent labels | L: all data included
Kou et al, 2022 [] | U: unclear enrollment method | L: independent test cohort; rule-based aggregation of swallow-level models | L: expert Chicago-consistent labels | L: all data included
Surdea-Blaga et al, 2022 [] | U: no explicit enrollment stated | L: CNNsg with hold-out evaluation | L: expert Chicago-consistent labels | L: all data included
Popa et al, 2022 [] | U: spectrum bias | L: CNN with internal split | L: expert Chicago-consistent labels | H: excluded indeterminate cases
Rafieivand et al, 2023 [] | U: single-center, small n; sampling not described | L: composite (graph + fuzzy) model | L: expert Chicago-consistent labels | L: all data included
Zifan et al, 2023 [] | U: unclear enrollment method | L: multiple machine learning models with cross-validation | U: details of reference adjudication limited | L: all data included
Zifan et al, 2024 [] | U: unclear enrollment method | L: multiple machine learning models with cross-validation | U: details of reference adjudication limited | L: all data included
Lankarani et al, 2024 [] | H: small, specialized cohort | L: artificial neural network model | L: expert Chicago-consistent labels | L: all data included
Popa et al, 2024 [] | U: unclear enrollment method | L: LLMh-assisted pipeline | L: expert Chicago-consistent labels | L: all data included
Wu et al, 2025 [] | U: unclear enrollment method | L: ensemble with cross-validation or hold-out | L: expert Chicago-consistent labels | L: all data included
aQUADAS-2: Quality Assessment of Diagnostic Accuracy Studies-2.
bQuality Assessment of Diagnostic Accuracy Studies-2 evaluation of methodological quality and risk of bias for 17 included artificial intelligence studies in high-resolution manometry (2013-2025). Assessment evaluated four domains: (1) patient selection—risk of bias from inappropriate patient selection, exclusions, or case-control design; (2) index test—risk of bias from artificial intelligence model training or validation procedures and threshold determination; (3) reference standard—risk of bias from expert interpretation methods and blinding; and (4) flow and timing—risk of bias from incomplete data or variable intervals between index test and reference standard. Each domain was rated as low risk (L), high risk (H), or unclear risk (U) of bias. Applicability concerns assessed whether study design, patient population, artificial intelligence methodology, or reference standards differed from the review question. The table demonstrates predominant unclear risk in patient selection (14/17, 82% of studies) due to inadequate reporting of recruitment methods, while the index test domain showed the strongest methodological rigor (88% low risk).
cH: high risk.
dL: low risk.
eU: unclear risk.
fUES: upper esophageal sphincter.
gCNN: convolutional neural network.
hLLM: large language model.
The index test domain showed the strongest methodological rigor, with 15 (88%) studies demonstrating low risk of bias through appropriate model training and validation separation. Only 2 (12%) studies showed unclear risk: Rohof et al [] due to calibration on the same dataset raising overfitting concerns, and Jungheim et al [] due to the small sample size (n=15), creating uncertainty in algorithm performance.
For the reference standard domain, 13 (76%) studies had a low risk of bias, using expert-determined Chicago Classification labels. A further 3 (18%) studies showed unclear risk: Rohof et al [] focused on automated metric agreement rather than diagnostic ground truth, and both studies by Zifan et al [,] had limited details on reference adjudication. One study by Kou et al [] showed high risk, as it lacked a validated reference standard for unsupervised clusters.
Flow and timing assessment revealed low risk in 15 (88%) studies, with all patient data included in analyses. One study showed unclear risk (Czako et al []) due to the exclusion of 8 patients with probe-placement failure, and 1 study (Popa et al []) demonstrated high risk by excluding indeterminate cases from analysis, introducing potential spectrum bias.
The predominance of unclear risk in patient selection highlights a systematic reporting deficiency across the literature, with most studies failing to document recruitment and enrollment methods adequately. This pattern, combined with the complete absence of external validation noted elsewhere, raises concerns about the generalizability and real-world applicability of these AI systems.
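The domain tallies reported above follow mechanically from the per-study ratings; a minimal sketch, encoding the patient selection column ("U" = unclear, "H" = high, "L" = low risk of bias):

```python
from collections import Counter

# Patient selection ratings for the 17 included studies, as tallied above:
# 14 unclear, 3 high, and no low-risk ratings.
ratings = ["U"] * 14 + ["H"] * 3

counts = Counter(ratings)
share = {grade: round(100 * n / len(ratings)) for grade, n in counts.items()}
print(counts["U"], share["U"])  # 14 82
print(counts["H"], share["H"])  # 3 18
```

The same tally applied to the other three domains reproduces the 88% (index test), 76% (reference standard), and 88% (flow and timing) low-risk figures.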
Secondary Findings
None of the 17 included studies performed external validation using datasets from different institutions or periods. All studies relied on internal validation methods, such as train-test splits or k-fold cross-validation. This complete absence of external validation represents a critical limitation in assessing the generalizability of AI models for HRM interpretation. Studies using k-fold cross-validation [,,,,] reported more conservative performance estimates than simple train-test splits, suggesting potential overfitting in single-split validation approaches.
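The distinction matters because a single train-test split scores a model on one held-out subset, whereas k-fold cross-validation rotates the held-out role so that every sample is tested exactly once and fold scores are averaged. A minimal sketch of the fold mechanics (the fold counts are illustrative):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if not (start <= i < start + size)]
        yield train, test
        start += size

# Every one of the 10 samples appears in exactly one test fold, so the
# averaged fold scores evaluate on all data -- unlike a single split,
# whose estimate depends on which subset happened to be held out.
folds = list(kfold_indices(10, 3))
print([test for _, test in folds])  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Averaging over rotated test folds is what makes cross-validated estimates less sensitive to a single lucky (or unlucky) split.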
Discussion
Principal Findings
The systematic synthesis of current evidence reveals that AI applications in HRM have demonstrated strong technical performance, with diagnostic accuracies ranging from 78% to 97%, while facing substantial translational challenges. The evolution from traditional machine learning algorithms (86.5%-94% accuracy) to deep learning architectures capable of 97% accuracy for specific tasks represents significant technological progress [,,]. These advances occur within the broader context of AI transformation in gastroenterology, where similar trajectories have been observed in colonoscopy, capsule endoscopy, and inflammatory bowel disease assessment, suggesting that the integration of AI into clinical gastroenterology practice is inevitable rather than speculative [,].
The innovation of AI in HRM extends beyond mere automation. These systems represent a major change in how we approach esophageal motility diagnostics [-], offering solutions to important clinical needs: the global shortage of motility experts, the need for rapid and consistent interpretation [], and the potential for telemedicine integration to serve underserved areas [,].
The diagnostic accuracy achieved by current AI systems, particularly for IRP classification and automated Chicago Classification, addresses a fundamental limitation of HRM interpretation: interobserver variability. AI systems maintain consistent diagnostic criteria application while human experts demonstrate significant intraobserver variability on repeated assessments. This consistency could enable more reliable phenotyping of esophageal motility disorders, facilitating precision medicine approaches that move beyond categorical diagnoses to individualized pathophysiological assessment. The superior performance of AI in quantitative parameter calculation eliminates measurement variability that has plagued HRM interpretation since its inception [].
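Interobserver agreement of the kind at issue here is typically quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch (the two raters' diagnostic labels below are hypothetical):

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    # Expected agreement if each rater assigned labels independently
    # according to their own marginal label frequencies.
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                   for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical: two readers classify the same 6 studies.
rater_a = ["achalasia", "achalasia", "IEM", "IEM", "achalasia", "IEM"]
rater_b = ["achalasia", "IEM", "IEM", "IEM", "achalasia", "achalasia"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.33
```

Here raw agreement is 67%, yet kappa is only 0.33 once chance agreement is removed; a deterministic AI system compared against itself would score kappa = 1.0, which is the consistency advantage described above.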
These accuracy levels have important implications for clinical practice. With health care systems facing increasing pressure to reduce costs while improving outcomes, AI-enabled HRM interpretation could decrease repeat procedures and reduce unnecessary testing costs [,]. Moreover, the consistent application of diagnostic criteria could reduce misdiagnosis-related treatment failures that currently affect a considerable number of patients with esophageal motility disorders [,].
However, the apparent success of AI systems must be contextualized within significant methodological limitations identified through quality assessment. Most critically, no studies demonstrated low risk of bias in patient selection, with 82% (14/17) showing unclear risk due to unreported sampling methods and 18% (n=3) showing high risk due to biased cohort selection [,,]. This systematic deficiency in documenting recruitment and enrollment methods raises fundamental questions about the representativeness of training datasets. The complete absence of external validation across all 17 studies compounds these concerns about generalizability. Internal validation consistently overestimates model performance, and the lack of testing on datasets from different institutions, HRM systems, or patient populations means we have no evidence of real-world performance [].
The complete absence of prospective clinical trials represents the most critical barrier to clinical translation. While retrospective studies demonstrate technical feasibility with accuracies of 78%-97%, these controlled environments fail to capture the complexities of real-world clinical practice. Prospective trials are essential to evaluate: (1) how AI systems perform with real-time data acquisition variability, (2) whether AI recommendations alter clinical decision-making, (3) patient outcomes following AI-guided treatment, and (4) integration challenges within existing clinical workflows. Without such evidence, even the most accurate AI models remain research tools rather than clinical instruments [-].
The evolution through distinct phases of AI development in HRM mirrors broader trends in medical AI but also reveals unique challenges specific to esophageal motility assessment. The transition from traditional machine learning to deep learning approaches yielded substantial performance improvements, yet the “black box” nature of deep learning models poses particular challenges in a field where pathophysiological understanding drives therapeutic decision-making []. Clinicians require not just diagnostic labels but mechanistic insights that inform treatment selection between medical therapy, endoscopic intervention, or surgical management. The development of explainable AI models that provide interpretable features and confidence metrics represents a critical priority for clinical acceptance []. Recent advances in attention mechanisms and gradient-based visualization techniques, as demonstrated in the Popa et al [] study using LIME (Local Interpretable Model-Agnostic Explanations), offer promising approaches for making AI decision-making transparent and clinically meaningful.
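LIME itself fits local surrogate models around individual predictions; a related and simpler model-agnostic idea is permutation importance, which measures how much accuracy drops when one input feature is scrambled. A minimal sketch, using a deterministic cyclic shift in place of random shuffling (the toy model and its IRP-like threshold are hypothetical):

```python
def cyclic_shift(col):
    """Deterministic stand-in for a random permutation: rotate by one."""
    return col[-1:] + col[:-1]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def feature_importance(model, X, y, feature_idx):
    """Model-agnostic importance: accuracy drop after permuting one feature."""
    base = accuracy(y, [model(row) for row in X])
    shifted = cyclic_shift([row[feature_idx] for row in X])
    X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
              for row, v in zip(X, shifted)]
    return base - accuracy(y, [model(row) for row in X_perm])

# Toy "classifier" that flags abnormal relaxation when the first feature
# (an IRP-like value, hypothetical units) exceeds 15; the second is noise.
model = lambda row: 1 if row[0] > 15 else 0
X = [[20, 3], [5, 7], [22, 1], [4, 9]]
y = [1, 0, 1, 0]
print(feature_importance(model, X, y, 0))  # 1.0: the model relies on feature 0
print(feature_importance(model, X, y, 1))  # 0.0: the noise feature is unused
```

Surfacing which manometric features drive a prediction, rather than the label alone, is the kind of interpretability clinicians need for treatment selection.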
The integration of multiple diagnostic modalities through AI platforms addresses a longstanding limitation of isolated HRM interpretation. The combination of manometric, impedance, and complementary data provides a more comprehensive assessment of esophageal function than any single modality alone []. AI systems excel at synthesizing these complex, multidimensional datasets, potentially revealing pathophysiological patterns invisible to conventional analysis. The Zifan et al (2023 [] and 2024 []) work on distension-contraction plots illustrates how AI can extract diagnostic value from data presentations that challenge human interpretation. This capability becomes particularly relevant with the Chicago Classification version 4.0 emphasis on provocative testing and positional changes, which generate substantially more data requiring integration and interpretation [].
The absence of disorder-specific performance metrics across all 17 studies severely limits clinical applicability. While overall accuracy appears promising (86%-97%), clinicians need to know how AI performs for specific conditions: distinguishing achalasia subtypes (critical for treatment selection), detecting subtle ineffective esophageal motility (often missed by novices), or identifying rare disorders such as jackhammer esophagus. A system with 95% overall accuracy but poor performance in type II achalasia, for instance, could lead to inappropriate treatment recommendations. Future studies must report sensitivity and specificity for each Chicago Classification category to enable informed clinical decision-making.
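The concern can be made concrete: per-category sensitivity and specificity are computable one-vs-rest from the very same predictions that yield overall accuracy, so reporting them adds no data-collection burden. A minimal sketch with hypothetical labels, where 90% overall accuracy hides 50% sensitivity for one category:

```python
def per_class_metrics(y_true, y_pred, classes):
    """One-vs-rest sensitivity and specificity per diagnostic category."""
    out = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        tn = sum(t != c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        out[c] = {
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        }
    return out

# Hypothetical cohort: 8 normal studies, 2 type II achalasia, one of which
# the model misses -- overall accuracy 0.9, achalasia sensitivity only 0.5.
y_true = ["normal"] * 8 + ["type II achalasia"] * 2
y_pred = ["normal"] * 8 + ["normal", "type II achalasia"]
metrics = per_class_metrics(y_true, y_pred, ["normal", "type II achalasia"])
print(metrics["type II achalasia"])  # sensitivity 0.5 despite 90% overall accuracy
```

This is exactly the failure mode described above: an apparently accurate system that misclassifies half the cases in a treatment-critical category.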
Implementation barriers identified across studies reveal a complex interplay of technical, regulatory, clinical, and economic factors. The incompatibility with existing HRM systems reflects the proprietary nature of medical device software and the lack of interoperability standards. The regulatory uncertainty surrounding AI medical devices requires proactive engagement between developers, clinicians, and regulatory agencies to establish appropriate evaluation frameworks [,]. Despite these barriers, the economic rationale for AI implementation is strong. High-volume centers could achieve cost-effectiveness through improved workflow efficiency and reduced need for expert consultation [,,], though specific economic analyses are needed to quantify these benefits. The lack of specific reimbursement codes for AI-assisted interpretation creates financial uncertainty that discourages adoption []. The potential for AI to enable task-shifting from specialists to general gastroenterologists could address workforce shortages and improve access to motility assessment, particularly in underserved areas.
The ethical implications of AI implementation in HRM diagnostic practice deserve careful consideration []. The potential for algorithmic bias, particularly affecting populations underrepresented in training datasets, could exacerbate existing health care disparities. The predominance of studies from North American, European, and select Asian centers raises concerns about applicability to African, Latin American, and other underrepresented populations with different disease phenotypes and genetic backgrounds []. Development of quality assurance programs that monitor AI performance and identify edge cases requiring human review will be essential for maintaining patient safety.
Moving from laboratory validation to clinical implementation requires addressing multiple translational gaps simultaneously. First, prospective multicenter trials must demonstrate that AI systems maintain performance across diverse patient populations, HRM equipment, and clinical settings. Second, health economic analyses must quantify whether efficiency gains justify implementation costs—a critical requirement for hospital administrator buy-in and insurance coverage. Third, regulatory pathways need clarification: Should AI-HRM systems be classified as clinical decision support tools or diagnostic devices? Each classification carries different validation requirements and liability considerations. Finally, implementation science research must address workflow integration, user training requirements, and change management strategies to ensure successful adoption [].
Future priorities must focus on multicenter validation studies, development of explainable AI models, integration with evolving diagnostic frameworks, and systematic addressing of regulatory and economic barriers. The ultimate success of AI in HRM will depend not on technological sophistication alone but on thoughtful integration that preserves clinical judgment while enhancing diagnostic accuracy and efficiency. To achieve clinical translation, the field must transition from technical validation to clinical validation through (1) prospective trials comparing AI-assisted versus standard interpretation on patient outcomes, (2) disorder-specific performance benchmarking across all Chicago Classification categories, (3) cost-effectiveness analyses demonstrating economic value, (4) regulatory sandbox programs allowing controlled real-world testing, and (5) implementation science studies optimizing integration strategies. Until these translational requirements are met, AI in HRM will remain a promising technology awaiting clinical realization.
Study Limitations
This systematic review has several limitations that should be considered when interpreting the findings. First, the heterogeneity in AI methodologies, patient populations, and outcome definitions precluded meta-analysis, limiting our ability to provide pooled estimates of diagnostic accuracy. Second, we excluded non-English language publications, potentially missing relevant studies from non–English speaking countries. Third, the absence of standardized reporting guidelines for AI studies in HRM made quality assessment challenging, particularly regarding technical aspects of model development. Fourth, publication bias could not be formally assessed due to the diversity of study designs. Fifth, the lack of clinical outcome data across all studies prevented assessment of the real-world impact of AI implementation on patient care, treatment decisions, and health care costs. Finally, critical limitations include the complete absence of low-risk patient selection across all studies, the lack of disorder-specific performance metrics for individual Chicago Classification categories, the absence of prospective clinical trials, no cost-effectiveness analyses, and insufficient direct comparisons between AI and human interpreters using standardized metrics. These gaps collectively limit our ability to assess the true clinical utility and implementation readiness of AI systems in HRM interpretation.
Conclusions
This systematic review provides comprehensive evidence that AI applications in HRM have achieved remarkable technical capabilities while facing substantial challenges in clinical translation. The diagnostic accuracies of 78%-97% demonstrate the potential for AI to standardize and enhance HRM interpretation. However, the complete absence of external validation, systematic deficiencies in patient selection documentation, and lack of clinical outcome studies highlight the critical gap between technological capability and clinical utility. Additionally, the limited reporting of patient demographics across included studies—reflecting the methodological focus of AI development papers—represents an ongoing challenge for assessing generalizability across diverse populations. Future AI validation studies should systematically report demographic characteristics, including age, sex, race or ethnicity, and geographic location, to enable evaluation of algorithmic performance across patient subgroups and identify potential disparities in diagnostic accuracy that could affect equitable clinical implementation.
All the data are accessible and available upon reasonable request to the corresponding author.
This research was supported by the Bio&Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT; No. RS-2023-00223501).
None declared.
Edited by S Brini; submitted 03.Oct.2025; peer-reviewed by X Liang, PJ Kahrilas, S Ho Choi; comments to author 23.Oct.2025; revised version received 06.Nov.2025; accepted 06.Nov.2025; published 27.Nov.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.