NEW YORK, Nov 26 (Reuters) – Jamie is enjoying some well-deserved time off, but the Reuters markets team will still keep you up to date on what animated markets today. I’d love to hear from you so please feel free to reach out at saqib.ahmed@thomsonreuters.com.
Today’s Key Market Moves
On Wall Street the benchmark S&P 500 (.SPX) and tech-heavy Nasdaq (.IXIC) were up about 0.7% and 0.8%, respectively. The Dow was 0.7% higher
U.S. Treasury yields were mixed on Wednesday as stronger-than-expected economic data fueled selling but a sharp rally in UK government bonds helped limit the downside
The dollar fell against the euro but appreciated against the battered Japanese yen
New York crude oil futures rose, pulling away from near one-month lows
Gold bullion extended its rise to a near two-week high
Today’s Key Reads
Wall St extends rally on growing bets for December Fed rate cut
Small US retailers face holiday supply chaos due to Trump tariffs
World’s central banks are wary of AI and struggling to quit the dollar, survey shows
US weekly jobless claims at seven-month low as layoffs remain low
UK’s Reeves comes back for more tax to bolster finances
It’s all about the Fed
Wall Street kept the party going for a fourth straight session, with investors betting that the Fed will deliver a rate cut in December.
Tech stocks led the bounce after getting hammered in mid-November. Dell’s bullish AI-server forecasts helped lead the charge. The market action proved once again that “buy the dip” is alive and well on Wall Street.
AI-heavyweight Nvidia rebounded from a 2.6% drop in the prior session and declines in three of the past four, to rise more than 1% on Wednesday.
Keep this up and the S&P 500 could avoid breaking its impressive six-month winning streak.
Expectations for rate cuts have been reinforced in recent days after comments from San Francisco Federal Reserve Bank President Mary Daly and Fed Governor Christopher Waller in support of a December cut.
This even as fresh data showed the job market is holding up just fine — which means the Fed has less reason to rush those rate cuts. Jobless claims actually fell to a seven-month low last week.
For now, the economy is pulling off a neat balancing act: not crashing, but just soft enough to give the Fed room to keep cutting rates.
Still, investors would do well to remember that Friday’s short trading session could spring a surprise. Thin crowds and low liquidity can make for wild swings in either direction. Don’t say we didn’t warn you.
Chart: jobless claims (initial claims and continuing claims)
Chart: What will the Federal Reserve do with interest rates?
What could move markets tomorrow?
(U.S. markets are closed on Thursday, November 27, for Thanksgiving Day)
Statistics Canada is set to release third-quarter gross domestic product data.
Opinions expressed are those of the author. They do not reflect the views of Reuters News, which, under the Trust Principles, is committed to integrity, independence, and freedom from bias.
Trading Day is also sent by email every weekday morning. Think your friend or colleague should know about us? Forward this newsletter to them. They can also sign up here.
Reporting by Saqib Iqbal Ahmed in New York, editing by Deepa Babington
Omnicom Completes Acquisition of Interpublic, Forming the World’s Leading Marketing and Sales Company, Built for Intelligent Growth in the Next Era Omnicom Group
To Win, Omnicom Must Kill Its Darlings ADWEEK
Intended or not, the new Omnicom will forever change agencies as we’ve known them Digiday
IPG-Omnicom merger nears end; India leadership by Dec 2 | PUMA appoints Ramprasad Sridharan MD | Govt slams gaming firms in SC over PROGA Storyboard18
Omnicom set to complete Interpublic acquisition as EU approves deal IBC.org
Adolescents and young adults experience high rates of mental distress, with substance use and mood-related and anxiety disorders being among the most prevalent issues []. Significant mental distress triggered by the challenges encountered during this transitional stage in life, such as financial instability, interpersonal relationships, and career development [], has been implicated in adolescents and young adults’ decreased quality of life and increased suicide risk []. Adolescents and young adults also exhibit elevated rates of health-risky behaviors, such as poor dietary choices, inadequate sleep, and physical inactivity []. These behaviors are intricately linked with biological and psychosocial factors, including neurological changes, adverse childhood experiences, and peer pressure, which in turn exacerbate the incidence of chronic disease and mental distress among adolescents and young adults []. Despite these alarming trends, adolescents and young adults are less likely to seek health support, particularly for sensitive topics such as sexual and physical abuse, sexually transmitted infections and HIV, contraception methods, and substance use []. The majority of adolescents and young adult clinical patients reported unmet supportive care needs, with psychological needs being the most frequently cited, followed by needs of physical and daily living [,]. Moreover, traditional pediatric and adult interventions are predominantly disease-centric and often fail to address the nuanced, age-specific needs of adolescents and young adults []. Unlike children, whose parents typically make health care decisions on their behalf, or mature adults, who are expected to independently manage their appointments and treatments, adolescents and young adults occupy a transitional phase that shares characteristics with both groups but fully aligns with neither []. 
They have limited experience navigating health care systems or seeking external support, while simultaneously grappling with issues of identity, independence, and major life milestones []. These challenges highlight significant gaps in current promotive efforts targeting adolescents and young adults, which often struggle to provide effective, age-appropriate care due to workforce shortages and time constraints, underscoring the urgent need for tailored, flexible interventions that can address the complex and diverse health needs of this population [].
Chatbots are innovative digital tools that simulate conversations with users through a dialog interface, generating responses based on stored patterns []. Emerging evidence suggests that chatbots can effectively mitigate symptoms of mental health problems and encourage positive health behaviors [,]. For instance, studies have highlighted the efficacy of chatbot interventions in delivering cognitive-behavioral therapy, mindfulness-based practices, and motivational interviewing techniques for people with psychological distress and drug addiction [,]. Moreover, chatbots have also been shown to improve user adherence and satisfaction with treatment, which could be essential factors in achieving sustained long-term health outcomes [,]. Adolescents and young adults are particularly well-positioned to benefit from chatbots, given their favorable attitudes and openness to innovative health care solutions []. This population often experiences increased vulnerability related to identity formation, academic pressures, and relationship dynamics, while simultaneously possessing strong self-directed learning abilities and a preference for autonomy, making them more receptive to digital health solutions compared to children and older adults []. Autonomous chatbots hold a unique advantage by being perceived not only as easily accessible and nonjudgmental [], but also as capable of fostering a sense of peer support, which is a critical source of empowerment that provides invaluable information and psychological solace to adolescents and young adults [].
Existing reviews on the effectiveness of chatbots in health care have primarily focused on general populations, with limited focus on adolescents and young adults [,]. A recent randomized controlled trial (RCT) found that adolescents and young adult users often perceived the chatbot content as irrelevant or too generic, largely due to insufficient tailoring to personal needs []. Given the unique developmental, social, and technological contexts that characterize this demographic, it is necessary to systematically evaluate the evidence regarding chatbot interventions targeting adolescents and young adults. Moreover, the diversity in chatbot designs and targeted health outcomes requires a comprehensive synthesis to uncover limitations and highlight areas for future research within this population. Present studies often conflate chatbots with other types of conversational agents, such as voice-based virtual agents, embodied avatars, and social robots [,], overlooking the unique advantages of chatbots, particularly their ability to encourage adolescents and young adults to discuss sensitive topics anonymously without fear of judgment. This aspect is often less pronounced in interactions with avatars, robots, or conversations embedded in virtual reality, where social cues may inhibit open communication for those experiencing anxiety or discomfort in social situations []. The text-based nature of chatbots not only facilitates rapid information exchange but also allows users to read and review content repeatedly with unlimited, round-the-clock access. This feature enables users to process and reflect on information at their own pace and take positive actions, as it removes the pressure of maintaining a continuous dialog or responding in real time []. Furthermore, chatbots stand out for their accessibility and cost-effectiveness, as they can be deployed on commonly used platforms such as smartphones and tablets. 
This eliminates the need for expensive equipment or immersive environments, significantly enhancing their reach and usability and making them widely available to users across diverse socioeconomic backgrounds and settings [].
Generative artificial intelligence (AI) has brought chatbots like ChatGPT (OpenAI Inc) and Llama (Meta Inc) to the forefront of digital health innovation. These advanced systems, powered by natural language processing (NLP) and large language models, offer enhanced capabilities for processing complex information, enabling more human-like and adaptive responses to self-care needs []. Such flexibility better positions chatbots as promising tools, particularly beneficial for adolescents and young adults who may not proactively seek support from health care professionals or prefer to self-manage their health conditions. At present, there is no established gold standard for engineers to assess the development of chatbots and the quality of information they provide. There is also a lack of systematic evidence regarding their effectiveness for adolescents and young adults across various dialog systems (ie, rule-based, retrieval-based, or generative) and design features (eg, modalities, reminders, and frequency of sessions). These knowledge gaps must be addressed to effectively inform and guide future advancements in the field of chatbot development for health care applications for adolescents and young adults. This systematic review and meta-analysis aims to synthesize the evidence from randomized controlled trials (RCTs) to evaluate the effectiveness of AI chatbots in alleviating mental distress and promoting health-related behaviors among adolescents and young adults. Additionally, this study summarizes key design features of chatbots and examines how these characteristics may moderate intervention outcomes through subgroup analyses and meta-regression. User engagement and experiences with chatbot interactions are also explored and synthesized narratively. 
By addressing these objectives, the review seeks to provide valuable insights for the development and integration of innovative chatbot-based health care solutions, thereby supporting the enhancement of well-being among adolescents and young adults worldwide. The review questions are as follows:
What is the effectiveness of chatbots in alleviating mental distress and promoting health behaviors among adolescents and young adults?
What are the key design features of chatbots, and how do these features impact health outcomes in adolescents and young adults?
How do adolescents and young adults engage with chatbots, and what are their perceptions and experiences during these interactions?
Methods
Protocol Registration and Study Design
The review protocol was prospectively registered in PROSPERO (International Prospective Register of Systematic Reviews; CRD42024603472), and the review adhered to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 guidelines.
Data Sources and Search Strategy
We conducted a systematic search across 8 databases (PubMed, PsycINFO, Cochrane Library, CINAHL, Embase, Web of Science, Scopus, and IEEE Xplore) using a wide array of search terms (Table S1). Both subject headings (eg, MeSH and Emtree) and free-text keywords related to the core concepts, along with their synonyms and variants, were included. Additionally, the reference lists of previous reviews [,] and the included original studies were manually examined to identify any further eligible studies. The search covered records published from January 1, 2014, to January 26, 2025. This timeframe was selected because it marks the period in which chatbots powered by NLP and machine learning, rather than simple rule-based systems, began to see significant development and application in health care. This period also coincides with the widespread adoption of internet-connected mobile devices among adolescents and young adults, a group uniquely shaped by and deeply embedded in this digital landscape, ensuring that the evidence included is both technologically relevant and contextually appropriate to their experiences and behaviors. We fine-tuned our search strategy based on previous systematic reviews [,] to locate sources related to chatbots for alleviating mental distress or promoting health-related behaviors. The search was limited to English-language publications. After duplicates were removed, 2 reviewers independently screened all titles and abstracts for eligibility. Subsequently, 2 reviewers also performed the full-text review, with any disagreements resolved through consultation with a third reviewer.
Eligibility Criteria
We developed our eligibility criteria based on the population, intervention, comparison, outcome, study design (PICOS) framework (Table 1):
Population: adolescents and young adults, typically characterized as individuals aged between 15 and 39 years [], in both clinical and nonclinical samples. Given varying definitions of adolescents and young adults by age and to ensure comprehensive inclusion of related studies, we included original research articles if over 50% of participants fell within the 15‐39 years age range, the average age of participants was within this range, or the study explicitly identified its population as “adolescents and young adults.”
Intervention: 2-way interactive chatbots designed primarily to alleviate mental distress or promote health behaviors. These chatbots should operate autonomously without human assistance and serve as the primary component of interventions irrespective of dialog initiatives, interaction modalities, platforms, and settings, but should not be embedded as secondary elements within other technologies, such as virtual reality, robots, or virtual avatars. They may have minor supplementary elements (eg, educational materials) or a simple graphical representation (eg, an icon or avatar), but their primary mode of interaction is through written dialog. Studies focused solely on the development or rationale of chatbot technology, without any empirical evaluation of user-chatbot interaction, were excluded.
Comparator: any control groups that did not involve chatbot technology, such as active controls (eg, treatment as usual), information controls (eg, e-book), and passive controls (eg, waitlist, assessment-only).
Outcome: eligible primary outcomes included mental health outcomes specified in the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5) [], as well as health behaviors, defined as actions taken by individuals that affect health or mortality, such as substance use, physical activity, and dietary habits []. Metrics related to user engagement with chatbots (eg, retention rates and frequency of interactions) and user experience (eg, satisfaction, acceptability, and usability) were also included when reported alongside primary outcomes.
Study design: RCTs. Studies were excluded if they were conference abstracts, preprints without peer review, or if the full text was unavailable. Publications that did not present original research findings, including editorials, letters, comments, trial registrations, and study protocols, were also excluded.
Table 1. Eligibility criteria (PICOS framework).
Category | Inclusion criteria | Exclusion criteria
Population | Studies about adolescents and young adults, shown by any of the following: over 50% of participants were within 15‐39 years; the average age was within 15‐39 years; the study explicitly identified its population as “adolescents and young adults.” | Studies that did not report any information about age groups
Intervention | 2-way interactive chatbots: with the aim of alleviating mental distress or promoting health behaviors; operating autonomously without human assistance; serving as the primary component of the intervention; primary interaction is through written dialog | Chatbots embedded as secondary elements in other technologies (eg, VR, robots, and virtual avatars); studies focused solely on development or rationale without empirical evaluation of user interaction
Comparator | Active controls (eg, treatment as usual); information controls (eg, e-books); passive controls (eg, wait-list, assessment-only) | Control groups that involved another chatbot technology
Outcome | Primary outcomes: mental health outcomes specified in the DSM-5 []; health behaviors (eg, substance use, physical activity, and dietary habits) []. Secondary outcomes: user engagement (eg, retention rates, frequency of interactions); user experience (eg, satisfaction, acceptability, and usability) | Studies that reported only on secondary metrics without any primary outcomes
Study design | RCTs | Conference abstracts; preprints without peer review; unavailable full text; nonoriginal research (eg, editorials, letters, trial registrations, and study protocols)
Abbreviations: PICOS: population, intervention, comparison, outcome, study design. VR: virtual reality. DSM-5: Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition. RCT: randomized controlled trial.
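The population criterion above combines three alternative rules, and a study qualifies if any one of them holds. The sketch below illustrates that logic only; the `StudyRecord` type and its field names are hypothetical and are not taken from the review's actual screening form.

```python
from dataclasses import dataclass
from typing import Optional

AGE_MIN, AGE_MAX = 15, 39  # age range stated in the population criterion


@dataclass
class StudyRecord:
    """Hypothetical screening record; field names are illustrative only."""
    share_in_range: Optional[float]  # proportion of participants aged 15-39, if reported
    mean_age: Optional[float]        # mean participant age in years, if reported
    labelled_aya: bool               # study explicitly names "adolescents and young adults"


def meets_population_criterion(study: StudyRecord) -> bool:
    """Return True if any one of the three population rules is satisfied."""
    majority_in_range = study.share_in_range is not None and study.share_in_range > 0.5
    mean_in_range = study.mean_age is not None and AGE_MIN <= study.mean_age <= AGE_MAX
    return majority_in_range or mean_in_range or study.labelled_aya
```

A record with no age information and no explicit label fails all three rules, which matches the exclusion of studies that did not report any information about age groups.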
Data Extraction
We developed a comprehensive data extraction form in Microsoft Excel. The following data were extracted from all included studies: publication details (title, author, and year), study details (study design, region, and recruitment setting), participant characteristics (sample type, sample size, and demographics), chatbot intervention characteristics (name, duration, therapeutic approach, session, and safety measures), and chatbot design features (deployment, delivery platform, dialog system methods, AI technique, and interaction mode). For quantitative analysis, we extracted outcomes and their measures related to targeted conditions, including mental distress (eg, depressive, anxiety, and psychosomatic symptoms) and health-related behaviors (eg, physical activity, dietary habits, and substance use). We also extracted and narratively synthesized data related to user engagement (eg, frequency of interactions, number of engaged sessions, and active days) and experience (eg, open-ended feedback, satisfaction, and perceived usability) with chatbots. Data were extracted by one reviewer and then cross-checked by a second reviewer. Any disagreements between reviewers were resolved through consensus with the involvement of a third reviewer.
Statistical Analysis
A comprehensive narrative synthesis was conducted to systematically summarize study characteristics, chatbot design features, user engagement metrics, and qualitative findings regarding user experience. This approach involved extracting and thematically analyzing relevant data from included studies to identify patterns, barriers, and facilitators of effective chatbot implementation. To assess the effectiveness of chatbot interventions, we conducted a meta-analysis of RCTs in which participants were randomly assigned to an experimental group receiving a target chatbot intervention or to a control group. We conducted meta-analyses for overall mental distress and for specific symptoms reported by at least 3 trials, including depression, anxiety, positive affect, negative affect, stress, and well-being. Because the included studies spanned a wide range of health-related behaviors, we estimated a pooled effect size for an overall health behavior outcome, which included sleep-related safety behaviors, stress management, mindfulness, cigarette abstinence, and pain coping. Additionally, general outcomes related to psychological and physical health, such as life satisfaction and self-efficacy, were analyzed.
The analyses were conducted using Review Manager (RevMan; The Cochrane Collaboration) 5.4 [] and Stata MP 18 (StataCorp LLC) []. The standardized mean difference (SMD) with a 95% CI was used as the effect size for continuous outcomes because different measurement tools were used for the same outcomes across trials. To combine outcomes reported in continuous and categorical formats, odds ratios were transformed into SMDs []. Heterogeneity among studies was assessed using the I² statistic and the Cochran Q statistic. A random-effects model was used to account for moderate to high heterogeneity across studies. We calculated SMDs from the postintervention means and SDs. When both intention-to-treat and completer analyses were reported, the former was prioritized for analysis. For studies with multiarm designs that included multiple experimental or control groups, we combined the means and SDs from the different arms to create a single pair-wise comparison, as suggested by the Cochrane guidelines for integrating multiple groups from a single study []. If a study did not report sufficient data (mean, SD, SE, 95% CI, and sample size) to calculate the SMD, we contacted the corresponding authors for the missing data; studies still lacking the necessary data were excluded from the meta-analysis. For sensitivity analysis, we used a “leave-one-out” method to identify influential studies and assess the robustness of estimates.
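The steps described above (computing an SMD, re-expressing an odds ratio as an SMD, merging arms of a multiarm trial, random-effects pooling with Q and I², and leave-one-out sensitivity analysis) can be sketched in a few functions. This is an illustrative sketch, not the RevMan or Stata implementation: it assumes the Hedges small-sample correction for the SMD, the ln(OR) × √3/π conversion, the Cochrane Handbook formula for combining two groups, and a DerSimonian-Laird estimator for the between-study variance. All function names are hypothetical.

```python
import math


def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """SMD with small-sample correction and its approximate variance."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    j = 1 - 3 / (4 * (n1 + n2) - 9)  # correction factor (assumed Hedges form)
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    return j * d, j**2 * var_d


def or_to_smd(odds_ratio):
    """Re-express an odds ratio as an SMD: ln(OR) * sqrt(3) / pi."""
    return math.log(odds_ratio) * math.sqrt(3) / math.pi


def combine_arms(n1, m1, sd1, n2, m2, sd2):
    """Merge two arms into a single group (sample size, mean, SD)."""
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    ss = (n1 - 1) * sd1**2 + (n2 - 1) * sd2**2 + n1 * n2 / n * (m1 - m2) ** 2
    return n, mean, math.sqrt(ss / (n - 1))


def random_effects(effects, variances):
    """DerSimonian-Laird pooled estimate with Cochran Q and I^2 (in percent)."""
    w = [1 / v for v in variances]
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {"smd": pooled, "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
            "q": q, "i2": i2, "tau2": tau2}


def leave_one_out(effects, variances):
    """Re-pool the data once per study, omitting that study each time."""
    return [
        random_effects(effects[:i] + effects[i + 1:],
                       variances[:i] + variances[i + 1:])["smd"]
        for i in range(len(effects))
    ]
```

In this sketch, a multiarm trial would first pass its arms through `combine_arms`, then through `hedges_g`, and the resulting effects and variances would be pooled with `random_effects`. The spread of the values returned by `leave_one_out` indicates how much any single study drives the pooled estimate.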
We conducted a series of subgroup analyses on the primary outcomes to explore potential moderators. Informed by prior research [], we examined three study characteristics (ie, control group types, intervention duration, and target sample), as well as four chatbot features (ie, dialog system methods, reminders, interaction mode, and deployment formats) as potential moderators of intervention effects. Specifically, we explored three types of control group (ie, active, information, and passive controls), considering that differences in the nature of participant engagement could influence observed effect sizes; intervention duration was examined as it may impact the sustainability of chatbot effects; the target sample (ie, clinical, subclinical, and nonclinical) was included to account for baseline differences in health status that could moderate intervention outcomes []. In addition, 3 primary dialog system methods for input processing and response generation were examined: rule-based, retrieval-based, and generative models []. Rule-based chatbots operate on a predefined set of rules, producing predictable responses that are inherently limited in scope. Retrieval-based chatbots select responses from a predefined database of possible answers, enabling some level of contextual understanding while remaining constrained by the availability of their resources. Generative chatbots learn patterns from large datasets and create new, dynamic content, offering greater flexibility to handle diverse and complex conversations []. Further, we classified chatbots as those with reminders or those without. Chatbot reminders can serve various functions, including login prompts, system greetings, and mood tracking notifications. For interaction modes, we differentiated between chatbots delivering text-only interactions and those incorporating multimedia materials, such as videos or images. 
Finally, for deployment, we categorized chatbots as either standalone apps or web-based tools, with the latter being integrated into instant messengers or accessed via websites. Additionally, meta-regression analyses were conducted for continuous variables (ie, gender) when there were at least 10 observations available []. Funnel plots and the Egger test were used to explore publication bias for meta-analyses that involved more than 10 studies []. Statistical significance was set at P<.05.
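The difference between the rule-based and retrieval-based dialog systems described in the subgroup plan can be made concrete with a toy sketch. Everything here is hypothetical and for illustration only: the keyword rules, the response bank, the replies, and the use of Jaccard token overlap as the matching score are assumptions, not features of any chatbot in the included trials. A generative system is not sketched because it requires a trained language model.

```python
import re

FALLBACK = "I'm not sure I understood. Could you rephrase that?"

# Rule-based: a fixed list of (pattern, reply) rules, checked in order.
RULES = [
    (re.compile(r"\b(sleep|insomnia)\b", re.I), "Would you like to try a wind-down routine tonight?"),
    (re.compile(r"\b(stress|stressed)\b", re.I), "Let's try a short breathing exercise."),
]


def rule_based_reply(message: str) -> str:
    """Return the reply of the first rule whose pattern matches the message."""
    for pattern, reply in RULES:
        if pattern.search(message):
            return reply
    return FALLBACK


# Retrieval-based: pick the stored reply whose prompt is most similar to the message.
RESPONSE_BANK = {
    "i cannot fall asleep at night": "Keeping a regular bedtime can help. Want some tips?",
    "i feel anxious before exams": "Exam nerves are common. Shall we plan a study break?",
}


def tokens(text: str) -> set:
    """Lowercase word tokens of a text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def retrieval_based_reply(message: str) -> str:
    """Select the reply whose stored prompt has the highest Jaccard overlap."""
    query = tokens(message)

    def jaccard(prompt: str) -> float:
        union = query | tokens(prompt)
        return len(query & tokens(prompt)) / len(union) if union else 0.0

    best = max(RESPONSE_BANK, key=jaccard)
    return RESPONSE_BANK[best] if jaccard(best) > 0 else FALLBACK
```

Both variants can only return text that was written in advance, which is the constraint the Methods attribute to these two system types. The retrieval variant tolerates rewording because it scores similarity rather than requiring an exact keyword match.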
Quality and Risk of Bias
The Cochrane risk of bias tool (RoB 2) was used to assess the risk of bias in the included RCTs. This assessment tool evaluates 5 domains of potential bias: randomization process, deviations from the intended interventions, missing outcome data, measurement of the outcome, and selection of the reported result. For each domain, a trial can be categorized as having a low risk, some concerns, or a high risk of bias. For the overall risk-of-bias judgment, a trial was deemed to have a low risk of bias only if all domains were rated as low risk. Conversely, a trial was judged to have a high risk of bias if it was rated as high risk in any domain. We used GRADEpro GDT software (Evidence Prime, Inc) to evaluate the quality of evidence from meta-analyses, which could be reduced based on 5 key factors: risk of bias, inconsistency, indirectness, imprecision, and publication bias.
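The overall judgment rule stated above can be written as a small function. This is a minimal sketch of the rule as described in this section; the domain keys are hypothetical labels, and returning "some concerns" for every remaining combination is an assumption, since the text only specifies the low and high cases.

```python
DOMAINS = (
    "randomization_process",
    "deviations_from_intended_interventions",
    "missing_outcome_data",
    "measurement_of_the_outcome",
    "selection_of_the_reported_result",
)
LEVELS = {"low", "some concerns", "high"}


def overall_risk_of_bias(judgments: dict) -> str:
    """Derive the overall judgment from the 5 domain-level judgments."""
    missing = set(DOMAINS) - set(judgments)
    if missing:
        raise ValueError(f"missing domain judgments: {sorted(missing)}")
    ratings = [judgments[d] for d in DOMAINS]
    if any(r not in LEVELS for r in ratings):
        raise ValueError("each judgment must be 'low', 'some concerns', or 'high'")
    if any(r == "high" for r in ratings):
        return "high"           # high in any domain gives high overall
    if all(r == "low" for r in ratings):
        return "low"            # low overall only if every domain is low
    return "some concerns"      # assumed default for all other combinations
```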
Results
Search Results
Searches of 8 databases identified 2495 unique citations (Figure 1). After removing duplicates, we excluded 1113 records based on title and abstract screening, resulting in 69 records for full-text review. We additionally included 3 eligible trials identified through reference lists of previous reviews and original studies. A total of 31 studies [-,,,-] met the inclusion criteria and were included in the systematic review for narrative synthesis. Among the 31 studies, 5 randomized trials [,,-] did not report sufficient data for calculating the pooled effect size; thus, 26 randomized trials were included in the meta-analysis [,,,,-].
Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow chart. RCT: randomized controlled trial.
Results of Systematic Review
A total of 29,637 participants from 18 countries and regions were involved in 31 studies [-,,,-], recruited from clinical settings (n=4), community (n=10), online (n=10), and mixed settings (n=7). The majority (n=19) had sample sizes under 200 adolescents and young adults. Most were single-site studies, with 10 [-,,,,,,,] conducted in the United States, 5 in China [,,,,], and only one [] multisite study conducted in Switzerland, Germany, and Austria. Among the 31 studies, 12 involved nonclinical populations [,,-,,-,], 11 included participants with health problems identified via self-report or screening (eg, anxiety, depression, or substance use) [,,,,,-,,], and 8 studies involved clinical samples with diagnosed mental or physical health issues [,,,,,,,]. Eighteen studies explicitly stated a research focus on adolescents and young adults [,,,,-,,,,,,], one of which focused on young cancer survivors [], and 4 studies exclusively supported women with specific circumstances, such as intimate partner violence, pregnancy, and childbirth [,,,]. Intervention duration varied considerably, from several minutes to 4 months, with 15 studies conducting additional follow-up surveys from 2 weeks to 6 months [,,,-,-,,-,,]. Table S2 presents the characteristics of studies included in this review.
We extracted data on the characteristics of the chatbot interventions and their technical design features (Table S3). These chatbots were most commonly designed to improve depressive and anxiety symptoms, which were assessed in 20 [,,,,,,-,-,,,-] and 19 studies [,,,,,-,-,,,-], respectively, followed by 7 studies targeting stress management [,,,,,,]. Specifically, several studies delivered psychotherapy or behavior support for people who experienced substance use and addiction (n=4) [,,,], self-ambivalence and appearance distress (n=3) [,,], attention-deficit/hyperactivity disorder (ADHD) (n=2) [,], sleep disorder (n=2) [,], relationship and social activity problems (n=2) [,], and eating disorder (n=1) []. Cognitive behavioral therapy was the most common therapeutic approach (n=21) [,,,,-,-,,,], followed by mindfulness-based therapy (n=9) [,,,-,,,], motivational interviewing (MI) (n=5) [,,,,], stress coping (n=4) [,,,], acceptance and commitment therapy (n=3) [,,], interpersonal psychotherapy (n=3) [,,], dialectical behavior therapy (n=3) [,,], positive psychology (n=2) [,], and emotion-focused therapy (n=2) [,]. In addition to the core treatment, other notable design features included empathic responses, customization, mood tracking, reflection, accountability, goal-setting, mascot or static avatars, gamified interaction, and problem-solving. Seven studies were tailored to address key challenges unique to adolescents and young adults, such as academic work management, life transitions, relationships [-], body image concerns [,], and self-esteem issues [,], which were particularly salient during this developmental stage.
Regarding the design characteristics of chatbots, instant messenger platforms (eg, Facebook [Meta Platforms] and WeChat [Tencent Holdings Limited]) and standalone smartphone apps emerged as the most popular platforms for delivering chatbot services, featured in 15 [-,,,,,-,,,,] and 13 studies [,-,-,,,,,,], respectively. The remaining 3 studies deployed the chatbots on websites [,,]. Most of the chatbots provided periodic pop-up notifications to remind users to interact with them (n=22). Twenty-one studies supplemented text-based dialog with auditory or visual content [,,,,-,,,,,,,]. Eighteen studies incorporated safety measures in chatbots, such as access to human professionals, a crisis hotline, suicidal ideation monitoring, and referral to local resources [,,,,,,-,-,,-]. The majority of chatbots (n=18) used a rule-based approach to interact with users [,,-,-,,,,,,-], while 10 studies used a retrieval-based system [-,,,,,,,]. Only 3 studies explored generative approaches for chatbot development, using Bidirectional Encoder Representations from Transformers (BERT) and GPT to create real-time responses [,,], and one study used GPT-3.5 to refine the chatbot following its pilot testing phase []. In terms of AI techniques, NLP was the most commonly reported (n=12) and was used to analyze user intent and context, facilitating the selection of appropriate responses [,,,,,,,,,,,]. Additionally, some reports integrated other methodologies, including machine learning (n=7) [-,,,,], natural language understanding (n=5) [,,,,], and deep learning (n=3) [,,], to enhance the chatbots’ learning capacity and contextual comprehension.
Usage data and user engagement with chatbots were tracked in 23 studies through various metrics, including the frequency of interactions or exchanged messages (n=11) [,,,,,,,,,,], the number of engaged sessions or completion rates (n=9) [,,,,,,,,], the length of conversations (n=7) [,,,,,,], the number of active days (n=6) [,,,,,], the number of check-ins (n=3) [,,], and the time period for peak use (n=1) []. More than half of the studies (n=17) reported higher than 20% attrition in the intervention group [,,,,,-,,,,,,,,,]. Two studies analyzed the change in performance of user engagement over a time period [,]. Additionally, 24 studies explored user experiences, using metrics such as satisfaction (n=8) [,,,,,,,], helpfulness (n=5) [,,,,], working alliance (n=5) [,,,,], and acceptability (n=4) [,,,]. Open-ended user feedback was documented in 14 studies [-,,,,,,,,,,,], providing valuable insights into both the strengths and limitations of chatbot interactions. On the positive side, chatbots were frequently praised as effective tools for promoting understanding and awareness of health topics through structured exercises and detailed explanations (n=6) [,,,,,]. Users valued chatbots for their empathy, emotional support, and ability to foster a sense of being heard (n=6) [,,,,,]. Personalization and ease of access were commonly highlighted (n=4) [,,,] with chatbots regarded as a convenient alternative to traditional therapy []. Features such as reminders, weekly summaries, and visually engaging elements like emojis, avatars, and interactive interfaces enhanced the user experience, contributing to adherence and helping users stay on track with their health goals (n=3) [,,]. However, notable challenges were also identified, with repetitive and rigid interactions emerging as a major concern (n=10) [,,,,,,,,,]. 
Users expressed frustration over the inability of chatbots to handle open-ended or unexpected responses (n=6) [-,,,], and some conversations were criticized for being overly general or lacking depth and clarity (n=5) [,,,,]. Technical issues, such as glitches, looping conversations, and slow operations, were frequently reported (n=7) [,,,,,,], disrupting the interaction flow and significantly diminishing overall usability.
Of the 31 studies, only one reported mediators between chatbot interventions and outcomes: visceral anxiety, catastrophic thinking, and fear of food were observed to be significant mediators between chatbot use and gastrointestinal symptom severity (P<.001) and quality of life (P<.001) []. For moderators, one study revealed significant interaction effects of group by ethnicity and by writing behaviors for social activity, stress, and life satisfaction []. Two studies noted that people with more severe baseline physical and mental health symptoms experienced more pronounced benefits of chatbots [,]. Four studies probed the moderating role of user engagement. Specifically, the frequency or number of interactions with the chatbot was positively correlated with the reduction in ADHD symptoms (P=.03) [] and loneliness (P<.006) []. The dosage, measured as engaged sessions, was correlated with improvement in anxiety (P=.06) [] and in depression (P=.08) and quality of life (P=.07) []. Another study revealed that the reported commitment to change behavior significantly increased with time (P<.001), suggesting higher commitment toward the end of the intervention than in the middle or at the start [].
Results of Meta-Analysis
Overall Mental Distress
A total of 21 studies, comprising 2813 participants in the experimental groups and 3116 in the control groups, were included in the meta-analysis for the overall mental distress. Among these, indicators for anxiety (n=18) [,,,,-,-,,,-] and depression (n=17) [,,,,-,-,,-] were most commonly examined, and the remaining assessments included somatic symptoms (n=3) [,,], sleep disorders (n=2) [,], ADHD (n=2) [,], substance use disorders (n=2) [,], and eating disorders (n=1) []. Compared to control conditions, participants interacting with chatbots exhibited significantly greater reductions in the overall mental distress, with an effect size of SMD −0.35 (95% CI −0.46 to −0.24; P<.001) (). The “leave-one-out” sensitivity analysis demonstrated the robustness of the findings, with estimated effect sizes ranging from −0.30 to −0.36 (Figure S11 in ). The results of the funnel plot and Egger test revealed potential publication bias (P=.01), while no additional studies were imputed with the Trim-and-Fill approach and the adjusted effect size (SMD −0.372, 95% CI −0.529 to −0.216) was identical to the observed value, suggesting a negligible impact on the conclusions. The subgroup analyses revealed 4 significant moderators. Studies that targeted subclinical and clinical samples produced larger effect sizes than those for nonclinical populations (P=.003). Chatbots deployed as standalone apps were significantly more effective than those delivered via instant messenger or websites (P=.03). Among different chatbot architectures, generative chatbots demonstrated the largest effect size, followed by retrieval-based and rule-based systems (P=.007). Interestingly, studies comparing chatbots to active control did not show significant group differences, and their pooled effect was significantly lower than those comparing chatbots to information and passive controls (P=.02). The detailed results of subgroup analysis are presented in Table S4 in .
Figure 2. Forest plot for the effects of chatbots on overall mental distress. [,,,,,-,-,,-]
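The pooling and "leave-one-out" steps described for overall mental distress can be sketched in a few lines. The sketch below assumes a DerSimonian-Laird random-effects estimator, which this section does not confirm as the method used, and the function names and input values are hypothetical placeholders rather than data from the included studies.

```python
import math


def pool_random_effects(effects, variances):
    """Pool study-level SMDs with a DerSimonian-Laird random-effects model.

    Assumption: the review's exact estimator is not stated in this section;
    DerSimonian-Laird is used here only as a common default.
    Returns (pooled SMD, lower 95% bound, upper 95% bound).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]              # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se


def leave_one_out(effects, variances):
    """Re-pool the estimate k times, omitting one study each time."""
    pooled_values = []
    for i in range(len(effects)):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_variances = variances[:i] + variances[i + 1:]
        pooled_values.append(pool_random_effects(rest_effects, rest_variances)[0])
    return pooled_values


# Hypothetical SMDs and sampling variances, for illustration only
example_effects = [-0.50, -0.30, -0.20, -0.40]
example_variances = [0.04, 0.05, 0.03, 0.06]
print(pool_random_effects(example_effects, example_variances))
print(leave_one_out(example_effects, example_variances))
```

The spread between the smallest and largest leave-one-out value is the kind of range the text reports when describing the robustness of the pooled effect.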
Depression
The pooled effect size for the 17 postintervention comparisons between chatbots and various control conditions on depression was SMD −0.43 (95% CI −0.62 to −0.23; P<.001), with high heterogeneity (P<.001; I²=81%) (Figure S1 in ). The sensitivity analysis demonstrated the robustness of the findings, with estimated effect sizes ranging from −0.34 to −0.47 (Figure S11 in ). The results of the funnel plot and Egger test revealed potential publication bias (P=.02), while no additional studies were imputed with the Trim-and-Fill approach and the adjusted effect size (SMD −0.44, 95% CI −0.66 to −0.21) was identical to the observed value, suggesting a negligible impact on the conclusions. Subgroup analyses revealed a significant difference between dialog system methods (P=.03). Specifically, retrieval-based chatbots demonstrated the strongest and most reliable effect, followed by rule-based chatbots with a smaller but significant effect (P<.001). Generative chatbots, while showing a potentially large effect, exhibited a wide CI and failed to reach statistical significance (Table S4 in ).
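For readers less familiar with the heterogeneity statistics reported throughout these results, I² is conventionally derived from Cochran's Q. The expression below is the standard textbook definition rather than a formula stated in this review, and the symbols k (number of comparisons), w_i (inverse-variance weight of comparison i), \hat{\theta}_i (its effect estimate), and \hat{\theta} (the fixed-effect pooled estimate) are introduced here for illustration.

```latex
Q = \sum_{i=1}^{k} w_i \left(\hat{\theta}_i - \hat{\theta}\right)^2,
\qquad
I^2 = \max\!\left(0,\ \frac{Q - (k - 1)}{Q}\right) \times 100\%
```

Under this definition, and assuming k=17 so that k−1=16, an I² of 81% would correspond to a Q of roughly 16/0.19, or about 84. This is a back-of-envelope illustration, not a value reported by the authors.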
Anxiety
A total of 18 studies were included for the effects on anxiety [,,,,-,-,,,-]. Compared to the control groups, participants interacting with chatbots exhibited a significantly greater reduction in anxiety, with an effect size of SMD −0.37 (95% CI −0.58 to −0.17; P<.001) (Figure S2 in ). The heterogeneity was considerably high across included trials (P<.001; I²=87%). The sensitivity analysis revealed a stable pooled effect size ranging from −0.35 to −0.41, which remained statistically significant when an influential study was excluded [] (Figure S11 in ). There was no significant publication bias, as supported by the funnel plot and Egger test (P=.18). The subgroup analyses highlighted significant differences in chatbot effectiveness between deployment formats (P=.05). Specifically, standalone chatbots produced larger between-group effects on anxiety than those delivered via instant messenger or website (Table S4 in ).
Positive Affect
No statistically significant effect of chatbot interventions on positive affect was observed compared to controls (SMD 0.03, 95% CI −0.15 to 0.21; P=.73), with substantial heterogeneity across 11 studies (P=.002; I²=63%) (Figure S3 in ). The pooled effect sizes remained relatively stable, with confidence intervals consistently crossing the null value after sequentially omitting each study (Figure S11 in ). The funnel plot showed a symmetrical pattern with data points scattered evenly around the pooled effect size, suggesting the absence of marked small-study effects, which was further confirmed by the Egger test (P=.55).
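The Egger test referred to in these results checks funnel plot asymmetry by regressing each study's standardized effect (effect divided by its standard error) on its precision (the reciprocal of the standard error); an intercept far from zero suggests small-study effects. A minimal sketch follows, assuming ordinary least squares and hypothetical inputs. The function name is illustrative and the software actually used for the review is not stated in this section.

```python
import math


def egger_regression(effects, std_errors):
    """Egger regression: standardized effect on precision via OLS.

    Returns (intercept, standard error of intercept, t statistic).
    A P value would come from a t distribution with k - 2 degrees
    of freedom, which is left out to keep the sketch dependency-free.
    """
    n = len(effects)
    if n < 3:
        raise ValueError("Egger regression needs at least 3 studies")
    x = [1.0 / se for se in std_errors]                  # precision
    y = [e / se for e, se in zip(effects, std_errors)]   # standardized effect
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)                                   # residual variance
    se_intercept = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, se_intercept, t_stat


# Hypothetical data in which smaller studies show more negative effects
example_effects = [-0.20, -0.35, -0.38, -0.55]
example_std_errors = [0.10, 0.20, 0.30, 0.40]
print(egger_regression(example_effects, example_std_errors))
```

In this hypothetical data set the less precise studies report the more negative effects, so the intercept comes out below zero, which is the asymmetry pattern the test is designed to flag.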
Negative Affect
A small but statistically significant decrease in negative affect among participants who used chatbots compared to controls (SMD −0.27, 95% CI −0.53 to −0.01; P=.04) was observed across 11 studies (Figure S4 in ). All effect sizes estimated in the sensitivity analysis fell within the 95% CI, ranging from −0.26 to −0.31 (Figure S11 in ). The heterogeneity significantly decreased from an I² value of 83% (P<.001) to 0% (P=.84) when we excluded the study by Romanovskyi et al [], though the overall effect remained significant. The funnel plot was visually symmetrical, and the Egger test for small-study effects did not detect significant publication bias (P=.39).
Stress
Participants engaging with chatbots demonstrated a significantly greater reduction in stress compared to various control conditions, with a moderate effect size (SMD −0.41, 95% CI −0.50 to −0.31; P<.001) (Figure S5 in ). No heterogeneity (I²=0%; P=.54) was observed across 6 included studies, indicating that the effects of chatbots on stress were consistent and generalizable across studies with differing characteristics. The sensitivity analysis further confirmed the robustness of the findings, with estimated effect sizes ranging from −0.40 to −0.56 (Figure S11 in ). Specifically, when we excluded the study by Haug et al [], a slightly larger effect size estimate (SMD −0.56, 95% CI −0.76 to −0.36) was observed. This deviation may be attributed to the inappropriate use of a single-item measure for stress symptoms and a considerably larger sample size compared to other trials. Nevertheless, the overall effect remained statistically significant even when the influential study was excluded.
Psychosomatic Symptoms
Five studies assessed the effect of chatbot interventions on psychosomatic symptoms, showing a significantly larger reduction in various symptoms compared to control groups (SMD −0.48, 95% CI −0.82 to −0.14; P=.006) (Figure S6 in ). The sensitivity analysis indicated the robustness of the findings, with estimated effect sizes ranging from −0.36 to −0.49 (Figure S11 in ). The heterogeneity among included studies was considerable (P=.002; I²=76%), but significantly decreased (P=.20; I²=35%) after we excluded the study by Sabour et al [], while the overall effect retained its direction and significance. Subgroup analyses revealed three significant moderators. Specifically, studies that targeted clinical samples showed a greater decrease in psychosomatic symptoms than those focusing on subclinical and nonclinical samples (P=.008). Chatbots deployed as standalone apps yielded significantly greater effects than web-based platforms (P=.002). Additionally, retrieval-based systems showed the largest effects, outperforming both generative and rule-based chatbots (P=.001) (Table S4 in ). However, these results should be interpreted with caution due to the limited number of studies available for each subgroup.
Self-Ambivalence and Appearance Distress
Four distinct measures targeting negative self-relevant thoughts and body image were included to evaluate the influence of various interventions on self-ambivalence and appearance distress. A significant positive effect favoring chatbots was observed compared to passive control groups (SMD −0.25, 95% CI −0.34 to −0.17; P<.001), with moderate heterogeneity across studies (P=.19; I²=38%) (Figure S7 in ). The pooled estimates remained statistically significant, with the overall effect size ranging from −0.20 to −0.31 and within comparable confidence intervals (Figure S11 in ).
Life Satisfaction and Well-Being
Ten relevant outcomes from 7 separate trials were meta-analyzed for overall life satisfaction and well-being. A significantly greater improvement was observed for participants in the chatbot groups than for those in the control groups (SMD 0.12, 95% CI 0.03-0.21; P=.01), with moderate heterogeneity detected across 7 trials (P=.06; I²=44%) (Figure S8 in ). The sensitivity analysis suggested the robustness of the findings, with the overall effect sizes ranging from 0.07 to 0.13 (Figure S11 in ). However, when we excluded two influential studies [,], the 95% CI crossed the null value, while the direction remained the same. The absence of publication bias was evidenced by the funnel plot and Egger test (P=.76). Subgroup analyses revealed a significant difference in effects between dialog systems (P=.04) (Table S4 in ). Moreover, meta-regression analysis revealed a statistically significant effect of gender (P=.02) on the pooled effect size (Figure S12 in ).
Self-Efficacy
Six trials were included in the meta-analysis to evaluate the pooled effect of chatbot interventions on self-efficacy outcomes, yielding a positive trend favoring the experimental group that did not reach statistical significance (SMD 0.14, 95% CI −0.14 to 0.41; P=.33) (Figure S9 in ). Considerably high heterogeneity was observed across the included studies (P<.01; I²=86%), which may be attributed to differences in specific measurement targets, encompassing general self-efficacy, self-efficacy in addressing body image concerns, and confidence in self-management for health and well-being. The results of the sensitivity analysis showed that the overall effect remained stable, with SMD estimates ranging from 0.10 to 0.26, and the pooled effect remained statistically nonsignificant when individual studies were excluded (Figure S11 in ).
Health Behavior Change
Nine health behavior outcomes from 6 separate trials were included in the meta-analysis, revealing a statistically significant effect in favor of chatbot interventions (SMD 0.11, 95% CI 0.03-0.19; P=.006) (). Moderate heterogeneity was observed among studies (P=.06; I²=46%), potentially attributable to the wide spectrum of health behaviors we targeted. Sensitivity analyses demonstrated the robustness of this result, with estimates ranging from 0.09 to 0.14 (Figure S12 in ). Notably, the omission of 2 specific outcomes [,] resulted in a slight increase in the combined effect size and significantly decreased the heterogeneity. The symmetric funnel plot and Egger test (P=.43) indicated a low likelihood of publication bias. Studies with active controls produced smaller between-group effects than those with a passive control group (P=.02). Additionally, chatbots that sent check-in reminders produced more positive effects on changing behaviors than those that did not (P=.02) (Table S4 in ).
Figure 3. Forest plot for the effects of chatbots on health behavior change. [,,,,,]
Quality and Risk of Bias
The interrater reliability, as measured by Cohen kappa, ranged from 0.471 to 0.523 across 5 domains of the Cochrane ROB 2 tool, indicating moderate agreement between the raters. For any discrepancies identified between raters, discussions were held to achieve consensus; if consensus could not be reached, a third reviewer was consulted to make the final decision. The overall risk of bias was rated as high for 25 studies (Figure S13 in ). The majority of studies (26/31) demonstrated appropriate randomization procedures and were rated as low risk in the domain of randomization process. However, 5 studies raised concerns due to insufficient reporting on the random allocation approach or observed imbalances in baseline characteristics between groups. In the domain of deviations from the intended interventions, no studies exhibited significant deviations, though neither participants nor those delivering the interventions could be blinded due to the nature of the intervention. Nineteen studies adhered to the ITT principle. However, 8 studies were judged to raise some concerns in this domain due to the absence of appropriate analyses to estimate the effect of assignment to the intervention. Additionally, 7 studies were rated as high risk because a substantial proportion of participants were excluded from the analyses, which could have significantly impacted the validity of the results. Twelve studies were judged to have a low risk in the domain of missing outcome data, while 14 were rated as high risk due to imbalanced drop-out rates between groups and lack of evidence that appropriate methods were used to address the potential bias introduced by high attrition. 
The primary reason for the notable source of bias arising from the measurement of the outcome was the reliance on self-reported outcomes as the preferred method in most trials, where 16 studies were rated as high risk because self-reported measures are inherently prone to biases, and the strong level of belief in the beneficial effects of the intervention could influence outcome assessments. In the selection of the reported result domain, 12 studies raised some concerns due to the unavailability of their protocols or trial registrations, or minor discrepancies between the planned and reported outcome measurements. Furthermore, 2 studies were judged to have a high risk as their reported results were likely selected from multiple eligible measures or analyses, raising concerns about selective reporting. The quality of evidence, evaluated using the GRADE approach, was rated as very low to low, possibly due to the overall high risk of bias or substantial heterogeneity across the majority of studies (Table S5 in ).
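The interrater agreement statistic reported at the start of this subsection, Cohen kappa, compares the observed proportion of agreement between two raters with the agreement expected by chance from each rater's marginal frequencies. A minimal sketch follows; the function name and the example ratings are hypothetical and are not the reviewers' actual risk-of-bias judgments.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the chance agreement implied by each rater's marginals.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n) for label in labels)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical judgments for one risk-of-bias domain across four studies
rater_1 = ["low", "low", "high", "high"]
rater_2 = ["low", "high", "high", "high"]
print(cohen_kappa(rater_1, rater_2))
```

Values between roughly 0.41 and 0.60 are conventionally described as moderate agreement, which matches the interpretation given in the text for the reported range.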
Discussion
Principal Findings
In this systematic review and meta-analysis, we synthesized evidence on the effectiveness of chatbots for adolescents and young adults and found overall significant positive effects in alleviating mental distress and promoting health behavior change. The most pronounced effects were observed in studies that compared chatbot interventions to information controls, used standalone mobile apps for deployment, used generative or retrieval-based chatbots, or targeted individuals in subclinical and clinical groups. Additionally, chatbots with reminders that encourage users to engage in interactions were more effective in promoting behavior change. Moreover, user engagement was a significant moderator influencing chatbot effectiveness, while repetitiveness and inflexibility of content emerged as the most common barriers to sustained chatbot adherence. Despite the proposed advantages of chatbots as accessible, cost-effective treatment alternatives, none of the studies included in this review conducted cost-effectiveness analyses or focused on low-resource settings.
Across the included studies, chatbots consistently demonstrated small-to-moderate effects in reducing symptoms of depression, anxiety, negative affect, stress, and psychosomatic problems among adolescents and young adults. These findings reinforce prior evidence, underscoring the promise of chatbots as scalable and accessible tools to address specific mental health challenges in this population []. Notably, retrieval-based chatbots demonstrated a consistent moderate effect in reducing depressive and psychosomatic symptoms, suggesting that the structured and evidence-based design may offer a more reliable and effective approach to delivering mental health support. In contrast, the comparatively modest effects observed with rule-based chatbots may stem from their inherent limitations in flexibility and reliance on predefined scripts. While rule-based systems can be effective in specific scenarios, their rigid architecture often restricts their ability to adapt to the diverse and dynamic needs of individuals with mental health problems. Generative chatbots, despite showing the strongest effects for overall mental distress, did not demonstrate consistent effects for specific mental health problems, which may be attributed to the limited available evidence. This uncertainty highlights the need for further research to better understand the potential and the limitations of generative chatbots applied in this context. Additionally, our analysis indicated that chatbots were more effective for psychosomatic symptoms in clinical populations compared to nonclinical groups, which aligns with the notable trend across studies that individuals with more severe baseline symptoms tended to derive greater benefits from interventions [,]. Moreover, the larger effect size observed for standalone chatbots in alleviating anxiety, compared to web-based ones, indicates that the deployment format may play a crucial role in influencing the effectiveness of chatbots. 
This may be attributed to the personalized and engaging design of the independent system, allowing for a more focused therapeutic engagement with less interruption, as opposed to chatbots integrated into instant messenger apps or websites that may cause more distractions. In addition, our review is among the first to provide evidence supporting the effectiveness of chatbots in reducing self-ambivalence and appearance distress. While the effect size was modest, this finding is particularly significant for adolescents and young adults, who frequently grapple with issues related to identity, self-esteem, and body image. This highlights the potential of chatbots to address sensitive and deeply personal concerns that individuals may find difficult or shameful to discuss with human professionals. The ability of chatbots to offer a nonjudgmental and accessible platform for support is crucial in this context. However, it is important to note that this synthesized result was derived from four different measures, so further research with subgroup analyses is needed to provide deeper insights into the specific contexts and conditions under which chatbots are most effective.
A significant but small effect was observed for life satisfaction and well-being, while no statistically significant improvement was noted for positive affect and self-efficacy. These findings align with the result of a previous review [], which reported limited impacts of conversational agents on fostering positive psychological well-being. This phenomenon may reflect a ceiling effect in certain populations or could be attributed to the primary focus of most therapeutic strategies, which tend to prioritize addressing mental health problems over promoting well-being, resilience, and recovery. This underscores the need for future chatbot designs that incorporate elements based on positive psychology skills, such as acknowledgment of positive events, personal strengths, and gratitude exercises. Moreover, such positive states may require longer-term or more intensive therapeutic sessions to yield measurable improvements. However, the follow-up data available for these outcomes were insufficient to validate these assumptions. Furthermore, our findings revealed that studies with a higher proportion of women reported greater improvements in overall well-being. This draws new attention to the possibility that the effectiveness of chatbots may be influenced by gender-related factors, such as differences in communication styles or help-seeking behaviors, with women potentially being more inclined to seek support for mental health issues or to engage in emotional disclosure that may align more closely with the empathetic design of many chatbots []. However, it is notable that no study in our review explicitly examined gender differences in user engagement or interaction patterns with chatbots. Two studies [,] used Linguistic Inquiry and Word Count (LIWC) to analyze participants’ response transcripts. While indicating a potential relationship between word use frequency and mental well-being, these studies did not identify gender-based differences in expression characteristics. 
Further research is warranted to explore whether women exhibit stronger adherence to chatbots, or different interaction styles (ie, use of reflective language), and whether these factors serve as mechanisms for boosting therapeutic outcomes.
The effectiveness of chatbots in health behavior change, though significant, remains relatively small, which aligns with a previous review []. Several factors may account for this observation. First, the limited statistical power resulting from the small number of trials (n=5) included may have constrained the ability to detect larger effects. The use of chatbots to encourage physical activity and healthy lifestyles among adolescents and young adults is markedly underreported, leaving considerable scope for further research to evaluate their impact on promoting sustained behavior change. Second, the reliance on self-reported measures introduces inherent biases and inaccuracies, which may compromise the validity of the observed findings. To address this issue, incorporating objective data collection methods, such as wearable devices or biological markers, could enhance the precision and reliability of outcome measurements and provide more robust evidence for behavior change. Third, differences in the theoretical underpinnings used across studies to drive behavioral change could have elicited diverse responses to chatbot interventions. However, due to the small number of original studies included, we are unable to further disentangle these nuanced effects on specific types of health behaviors. Moreover, our analysis revealed that studies using active controls reported smaller effects for chatbots compared to those using passive controls. This suggests that while chatbots may offer unique advantages, their incremental value may be less pronounced when benchmarked against well-established interventions. Future studies should determine whether chatbot interventions yield greater benefits when integrated as complementary tools rather than used as standalone interventions. In addition, regular check-in reminders from chatbots may serve as effective cues to action, reinforcing user engagement and adherence to desired behaviors. 
Further research is warranted to explore the extent to which the frequency and timing of reminders impact their efficacy.
The diversity in chatbot evaluation methods suggests a critical gap and calls for exploratory research to develop professionally validated instruments for assessing chatbot accuracy, safety, and user experience. The notable attrition rates observed in both groups, coupled with unsatisfactory completion of chatbot sessions, underscore the pressing need to optimize future research design to enhance user engagement and facilitate a more positive experience. To this end, it is imperative to involve adolescents and young adult participants in the chatbot design process, such as surveys, interviews, and user testing, ensuring that the intervention features align with their preferences, expectations, and behavioral patterns []. Additionally, optimizing the chatbot’s performance and designing a clear, user-friendly conversational interface are crucial to ensuring a satisfying user experience that promotes sustained engagement. Moreover, generative AI systems present significant opportunities in this regard, with the potential to achieve more flexibility, deeper contextual understanding, and superior response quality, which have demonstrated remarkable user engagement globally []. Notably, generative AI chatbots can respond adaptively to unexpected user inputs, even those not previously encountered, and avoid repetitive responses to varied queries, fostering more human-like dialogs that enhance users’ sense of being understood and empathized with. Despite these advancements, the application of chatbots in the domains of psychological and physical health remains cautious. Most therapeutic chatbots currently rely on rule-based or retrieval-based designs. This limitation is primarily due to concerns about the insecurity, potential biases, and “hallucination” of AI-generated content when addressing sensitive issues, which could lead to unintended negative consequences []. 
The “black box” nature of deep learning algorithms makes it difficult to predict conversational trajectories in advance []. Retrieval-augmented generation (RAG) offers a promising solution by connecting generative models with real-time information retrieval from external knowledge bases. This approach facilitates secure incorporation of up-to-date information and sensitive data while reducing the likelihood of hallucination and improving accuracy through context grounding []. Graph-based RAG (GraphRAG) demonstrates significant potential for extracting holistic insights from lengthy documents by structuring RAG data into graphs. This enhances the capabilities of large language models to produce evidence-based medical responses, thereby increasing safety and reliability when managing private medical data []. Given the unique risks faced by adolescents and young adults, such as disclosure of self-harm intent to chatbots or the reinforcement of harmful thought patterns by algorithms, it is crucial that research efforts prioritize the establishment of clear safety protocols and robust evaluation frameworks to ensure the ethical and responsible deployment of these systems [].
Limitations
While our findings break new ground in exploring the influence of chatbot dynamics on holistic psychosocial well-being, specifically within adolescents and young adult populations, the conclusions are somewhat constrained by several limitations. First, the inclusion of studies with populations that were not exclusively adolescents and young adults but had a mean age within an eligible age range, though necessary to ensure comprehensive coverage of relevant evidence, may have introduced potential variability in contextual factors that may compromise the findings. Second, although the incorporation of diverse participant demographics enhances the ecological validity of the results, the lack of strict clinical thresholds for mental distress at baseline in some studies may dilute the observed intervention effects for clinically significant cohorts. Third, while examining a broad array of outcomes provides valuable insights into the potential of chatbots in health care, the variation in measurement instruments across studies for the same outcomes, as well as the combination of different health behaviors into a single aggregated outcome, may introduce substantial heterogeneity and obscure important distinctions between specific behaviors. Furthermore, due to the limited number of studies with follow-up data on the same outcomes and the wide variability in follow-up durations, it was not feasible to conduct a meta-analysis assessing sustained impacts. Crucially, the majority of included studies were assessed as having a high risk of bias, which may result in misestimation of effect sizes. Consequently, the certainty of evidence for most outcomes was rated as very low to low, substantially restricting both the generalizability and reliability of the observed effects. 
Moreover, while the adjusted effect sizes for overall mental distress and depressive outcomes appear robust to publication bias, the potential for unpublished negative or inconclusive studies suggests that the true effect of AI chatbots may be smaller than reported. Therefore, the conclusions drawn from this review should be interpreted with considerable caution. Finally, despite the rapid proliferation of generative AI, this review underscores a critical gap in empirical research evaluating its specific impacts among adolescents and young adults, which also hindered our ability to provide evidence on how the specific mechanisms of generative models affect therapeutic outcomes. The clinical effectiveness of generative AI chatbots in mental and behavioral health remains unknown. Future studies should implement large-scale, long-term interventions with rigorous designs to fully understand the benefits of chatbots integrated with generative systems.
Conclusions
This study provides evidence supporting the overall effectiveness of chatbots in alleviating mental distress and promoting positive health behaviors among adolescents and young adults. The effectiveness of chatbots varied across different target samples and control conditions, and three key design features were identified as significant moderators of chatbot efficacy: dialog system methods, deployment format, and the use of reminders. Among the dialog systems, retrieval-based chatbots demonstrated the most consistent and reliable effects, while generative AI chatbots showed potential but exhibited variability in their effectiveness. Given the growing use of generative AI, it is crucial to establish robust safety protocols and evaluation frameworks before their implementation in real-world settings. Future research should focus on validating the long-term effects and consistency of generative AI chatbots while exploring their broader applications in mental health and behavioral interventions for adolescents and young adults.
The authors would like to thank Shaowei Guan and John Law for their expert insights and guidance on the identification of key chatbot design features.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. The research was conducted in the JC STEM Lab of Digital Oncology Care Enhancement (DOCE) funded by The Hong Kong Jockey Club Charities Trust.
The datasets analyzed during this study are available from the corresponding author on reasonable request.
Edited by Amy Schwartz; submitted 30.Jun.2025; peer-reviewed by Kimberly Kaphingst, Kittisak Jermsittiparsert; final revised version received 24.Sep.2025; accepted 16.Oct.2025; published 26.Nov.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
The global proliferation of digital health technologies (DHTs), ranging from telemedicine to artificial intelligence (AI)-driven diagnostics, has reshaped health care delivery []. These innovations offer significant potential to address global health system challenges by improving service coverage, health care efficiency, and the quality of health care practices and services [,]. Within this global context, China has actively promoted DHT adoption through its “Healthy China 2030” initiative, which specifically aims to develop interoperable health data platforms, facilitate cross-sector medical collaboration, and reduce urban-rural health care disparities []. However, despite these advancements, the adoption and usage of DHTs among physicians remain uneven, influenced by a complex interplay of factors []. At the organizational level, existing research has established that institutional support systems (eg, training and technical assistance) and conducive regulatory environments are critical contextual facilitators of DHT adoption []. At the individual level, growing evidence suggests that cognitive factors, such as perceived usefulness and ease of use, self-efficacy in using DHTs, and deeply held mental models about clinical workflows, may be even more pivotal in shaping physicians’ decisions. Nevertheless, the field lacks robust evidence to explain how these cognitive mechanisms account for the substantial variations observed in physicians’ DHT adoption patterns, particularly across different clinical contexts and implementation stages [,]. These variations appear to originate from both methodological differences in how studies measure technology acceptance and unaddressed heterogeneity among physician populations, particularly across different medical specialties and practice settings. This study addresses this gap by applying latent profile analysis (LPA) to identify distinct subgroups of physicians based on their personal evaluations of DHT adoption.
Given the central role of physicians in the digital transformation of health care, understanding their perspectives is essential for ensuring the successful implementation and widespread adoption of these technologies.
DHT Adoption Landscape
The term “digital health,” which evolved from “eHealth,” refers to the application of information and communication technologies to support health care and health-related fields. More recently, “digital health” has been introduced as a broader concept encompassing eHealth (including mobile health) and emerging fields such as the application of advanced computing sciences in data, genomics, and AI []. The adoption of DHT services to support patient care has grown significantly in health care institutions worldwide. Driven by the increasing prevalence of mobile phones and the widespread availability of preventive health and fitness applications, DHT and eHealth are playing an increasingly important role in enhancing medical workflows []. However, while digital health solutions are increasingly popular with the public, implementation faces hurdles in clinical settings. A central challenge is the lack of systematic frameworks to rigorously evaluate both benefits and risks. This evaluation gap contributes to professional hesitancy among health care providers and institutions, limiting user engagement and contributing to differences in technology uptake across care settings []. Recent literature confirms that DHT adoption rates exhibit significant variation across different service types, clinical specialties, and patient subgroups []. Moreover, the underusage of DHT poses considerable difficulties for modern health care systems. Hospitals experience decreased operational efficiency, reduced care quality, and financial strain due to factors such as patient attrition and restricted insurance reimbursements []. In turn, patients’ limited access to DHT may lead to suboptimal care, including extended waiting times, which further widens existing health disparities []. Therefore, effectively addressing these DHT adoption challenges is essential for promoting sustainable, equitable, and patient-centered health care delivery in the future.
Determinants of Uneven DHT Adoption
The heterogeneous adoption patterns of DHTs stem from a dynamic interaction between enabling factors and systemic barriers. When DHTs demonstrate measurable clinical effectiveness, health care providers are more likely to recognize their potential for enhancing work efficiency and patient outcomes, thereby developing favorable attitudes toward technology adoption. This positive perception creates a virtuous cycle that may ultimately improve clinical performance []. Conversely, inadequate integration of DHTs with existing clinical workflows often generates resistance among health care professionals, potentially undermining implementation efforts [].
Current evidence frames DHT adoption through a tripartite model integrating: (1) individual factors (eg, perceived utility vs digital literacy gaps); (2) organizational and environmental factors (eg, supportive policies vs financial constraints); and (3) technological factors (eg, interoperability vs security risks) []. Among physicians, adoption barriers are particularly multifaceted, spanning cognitive (eg, technophobia), attitudinal (eg, skepticism toward clinical efficacy), and experiential domains (eg, limited previous exposure). Resistance often stems from perceived workflow disruptions, eroded patient-provider dynamics, or mismatches between technology design and clinical needs. Conversely, demonstrable efficiency gains, user-friendly interfaces, and alignment with professional norms foster acceptance. Critically, adoption patterns reflect an interplay of these dimensions; for instance, even robust technology may fail if organizational support (eg, training) is lacking [,]. Tailored strategies addressing domain-specific barriers (eg, pilot programs for technophobic clinicians and interoperable tools for fragmented systems) are essential to bridge gaps between policy goals and real-world implementation [].
The Unified Theory of Acceptance and Use of Technology 2 (UTAUT 2) has been effectively applied across international contexts, including Germany and the United States, to examine DHT adoption. Studies based on this framework, which often incorporate constructs such as perceived security and relative advantage and use age-stratified sampling, consistently identify performance expectancy and hedonic motivation as key drivers of usage intention. These studies also highlight security concerns as a major barrier []. Further research on German mobile health apps revealed the predominant influence of hedonic motivation over utilitarian factors, with contextual variations observed between lifestyle and therapeutic apps []. Collectively, these findings underscore the adaptability of UTAUT 2 across diverse health care technologies and cultural settings, particularly when incorporating domain-specific variables. However, research based on UTAUT 2 remains largely confined to conventional methods such as subgroup analyses and clustering approaches, which rely on variable-centered techniques such as moderation analysis or predefined demographic comparisons. These methodological constraints may limit the ability to capture clinically meaningful, person-oriented adoption profiles []. Realizing the full generalizability of DHT adoption models requires not only careful consideration of user and provider heterogeneity, along with further validation across diverse populations, but also the adoption of more nuanced, person-centered analytical frameworks. A comprehensive understanding of physicians’ adoption behaviors demands a multidimensional perspective that simultaneously assesses perceptions of utility, risks, barriers, and usage intentions, ultimately moving beyond structural models toward person-centered approaches.
Despite physicians’ pivotal role as clinical decision-makers and primary end users of DHTs, current research predominantly centers on citizen [] and patient perspectives [,], or on technical feasibility [], leaving a significant gap regarding health care professionals’ perceptions and experiences. Few studies have specifically targeted the evaluation of the creation, implementation, long-term use, and self-reported barriers and facilitators to DHT use by health care professionals []. Moreover, the majority of existing studies, including those using established theoretical frameworks such as the technology acceptance model [] and the UTAUT model [], rely predominantly on variable-centered approaches. These approaches focus on the relationship between DHT or eHealth service implementation and various factors across the overall sample. From this perspective, most previous studies—including those using UTAUT 2—focus on aggregate relationships and isolated moderators, thereby overlooking systematic heterogeneity within physician populations. Such constraints ultimately diminish their capacity to explain actual usage patterns within complex health care environments. More critically, such variable-centered methods inherently assume population homogeneity and thus obscure meaningful heterogeneity across distinct user subgroups, leading to an inadequate characterization of clinically relevant adoption patterns and context-specific barriers. This gap is especially pronounced in the Chinese context, where rapid, policy-driven digital health transformation may have generated unique adoption profiles not captured by conventional approaches.
Study Rationale and Objectives
To address these limitations, this study introduces LPA as a novel, person-centered methodological framework for investigating physician adoption of DHTs. LPA is a probabilistic modeling technique that identifies naturally occurring subgroups within multidimensional data based on shared response patterns []. This method is particularly valuable for capturing heterogeneity and identifying nuanced profiles of technology acceptance that remain concealed in variable-level analyses [,]. In contrast to previous variable-centered studies, LPA enables (1) the identification of clinically meaningful subgroups characterized by distinct configurations of perceptions across benefits, barriers, and behavioral intentions; (2) the examination of multilevel predictors of subgroup membership; and (3) the development of tailored implementation strategies aligned with the specific needs of different physician populations. Given physicians’ pivotal role in health care’s digital transformation, these insights are critical for developing targeted interventions that move beyond one-size-fits-all adoption strategies to account for the nuanced needs and perceptions of different clinician subgroups [].
Therefore, this study is designed to achieve 2 key objectives. First, it aims to classify Chinese physicians’ DHT preferences using LPA to identify heterogeneous subgroups based on a 3D evaluation framework. Second, it seeks to investigate how demographic and occupational factors correlate with profile membership. By transcending aggregate-level insights, this approach offers a more nuanced and clinically relevant understanding of DHT adoption behaviors. As DHTs become increasingly prevalent, the findings are poised to inform tailored interventions that address implementation barriers, especially among hesitant health care professionals. Furthermore, this research provides actionable recommendations for policymakers, health authorities, medical institutions, and insurers to support the design of context-sensitive DHT adoption strategies that enhance physician engagement and ultimately improve health care delivery.
Methods
Study Design and Data Sources
With the approval of the Shaanxi Provincial Health Commission and authorization from the Xi’an Municipal Health Commission, we undertook a cross-sectional investigation across health care facilities in Xi’an, Shaanxi Province, China. This investigation, conducted from October 18 to December 23, 2023, was a crucial part of the “2023 Healthcare Worker Survey” and the broader 7th Xi’an Health Services Survey. The survey aimed to evaluate medical staff’s practice status, working conditions, and health to inform local health policy and management. It has also been used in previous studies on health care professionals’ well-being and occupational challenges []. This study used a cross-sectional survey design, conducted in accordance with the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) guidelines ( []).
We used random cluster sampling to select 46 hospitals (26 Level-II and 20 Level-III) from municipal and county-level medical institutions in Xi’an. Eligible participants included licensed physicians (including therapists and clinical practitioners) with full-time or contractual employment status in either public or private hospitals. To ensure sample homogeneity and mitigate potential selection bias, we restricted our sample to physicians affiliated with institutions that had formally implemented DHT programs. This inclusion criterion accounted for self-selection bias, given that physicians who had adopted DHT voluntarily before institutional rollout might have exhibited systematically more favorable attitudes toward DHT than the broader physician population (detailed information on the data resources is provided in ).
To ensure data quality, we conducted a pilot test with 814 health care workers (achieving 93.5% compliance) and trained liaison officers from 33 city-level hospitals and 9 county-level government departments on survey protocols, quality control, and tool usage. We implemented a range of data quality control measures, including consistency checks (eg, control questions 12 and 55), logic verification (eg, years of service), outlier detection (eg, age range), and completion time analysis (requiring >3 minutes for >90% completion). From an initial 8617 responses, 3766 were excluded due to incomplete data (n=283), invalid entries (n=97), excessively short completion times (n=46), or employment at institutions where the relevant DHT was not implemented or its status was unknown (n=3431). The remaining 4851 responses were included in the final analysis (detailed Missing Completely At Random test results are provided in Section 2, ).
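The sample flow described above can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text; the dictionary keys are shorthand labels introduced here, not the authors' variable names. Note that the per-reason counts add up to more than the stated total of exclusions. One possible reading, which the text does not confirm, is that some responses met more than one exclusion criterion.

```python
# Counts quoted in the paragraph above.
initial_responses = 8617
total_excluded = 3766

# Per-reason exclusion counts as reported (keys are shorthand labels, an assumption).
exclusion_reasons = {
    "incomplete_data": 283,
    "invalid_entries": 97,
    "short_completion_time": 46,
    "dht_not_implemented_or_unknown": 3431,
}

# Final analytic sample implied by the stated totals.
final_sample = initial_responses - total_excluded

# Sum of the individual reasons, for comparison with the stated total.
reason_total = sum(exclusion_reasons.values())

print(f"final analytic sample: {final_sample}")
print(f"sum of per-reason counts: {reason_total}")
print(f"per-reason sum minus stated total: {reason_total - total_excluded}")
```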
Demographic and Occupational Characteristics of Participants
Drawing on previous literature regarding barriers and facilitators of DHT adoption, which highlights the association between certain sociodemographic and occupational characteristics (eg, age, gender, professional title, and years of experience) [] and DHT adoption, we included similar indicators in our analysis to examine their association with profile membership. Specifically, the sociodemographic and occupational factors assessed in this study comprised: (1) sociodemographic factors such as gender, age, educational attainment (Bachelor’s, Master’s, or PhD), annual income level (stratified by tertiles), and self-rated health status (5-point Likert scale: 1=very poor to 5=excellent); and (2) occupational variables such as hospital grade (Level-II [secondary] vs Level-III [tertiary]), professional title (resident, attending, or chief physician), years of clinical experience, weekly working hours, and monthly night shift frequency, as well as psychosocial measures including work satisfaction (assessed using a 10-item scale), occupational stress (4-item scale), and doctor-patient relationship quality (3-item scale).
Doctor-Patient Relationship Quality Scale
Physicians’ perceptions of the doctor-patient relationship were measured using the DPRQ-3 (Doctor-Patient Relationship Questionnaire-3), a simple and easy-to-use questionnaire designed for assessing the doctor-patient relationship in medical settings, and served as the primary independent variable []. This 3-item scale includes questions such as: “How do you feel patients respect the doctor?”, “To what extent do you believe society respects the doctor profession?”, and “What do you think of the current doctor-patient relationship?”. Participants answered each item using a 5-point Likert scale (1=very disrespectful or very bad to 5=very respectful or very good). In this paper, the Cronbach α coefficient of this scale was 0.82.
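For readers unfamiliar with the reliability coefficient reported here, the following is a minimal sketch of how Cronbach α is computed from item-level responses, assuming the standard formula α = k/(k−1) × (1 − Σ item variances / variance of total scores). The response matrix is made-up toy data for a 3-item scale, not survey data, and the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    # Transpose so that each tuple holds all responses to one item.
    items = list(zip(*rows))
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(col) for col in items)
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical responses: 6 respondents x 3 items on a 1-5 scale.
toy_responses = [
    [4, 4, 3],
    [2, 3, 2],
    [5, 4, 4],
    [3, 3, 3],
    [1, 2, 2],
    [4, 5, 4],
]

alpha = cronbach_alpha(toy_responses)
print(f"Cronbach alpha (toy data): {alpha:.2f}")
```

Values closer to 1 indicate that the items vary together, which is the sense in which the reported coefficients describe internal consistency.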
Occupational Stress Scale
In this study, occupational stress is defined as the stressful aspects of clinical work encountered by physicians in their professional environment. The occupational stress scale was adapted from existing instruments to measure the psychological distress perceived by medical staff while performing their duties [,]. Participants responded to 4 items on a 6-point Likert scale ranging from 1 (strongly disagree) to 6 (strongly agree). These items included: “Overall, I feel great pressure at work,” “I feel a high level of tension at work,” “I’m having trouble sleeping because of work,” and “I’m nervous about going to work.” Selected items capture core dimensions of nursing stress (global pressure, tension, sleep disturbance, and work avoidance), aligning with Lazarus’s transactional stress model []. This scale is a validated tool that has been extensively used as a measure of job pressure and psychological distress in both medical staff and general occupational research, thus demonstrating its applicability to this study []. The total scores ranged from 4 to 24 and demonstrated high internal consistency (Cronbach α=0.94; composite reliability=0.88).
Work Satisfaction Scale
Work satisfaction was measured using a 10-item scale assessing several dimensions: overall job satisfaction, satisfaction with colleagues, expected income, leadership, working facilities, promotion prospects, internal management, welfare benefits, training opportunities, and opportunity for skill use []. Participants rated each item on a 6-point Likert scale ranging from 1 (very dissatisfied) to 6 (very satisfied), resulting in a total score from 10 to 60. The scale exhibited excellent internal consistency (Cronbach α=0.95). The full details of the scale are provided in Part B of .
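The total-score ranges of the three job-related scales follow directly from the number of items and the response format. The sketch below assumes simple summation of item scores, which matches the stated ranges of 4 to 24 and 10 to 60. The 3 to 15 range for the doctor-patient relationship scale is derived here from its 3 items and 5-point format and is not stated explicitly in the text. Function names are hypothetical.

```python
def total_score_range(n_items, low, high):
    """Smallest and largest possible sum score for a Likert scale."""
    return n_items * low, n_items * high


def sum_score(responses, low, high):
    """Sum item responses after checking each one lies within the response format."""
    for value in responses:
        if not low <= value <= high:
            raise ValueError(f"response {value} outside {low}-{high}")
    return sum(responses)


# (number of items, lowest option, highest option) as described in the text.
scales = {
    "doctor-patient relationship": (3, 1, 5),
    "occupational stress": (4, 1, 6),
    "work satisfaction": (10, 1, 6),
}

for name, spec in scales.items():
    print(name, total_score_range(*spec))

# Hypothetical respondent answering the 4 occupational stress items.
print(sum_score([5, 4, 3, 4], 1, 6))
```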
Digital Health Care Technology Adoption Scale
Current literature indicates that both the general public and health care professionals widely recognize the significant potential benefits and barriers associated with DHTs [,,] or eHealth services []. With the aim of thoroughly investigating practicing physicians’ perspectives and preferences related to the implementation of DHTs, we developed a 14-item DHT adoption scale comprising 3 dimensions, based on a comprehensive literature review [,]. The scale development process, including expert validation procedures and pilot testing protocols, is provided in detail in Section 2, . Specifically, the selection of the 3 core dimensions—Perceived Benefits, Adoption Barriers, and Behavioral Intention—was guided by established technology adoption theories, notably the technology acceptance model and the UTAUT theories, which posit that behavioral intention is determined by a trade-off between perceived benefits (eg, usefulness) and perceived costs or barriers (eg, ease of use and risks) []. Also, recognizing that personal preference does not always translate into actual use, we incorporated a third dimension, Behavioral Intention, to capture a more behavioral measure of overall adoption willingness. This tripartite structure allows for a more comprehensive assessment that spans attitudinal, perceptual, and behavioral aspects of adoption.
Within the Perceived Benefits domain, which consists of 8 items, 4 specific indicators were identified as the most frequently cited drivers of DHT adoption in systematic reviews and physician surveys. These indicators include (1) improved diagnostic and treatment quality, (2) enhanced patient trust and satisfaction, (3) error rate reduction, and (4) increased income (driven by improved diagnostic and treatment efficiency) [,]. From the physician’s perspective, these represent core utilitarian, relational, and practical incentives. Similarly, the Adoption Barriers domain contains 5 items, with 4 key indicators consistently highlighted in previous literature as the most prevalent and impactful obstacles. These indicators comprise (1) technical barriers, (2) cybersecurity risks, (3) workload increase, and (4) patient experience reduction [,], reflecting central concerns regarding feasibility, security, and clinical workflow. The third dimension, Behavioral Intention, was assessed using a single-item scale designed to measure overall willingness to adopt. This provides a pragmatic measure of behavioral outcomes, complementing the multidimensional perceptual factors. Taken together, this framework ensures the scale captures both the complexity of DHT adoption decisions and a concrete behavioral intention.
All items were rated on a 5-point Likert scale, with each indicator score standardized to a range of 1 to 5. Higher scores in the Perceived Benefits domain indicated that participants recognized greater potential benefits of DHTs, whereas lower scores in the Adoption Barriers domain suggested that participants perceived higher potential costs and risks associated with DHT implementation. Correspondingly, higher scores in the Behavioral Intention domain demonstrated increased likelihood of both initial adoption and sustained usage of DHTs. The scope of DHTs considered in this study and the specific items included in the DHT scale are provided in Part A of . This scale demonstrated high internal consistency, with a Cronbach α of 0.88. Detailed information regarding the validity of the scale is provided in Table S5 of .
Data Analysis
Descriptive statistics and bivariate correlations were analyzed using Stata 17 (StataCorp LLC). Mplus version 8.3 (Muthén & Muthén) software was used to conduct the LPA and identify the DHT subgroups based on 9 indicators (4 benefit indicators, 4 barrier indicators, and 1 behavioral intention indicator). We assessed model fit using a comprehensive set of indices [], including the Akaike information criterion (AIC), Bayesian information criterion (BIC), adjusted BIC (aBIC), entropy, the Lo-Mendell-Rubin likelihood ratio test, and the bootstrap likelihood ratio test (BLRT). Lower values of AIC, BIC, and aBIC indicated better model fit []. The Lo-Mendell-Rubin likelihood ratio test and BLRT were used to compare improvements in model fit between adjacent models, with a significant P value (P<.05) suggesting that the k-class model provided a better fit than the (k–1)-class model. Entropy values, ranging from 0 to 1, were used to evaluate classification quality, with values closer to 1 indicating clearer class separation. In addition, the average posterior probability of class membership was examined, with values ≥0.80 indicating good discriminability. To ensure the validity of the results, each class was required to comprise more than 5% of the total sample []. The uncertainty in the estimated latent profile proportions was quantified using 95% CIs, constructed via a nonparametric bootstrap approach with 1000 replications. This method is robust and does not rely on distributional assumptions, making it particularly suitable for latent variable models.
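To make the fit indices concrete, the sketch below computes AIC, BIC, and the sample-size adjusted BIC from a model log-likelihood, and the relative entropy from posterior class probabilities. It assumes the commonly used definitions (aBIC replaces n with (n+2)/24 in the BIC penalty; relative entropy is 1 minus the normalized classification uncertainty). The analysis itself was run in Mplus, so this is an illustration of the formulas rather than the authors' code, and all inputs shown are hypothetical.

```python
import math


def information_criteria(loglik, n_params, n_obs):
    """Return (AIC, BIC, aBIC) for a fitted model."""
    deviance = -2.0 * loglik
    aic = deviance + 2 * n_params
    bic = deviance + n_params * math.log(n_obs)
    # Sample-size adjusted BIC: n is replaced by (n + 2) / 24.
    abic = deviance + n_params * math.log((n_obs + 2) / 24)
    return aic, bic, abic


def relative_entropy(posteriors):
    """Relative entropy (0 to 1) from per-person posterior class probabilities."""
    n = len(posteriors)
    k = len(posteriors[0])
    if k < 2:
        raise ValueError("entropy is undefined for a 1-class model")
    # Total classification uncertainty; terms with p == 0 contribute nothing.
    uncertainty = -sum(p * math.log(p) for row in posteriors for p in row if p > 0)
    return 1 - uncertainty / (n * math.log(k))


# Hypothetical log-likelihood, parameter count, and sample size.
print(information_criteria(loglik=-100.0, n_params=5, n_obs=200))

# Hypothetical posterior probabilities for 3 people in a 2-class model.
print(relative_entropy([[0.95, 0.05], [0.10, 0.90], [0.99, 0.01]]))
```

Because the BIC penalty grows with ln(n), it penalizes additional profiles more heavily than AIC in a sample of several thousand, which is one reason the indices are read together rather than in isolation.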
Next, we performed ANOVA to compare DHT subscale scores across the 5 latent classes. Between-group differences in demographic, health, and occupational characteristics across DHT subtypes were assessed using χ² tests (for categorical variables) and ANOVA (for continuous variables). To examine the relationships between the identified DHT profiles and key variables, we performed multivariate multinomial logistic regression analyses. Multicollinearity was assessed using variance inflation factor analysis (Table S4 in ). These models assessed the associations between DHT profiles and various predictors, with statistical significance determined at P<.05 (2-tailed).
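As an illustration of the between-class comparison, the following sketch computes a one-way ANOVA F statistic together with η², the effect size reported in the Results. It assumes the usual decomposition into between-group and within-group sums of squares. The three groups are toy data, and the function name is hypothetical.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of groups."""
    all_values = [x for group in groups for x in group]
    n_total = len(all_values)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of observations around their own group mean.
    ss_within = sum(
        (x - mean) ** 2
        for group, mean in zip(groups, group_means)
        for x in group
    )

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    # Eta squared: share of total variance explained by group membership.
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


# Hypothetical indicator scores for 3 profiles.
toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_b, df_w, eta_sq = one_way_anova(toy_groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}, eta squared = {eta_sq:.2f}")
```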
Ethical Considerations
This study collected solely demographic and professional information, excluding any sensitive or personally identifiable biological data. The study protocol was approved by the Biomedical Ethics Committee of Xi’an Jiaotong University (approval no XJTUAE-2647). Electronic informed consent was obtained from all participants, and institutional authorization was granted by the Xi’an Municipal Health Commission. For the secondary analysis of the research data, we confirmed that the original ethical approval and consent procedures for the “2023 Healthcare Worker Survey” permitted the reuse of data for public health and policy studies without additional participant consent.
In this study, we prioritized the privacy and confidentiality of participants. The survey was designed to collect only nonsensitive information without any personally identifiable data. All data were deidentified at the time of collection, and analyses were conducted on aggregated datasets to prevent reidentification. Participants were not offered any form of compensation, as the survey was part of routine institutional activities. No images or multimedia materials that could lead to the identification of any individual are included in the paper or supplementary files.
Results
Descriptive Statistics and Correlations
A total of 4851 Chinese registered doctors from 46 health care facilities (including 26 Level-II hospitals and 20 Level-III hospitals) in Xi’an were analyzed in this study. The mean age was 38.37 (SD 8.67) years, with a range of 20 to 80 years. Among the participants, 2944 (60.69%) were female, and 1907 (39.31%) were male. In terms of education, 56.17% (2725/4851) held graduate degrees (master’s or doctoral degrees), while 43.83% (2126/4851) had a bachelor’s degree or below.
Among the 9 items in the DHT perception scale, the diagnosis and treatment quality indicator had the highest mean score of 3.98 (SD 0.78) in the benefit domain, while the income increase indicator had the lowest mean score of 3.08 (SD 1.01). In the barrier domain, the patient experience reduction indicator had the highest mean score of 3.80 (SD 0.96), whereas the workload increase indicator had the lowest mean score of 3.59 (SD 0.98). The mean score for the overall willingness indicator was 3.69 (SD 0.89). In terms of job-related scales, the mean scores for work satisfaction, occupational stress, and doctor-patient relationship perception were 44.30 (SD 9.69), 16.22 (SD 4.85), and 7.85 (SD 2.08), respectively. The bivariate correlations among the study variables are provided in Table S1 of . All indicators of DHT were moderately correlated; furthermore, compared to correlation analysis, LPA offers a more detailed characterization of Chinese doctors’ diverse perspectives on DHT.
Detecting Latent Profiles
The model fit statistics for the 1‐6 latent profile models are provided in Table 1. With an increase in the number of latent profiles, the AIC, BIC, and aBIC gradually decreased, and the BLRT showed significant results in comparisons between all models with k and k–1 classes. Although the class-6 model demonstrated the best fit based on AIC, BIC, aBIC, and entropy, one of its profiles included only 77 participants (1.6% of the total sample), which is below the 5% minimum class size, leading to the rejection of the class-6 model. Compared to the class-4 model, the class-5 model identified a new category with a distinct DHT-related response probability pattern. Based on its optimal balance of model fit and interpretability, the class-5 model was selected as the final solution. This model showed the highest entropy among the models that met the class-size criterion (0.883), indicating well-separated and mutually exclusive profiles. This finding is further supported by the high average posterior class probabilities provided in Table S3 in .
Table 1. Model fit indices for the compared latent profile analysis models evaluating digital health technology adoption among physicians in China (cross-sectional survey, 2023; N=4851).
| Model | AIC | BIC | aBIC | pLMR | pBLRT | Entropy | Profile 1, n | Profile 2, n | Profile 3, n | Profile 4, n | Profile 5, n | Profile 6, n |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Class-1 | 113430.03 | 113546.79 | 113489.59 | — | — | — | 4851 | — | — | — | — | — |
| Class-2 | 107352.93 | 107534.56 | 107445.59 | <.001 | <.001 | 0.760 | 2292 | 2559 | — | — | — | — |
| Class-3 | 102959.52 | 103206.02 | 103085.26 | <.001 | <.001 | 0.830 | 2326 | 617 | 1908 | — | — | — |
| Class-4 | 99087.54 | 99398.91 | 99246.38 | <.001 | <.001 | 0.882 | 1120 | 584 | 2485 | 562 | — | — |
| Class-5 | 96769.86 | 97146.10 | 96961.80 | <.001 | <.001 | 0.883 | 516 | 1003 | 2276 | 545 | 511 | — |
| Class-6 | 95262.60 | 95703.71 | 95487.65 | <.001 | <.001 | 0.889 | 528 | 77 | 1149 | 2082 | 498 | 517 |

AIC: Akaike information criterion.
BIC: Bayesian information criterion.
aBIC: adjusted BIC.
pLMR: P value for the Lo-Mendell-Rubin adjusted likelihood ratio test for K versus K–1 profiles.
pBLRT: P value for the bootstrapped likelihood ratio test.
—: not applicable.
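The numeric part of the model selection can be expressed as a simple rule: discard any solution whose smallest profile holds 5% or less of the sample, then prefer the lowest BIC among the rest. The sketch below applies that rule to the BIC values and profile sizes in Table 1. It is an illustration only; the authors also weighed interpretability, which a rule like this cannot capture, and the variable names are hypothetical.

```python
N_TOTAL = 4851
MIN_SHARE = 0.05  # each profile must exceed 5% of the sample

# BIC and profile sizes transcribed from Table 1.
table1 = {
    1: {"bic": 113546.79, "sizes": [4851]},
    2: {"bic": 107534.56, "sizes": [2292, 2559]},
    3: {"bic": 103206.02, "sizes": [2326, 617, 1908]},
    4: {"bic": 99398.91, "sizes": [1120, 584, 2485, 562]},
    5: {"bic": 97146.10, "sizes": [516, 1003, 2276, 545, 511]},
    6: {"bic": 95703.71, "sizes": [528, 77, 1149, 2082, 498, 517]},
}


def meets_size_rule(model):
    """True if the smallest profile exceeds the minimum share of the sample."""
    return min(model["sizes"]) / N_TOTAL > MIN_SHARE


# Keep only solutions that satisfy the class-size criterion.
candidates = {k: m for k, m in table1.items() if meets_size_rule(m)}

# Among those, pick the solution with the lowest BIC.
chosen = min(candidates, key=lambda k: candidates[k]["bic"])

smallest_share_6 = min(table1[6]["sizes"]) / N_TOTAL
print(f"smallest profile in the 6-class model: {smallest_share_6:.1%}")
print(f"selected number of profiles: {chosen}")
```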
The latent profile memberships showed significant differences in the means of the 8 indicator variables (as provided in Table S2 in ), and their characteristics are summarized in . The LPA was conducted to identify physician subgroups based on their standardized responses (on a 1–5 scale) across 3 key domains: Perceived Benefits, Adoption Barriers, and Behavioral Intention. The Perceived Benefits domain encompassed four indicators: (1) improved diagnostic and treatment quality, (2) enhanced patient trust and satisfaction, (3) error rate reduction, and (4) increased income. The Adoption Barriers domain included: (1) technical barriers, (2) cybersecurity risks, (3) workload increase, and (4) patient experience reduction. The Behavioral Intention domain measured the overall willingness to adopt. In the resulting profiles (Figure 1), higher scores in Perceived Benefits and Behavioral Intention indicate more positive perceptions and a greater likelihood of adoption, respectively. Conversely, higher scores in Adoption Barriers signify that physicians perceived these obstacles as more severe. The ANOVA and Bonferroni post hoc tests indicated that DHT subscale scores differed across all 5 classes (P<.001), with the “Error Rate Reduction” variable exhibiting the largest effect size (η²=0.627).
In , Class 1 (n=516, 10.64% of the sample; 95% CI 9.76%-11.52%) demonstrated a distinctive pattern characterized by high perceived benefits, high perceived barriers, yet positive overall willingness toward DHTs. This profile represents physicians who recognize both notable advantages and substantial risks of digital health tools, but tend to maintain a generally positive willingness to adopt and use these technologies. Their pattern could suggest a risk-aware yet largely optimistic approach to digital transformation, potentially serving as engaged evaluators who might help optimize DHT implementation while acknowledging its challenges. This unique profile was therefore classified as the “Reform-Adaptable” group.
Class 2 (n=1003, 20.68% of the sample; 95% CI 19.50%-21.86%) exhibited consistently low scores across all dimensions, suggesting generally skeptical attitudes toward DHTs. This profile appears to reflect physicians who perceive relatively minimal benefits while emphasizing substantial barriers, resulting in largely negative adoption intentions. Their resistance seems rooted in both practical concerns about implementation challenges and some fundamental doubts about the value of DHTs. This group was designated the “Negative” group.
Class 3 (n=2276, 46.92% of the sample; 95% CI 45.50%-48.34%) was characterized by moderate scores near the average on all subscales. We interpret this pattern as representing physicians who acknowledge both the advantages and limitations of DHTs without a firm stance. This neutral position likely entails a “wait-and-see” approach, where adoption is contingent on contextual factors such as organizational support and peer behavior. Based on this rationale, we identified this group as the “Neutral” profile.
Class 4 (n=545, 11.23% of the sample; 95% CI 10.33%-12.13%) presented a profile of low perceived benefits, low perceived barriers, and cautious overall willingness. These physicians appear to perceive limited advantages from DHTs while also minimizing implementation risks, resulting in generally low adoption intentions that seem based more on skepticism about the fundamental value proposition of DHTs than on specific implementation concerns. This group was therefore labeled the “Reform-Conservative” group.
Class 5 (n=511, 10.53% of the sample; 95% CI 9.66%-11.40%) displayed uniformly high scores across all subscales, implying favorable dispositions toward DHTs. This profile may represent physicians who recognize strong benefits, tend to minimize perceived barriers, and demonstrate relatively high adoption willingness. Their pattern suggests generally positive acceptance of digital transformation and potential leadership roles in promoting DHT implementation within their institutions. Consequently, this group was classified as the “Positive” group.
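As an illustrative check, the class shares and their 95% CIs can be approximated from the reported counts with a normal-approximation (Wald) interval for a proportion. This is a minimal sketch: the source does not state which interval method was used, so the Wald formula and the helper name `wald_ci` are assumptions; only the class sizes and N=4851 are taken from the text.

```python
import math

N = 4851  # total sample size reported in the text

# Class sizes as reported in the text.
CLASS_SIZES = {1: 516, 2: 1003, 3: 2276, 4: 545, 5: 511}


def wald_ci(count, total, z=1.96):
    """Return (share, lower, upper) in percent using the Wald approximation.

    Assumption: the paper does not name its CI method; Wald is used here
    only because it is the simplest common choice.
    """
    p = count / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * (p - half_width), 100 * (p + half_width)


for label, n in CLASS_SIZES.items():
    share, lo, hi = wald_ci(n, N)
    print(f"Class {label}: n={n}, {share:.2f}% (95% CI {lo:.2f}%-{hi:.2f}%)")
```

Under this assumption, the computed bounds fall within about 0.05 percentage points of the reported intervals; gaps of that size would be expected from rounding or from a slightly different interval method.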
Figure 1. Characteristics of the 5 digital health technology (DHT) adoption profiles identified by latent profile analysis among hospital-based physicians in China (cross-sectional survey, 2023; N=4851), based on patterns of Perceived Benefits, Adoption Barriers, and Behavioral Intention.
Comparison of Demographic and DHT Scales in Each Latent Profile
outlines the comparison of demographic and job-related variables across different latent profiles. Significant differences were observed among the 5 DHT classes for variables such as gender, education background, income level, professional and technical title, working hours per week, years of health care work experience, self-rated health, work satisfaction, doctor-patient relationship perception, and occupational stress (all P<.05). However, no significant differences were found for age and night shift status across the 5 DHT profiles.
Table 2. Association between identified digital health technology adoption profiles and demographic and occupational characteristics among physicians in China (cross-sectional survey, 2023; N=4851).
g ANOVA F tests are used for continuous variables; F (df1, df2).
h Chi-square tests (χ² tests) are used for categorical variables; Chi-square (df).
As shown in , the Positive group (Class 5) demonstrated significantly higher proportions of participants affiliated with Level-II hospitals (χ²₄=38.32; P<.001), holding resident physician titles (χ²₈=44.96; P<.001), and possessing bachelor’s degrees (χ²₄=15.50; P<.001) compared with other groups. Notably, this group also reported the highest mean scores in both work satisfaction (mean 51.49, SD 9.92) and occupational stress (mean 18.82, SD 5.75).
Multivariate Multinomial Regression Results
and show the associations between key predictors and latent profile membership, using the subsequent class in each column as the reference. Male physicians were less likely to belong to the Neutral (Class 3) and Reform-Conservative (Class 4) groups compared with both the Reform-Adaptable (Class 1) and Negative (Class 2) groups (all odds ratios [ORs] <1), but more likely to belong to the Positive group (Class 5) than to Class 4 (OR 1.39, 95% CI 1.05‐1.84; P=.02). Those with a master’s degree or higher were less likely to be in Class 4 than in Class 3 (OR 0.75, 95% CI 0.59‐0.96; P=.02). When using Class 2 as the reference, better self-rated health was significantly associated with higher odds of belonging to Class 1 (OR 1.21, 95% CI 1.03‐1.42; P=.02), Class 3 (OR 1.20, 95% CI 1.07‐1.34; P=.001), and Class 5 (OR 1.32, 95% CI 1.12‐1.55; P=.001). These graded associations indicate that gender, education, and self-rated health are important differentiating factors across distinct DHT perception profiles. However, contrary to expectations derived from existing literature, our findings revealed that age, professional title, and years of work experience did not significantly predict DHT adoption profile membership among physicians in the Chinese sample (all P>.05), suggesting important contextual differences in the determinants of DHT adoption.
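For readers who want to relate the reported ORs, CIs, and P values to one another, an approximate two-sided Wald P value can be recovered from an OR and its 95% CI. The sketch below is a hedged illustration, not the authors' procedure: it assumes the intervals are Wald intervals on the log-odds scale, and the function name `p_from_or_ci` is hypothetical.

```python
import math


def p_from_or_ci(odds_ratio, lower, upper, z_crit=1.96):
    """Approximate two-sided Wald P value from an OR and its 95% CI.

    Assumption: the CI is symmetric on the log scale (a Wald interval).
    """
    # Standard error of log(OR), recovered from the CI width on the log scale.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Male physicians, Class 5 vs Class 4 (values quoted in the text).
print(f"{p_from_or_ci(1.39, 1.05, 1.84):.3f}")
# Master's degree or higher, Class 4 vs Class 3 (values quoted in the text).
print(f"{p_from_or_ci(0.75, 0.59, 0.96):.3f}")
```

Both calls give values close to .02, in line with the P values reported for these two comparisons. Because the inputs are rounded to 2 decimals, the result is only approximate.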
Table 3. Multinomial logistic regression results (Part A) examining the demographic and occupational predictors of membership in the 5 digital health technology adoption profiles among Chinese physicians (cross-sectional survey, 2023; N=4851).
| Variable | Class 5 vs Class 1, OR (95% CI) | Class 5 vs Class 2, OR (95% CI) | Class 5 vs Class 3, OR (95% CI) | Class 5 vs Class 4, OR (95% CI) | Class 2 vs Class 1, OR (95% CI) |
| --- | --- | --- | --- | --- | --- |
| Age (years) | 0.99 (0.96‐1.02) | 0.98 (0.95‐1.00) | 1.00 (0.98‐1.03) | 0.99 (0.96‐1.01) | 1.01 (0.98‐1.04) |
| Gender (ref: female) | | | | | |
| Male | 0.89 (0.68‐1.18) | 0.93 (0.71‐1.19) | 1.23 (0.99‐1.54) | 1.39 (1.05‐1.84) | 0.96 (0.75‐1.22) |
| Educational background (ref: bachelor’s degree and below) | | | | | |
| Master’s degree and above | 0.90 (0.64‐1.27) | 1.01 (0.75‐1.36) | 0.94 (0.72‐1.22) | 1.25 (0.89‐1.75) | 0.90 (0.67‐1.20) |
| Hospital grade (ref: Level-II) | | | | | |
| Level-III | 0.57 (0.39‐0.82) | 0.66 (0.48‐0.90) | 0.80 (0.61‐1.05) | 0.56 (0.39‐0.81) | 0.86 (0.62‐1.20) |
| Professional title (ref: resident physician) | | | | | |
| Attending physician | 1.06 (0.72‐1.54) | 1.24 (0.88‐1.74) | 1.18 (0.88‐1.60) | 1.30 (0.87‐1.94) | 0.85 (0.61‐1.19) |
| Chief physician | 1.10 (0.63‐1.93) | 0.93 (0.56‐1.52) | 1.07 (0.69‐1.67) | 0.89 (0.51‐1.58) | 1.19 (0.73‐1.93) |
| Annual income level (ref: low) | | | | | |
| Middle | 0.90 (0.65‐1.25) | 0.99 (0.74‐1.32) | 0.82 (0.63‐1.06) | 0.72 (0.51‐1.01) | 0.91 (0.69‐1.22) |
| High | 0.92 (0.62‐1.36) | 1.01 (0.72‐1.44) | 0.78 (0.57‐1.07) | 0.43 (0.29‐0.63) | 0.90 (0.64‐1.27) |
| Working hours (ref: ≤48 h/wk) | | | | | |
| >48 h/wk | 0.78 (0.58‐1.03) | 0.74 (0.58‐0.96) | 0.89 (0.71‐1.11) | 0.60 (0.45‐0.80) | 1.04 (0.81‐1.34) |
| Night shifts (ref: ≤4 nights/time per month) | | | | | |
| >4 nights/time per month | 0.86 (0.65‐1.15) | 1.00 (0.78‐1.30) | 1.02 (0.81‐1.28) | 1.15 (0.86‐1.54) | 0.86 (0.67‐1.11) |
| Health care working experience (ref: ≤10 years) | | | | | |
| >10 years | 0.89 (0.57‐1.39) | 1.07 (0.73‐1.59) | 0.92 (0.65‐1.30) | 1.15 (0.74‐1.79) | 0.83 (0.57‐1.21) |
| Self-rated health status | 1.09 (0.91‐1.29) | 1.32 (1.12‐1.55) | 1.10 (0.95‐1.26) | 1.23 (1.02‐1.48) | 0.83 (0.70‐0.97) |
| Work Satisfaction Scale | 1.04 (1.02‐1.06) | 1.14 (1.12‐1.16) | 1.10 (1.09‐1.12) | 1.16 (1.14‐1.18) | 0.91 (0.90‐0.93) |
| Doctor-Patient Relationship Scale | 1.08 (1.01‐1.16) | 0.86 (0.81‐0.92) | 0.94 (0.89‐0.99) | 0.77 (0.72‐0.82) | 1.25 (1.17‐1.33) |
| Occupational Stress Scale | 1.26 (1.22‐1.30) | 1.18 (1.15‐1.22) | 1.13 (1.11‐1.16) | 1.12 (1.08‐1.15) | 1.07 (1.04‐1.09) |

a Class 1: Reform-Adaptable group.
b Class 2: Negative group.
c Class 3: Neutral group.
d Class 4: Reform-Conservative group.
e Class 5: Positive group.
f OR: odds ratio.
g Bolded ORs indicate significance.
h P<.05.
i P<.01.
Table 4. Multinomial logistic regression results (Part B) examining the demographic and occupational predictors of membership in the 5 digital health technology adoption profiles among Chinese physicians (cross-sectional survey, 2023; N=4851).
| Variable | Class 4 vs Class 1, OR (95% CI) | Class 4 vs Class 2, OR (95% CI) | Class 4 vs Class 3, OR (95% CI) | Class 3 vs Class 1, OR (95% CI) | Class 3 vs Class 2, OR (95% CI) |
| --- | --- | --- | --- | --- | --- |
| Age (years) | 1.00 (0.97‐1.03) | 0.99 (0.96‐1.01) | 1.02 (0.99‐1.04) | 0.99 (0.97‐1.01) | 0.98 (0.96‐1.00) |
| Gender (ref: female) | | | | | |
| Male | 0.64 (0.48‐0.85) | 0.67 (0.54‐0.84) | 0.89 (0.72‐1.10) | 0.72 (0.58‐0.90) | 0.76 (0.64‐0.89) |
| Educational background (ref: bachelor’s degree and below) | | | | | |
| Master’s degree and above | 0.73 (0.52‐1.01) | 0.80 (0.62‐1.05) | 0.75 (0.59‐0.96) | 0.97 (0.74‐1.25) | 1.07 (0.89‐1.30) |
| Hospital grade (ref: Level-II) | | | | | |
| Level-III | 1.01 (0.69‐1.48) | 1.17 (0.86‐1.59) | 1.43 (1.08‐1.89) | 0.71 (0.53‐0.95) | 0.82 (0.66‐1.01) |
| Professional title (ref: resident physician) | | | | | |
| Attending physician | 0.81 (0.55‐1.21) | 0.95 (0.68‐1.33) | 0.91 (0.67‐1.23) | 0.89 (0.67‐1.20) | 1.05 (0.84‐1.31) |
| Chief physician | 1.23 (0.70‐2.17) | 1.03 (0.65‐1.63) | 1.20 (0.79‐1.83) | 1.02 (0.67‐1.59) | 0.87 (0.63‐1.19) |
| Annual income level (ref: low) | | | | | |
| Middle | 1.25 (0.90‐1.76) | 1.38 (1.04‐1.82) | 1.13 (0.87‐1.47) | 1.10 (0.85‐1.43) | 1.21 (1.00‐1.46) |
| High | 2.15 (1.45‐3.18) | 2.38 (1.73‐3.26) | 1.83 (1.37‐2.45) | 1.17 (0.86‐1.59) | 1.29 (1.03‐1.62) |
| Working hours (ref: ≤48 h/wk) | | | | | |
| >48 h/wk | 1.30 (0.98‐1.73) | 1.25 (1.00‐1.59) | 1.48 (1.19‐1.83) | 0.88 (0.70‐1.10) | 0.84 (0.72‐1.00) |
| Night shifts (ref: ≤4 nights/time per month) | | | | | |
| >4 nights/time per month | 0.75 (0.56‐1.01) | 0.87 (0.69‐1.10) | 0.88 (0.71‐1.10) | 0.85 (0.68‐1.06) | 0.99 (0.83‐1.17) |
| Health care working experience (ref: ≤10 years) | | | | | |
| >10 years | 0.77 (0.50‐1.19) | 0.93 (0.65‐1.33) | 0.80 (0.58‐1.10) | 0.97 (0.69‐1.36) | 1.17 (0.91‐1.51) |
| Self-rated health status | 0.89 (0.73‐1.07) | 1.07 (0.92‐1.26) | 0.89 (0.77‐1.03) | 0.99 (0.86‐1.14) | 1.20 (1.07‐1.34) |
| Work Satisfaction Scale | 0.89 (0.88‐0.91) | 0.98 (0.96‐0.99) | 0.95 (0.94‐0.96) | 0.94 (0.93‐0.95) | 1.03 (1.02‐1.04) |
| Doctor-Patient Relationship Scale | 1.40 (1.30‐1.51) | 1.12 (1.06‐1.19) | 1.23 (1.16‐1.29) | 1.14 (1.08‐1.21) | 0.92 (0.88‐0.96) |
| Occupational Stress Scale | 1.13 (1.09‐1.16) | 1.06 (1.03‐1.09) | 1.02 (1.01‐1.04) | 1.11 (1.09‐1.14) | 1.04 (1.02‐1.06) |

a Class 1: Reform-Adaptable group.
b Class 2: Negative group.
c Class 3: Neutral group.
d Class 4: Reform-Conservative group.
e Class 5: Positive group.
f OR: odds ratio.
g Bolded ORs indicate significance.
h P<.05.
i P<.01.
Notably, several work-related patterns emerged from the analysis. Physicians from tertiary (Level-III) hospitals were significantly less likely to be in Class 5 than in Classes 1, 2, and 4 (OR 0.57, 95% CI 0.39‐0.82; OR 0.66, 95% CI 0.48‐0.90; and OR 0.56, 95% CI 0.39‐0.81, respectively; all P=.001), but more likely to be classified in Class 4 than in Class 3 (OR 1.43, 95% CI 1.08‐1.89; P=.008). Furthermore, higher income was strongly associated with membership in Class 4 compared with all other classes (vs Class 1: OR 2.15, 95% CI 1.45‐3.18; vs Class 2: OR 2.38, 95% CI 1.73‐3.26; vs Class 3: OR 1.83, 95% CI 1.37‐2.45; vs Class 5: OR 2.34, 95% CI 1.58‐3.48; all P=.001). Similarly, working more than 48 hours per week significantly increased the likelihood of belonging to Class 4 relative to Classes 2, 3, and 5 (OR 1.25, 95% CI 1.00‐1.59, P=.045; OR 1.48, 95% CI 1.19‐1.83, P=.001; OR 1.67, 95% CI 1.24‐2.22, P=.001, respectively). When compared with Class 2, members of Class 3 were more likely to have higher income levels (middle income: OR 1.21, 95% CI 1.00‐1.46, P=.047; high income: OR 1.29, 95% CI 1.03‐1.62, P=.03) yet less likely to work over 48 hours per week (OR 0.84, 95% CI 0.72‐1.00; P=.044).
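The Class 4 vs Class 5 estimates quoted above are not tabulated directly; Table 3 reports the reverse contrast (Class 5 vs Class 4). Swapping the outcome and reference classes in a multinomial logit inverts the OR and swaps the inverted CI bounds. The following minimal sketch shows that arithmetic using the rounded values printed in Table 3; the helper name `invert_or` is hypothetical.

```python
def invert_or(odds_ratio, lower, upper):
    """Swap outcome and reference class: OR -> 1/OR; CI bounds invert and swap."""
    return 1 / odds_ratio, 1 / upper, 1 / lower


# Table 3, Class 5 vs Class 4 (rounded values as printed).
TABLE3_CLASS5_VS_CLASS4 = {
    "high income": (0.43, 0.29, 0.63),
    ">48 h/wk": (0.60, 0.45, 0.80),
}

for name, (or_, lo, hi) in TABLE3_CLASS5_VS_CLASS4.items():
    inv, inv_lo, inv_hi = invert_or(or_, lo, hi)
    print(f"{name}: Class 4 vs Class 5 OR {inv:.2f} (95% CI {inv_lo:.2f}-{inv_hi:.2f})")
```

The inverted values land within a few hundredths of the Class 4 vs Class 5 figures given in the text; the residual differences are what one would expect from inverting already rounded table entries.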
Compared with Class 1, individuals with higher work satisfaction were more likely to belong to Class 5 (OR 1.04, 95% CI 1.02‐1.06), while those with lower work satisfaction showed greater probabilities of membership in Class 2 (OR 0.91, 95% CI 0.90‐0.93), Class 3 (OR 0.94, 95% CI 0.93‐0.95), and Class 4 (OR 0.89, 95% CI 0.88‐0.91). Higher occupational stress and more positive doctor-patient relationship perceptions were also significantly associated with membership in Classes 2, 3, 4, and 5 relative to Class 1 (all P=.001). When compared with Class 2, higher work satisfaction (OR 1.03, 95% CI 1.02‐1.04) and more negative doctor-patient relationship perceptions (OR 0.92, 95% CI 0.88‐0.96) predicted membership in Class 3, whereas lower work satisfaction (OR 0.98, 95% CI 0.96‐0.99) and more positive relationship perceptions (OR 1.12, 95% CI 1.06‐1.19) were associated with Class 4. Higher occupational stress elevated the probability of classification into both Class 3 (OR 1.04, 95% CI 1.02‐1.06) and Class 4 (OR 1.06, 95% CI 1.03‐1.09). Also, using Class 3 as the reference, higher work satisfaction reduced the likelihood of belonging to Class 4 (OR 0.95, 95% CI 0.94‐0.96), while more positive doctor-patient relationship perceptions increased it (OR 1.23, 95% CI 1.16‐1.29). All reported associations were statistically significant (P=.001).
Furthermore, compared with physicians in Classes 2, 3, and 4, those in Class 5 demonstrated distinct characteristics across 3 key domains. Specifically, Class 5 physicians showed significantly higher odds of severe occupational stress (OR range 1.12‐1.18; P=.001), reported greater work satisfaction (OR range 1.10‐1.16; P=.001), yet held less positive expectations regarding doctor-patient relationships (OR range 0.77‐0.94; P=.001; refer to and for details).
Discussion
Principal Findings
This study accomplished its 2 primary objectives by applying LPA to examine physicians’ adoption of DHTs. First, using a tripartite framework (Perceived Benefits, Adoption Barriers, and Behavioral Intention), the analysis identified 5 clinically meaningful profiles that moved beyond conventional classifications [,]: Reform-Adaptable (n=516, 10.64%), Negative (n=1003, 20.68%), Neutral (n=2276, 46.92%), Reform-Conservative (n=545, 11.23%), and Positive (n=511, 10.53%). Second, the analysis demonstrated that profile membership was systematically correlated with a range of key demographic and occupational factors, including gender, education, income, hospital tier, working hours, self-rated health, occupational stress, job satisfaction, and perceptions of doctor-patient relationships. This association confirms the substantial heterogeneity in DHT adoption among physicians. Given their pivotal role in implementing DHTs to enhance patient care [], this divergence warrants attention and further investigation. By identifying the specific factors linked to each profile, our findings provide an empirical basis for developing tailored implementation strategies that account for these distinct physician subgroups.
In this study, we found that levels of occupational stress and work satisfaction differed significantly across the 5 latent profiles. Specifically, physicians reporting relatively high occupational stress alongside high work satisfaction were more likely to belong to Class 5 (Positive group), a profile characterized by greater perceived benefits and fewer adoption barriers regarding DHT implementation. To interpret this seemingly counterintuitive association, we used the Job Demands-Resources framework [], which posits that high job demands can motivate the adoption of functional resources, including digital tools, to mitigate work pressure. Our findings support this mechanism: physicians in the Positive group indicated that DHTs contributed to improved work efficiency and better management of daily workloads, notably by facilitating remote consultations and streamlining follow-up processes. Rather than perceiving digital tools as additional burdens, these physicians used DHTs as strategic resources to maintain autonomy and reduce time-related pressures. This observation aligns with previous studies indicating that health care professionals under high workload demands often adopt efficiency-enhancing technologies, including automated electronic health records, to alleviate operational strain and prevent burnout [].
Furthermore, we found that the combination of high stress and high job satisfaction likely reflects a subgroup of physicians who are highly engaged and adaptive. In our sample, those with greater work satisfaction (often stemming from institutional trust and personal adaptability) were generally more receptive to technological innovations promising improved efficiency, such as telemedicine systems []. Thus, our results suggest that, for certain physicians, occupational challenges may not inhibit but could even stimulate willingness to adopt practical digital solutions.
A notable divergence emerged between these findings and those of previous studies in the Western context [,], which identified physician age as a significant predictor of DHT adoption patterns. One plausible explanation may lie in the comprehensive integration of digital technologies within China’s health care system. The mandatory adoption of health codes during the COVID-19 pandemic and the widespread implementation of internet-based consultation systems may have reduced age-related digital disparities among physicians, diminishing the influence of age as a distinguishing factor in DHT adoption. In addition, gender differences in DHT adoption patterns may reflect broader sociocultural dynamics within Chinese health care service systems. Female physicians, who comprised most of our sample, often bear disproportionate responsibilities for both clinical work and family care, which may limit their capacity to engage with new technologies that require additional training time. Previous studies suggest that women in health care settings, both in China and globally, tend to adopt a more cautious approach to technology adoption, prioritizing established practicality and reliability over novelty [,]. We also found that income level emerged as a significant predictor, likely reflecting structural aspects of China’s compensation system. Physicians in higher income brackets, often concentrated in specialized fields and tertiary hospitals, may perceive less economic incentive to adopt DHTs that could disrupt established workflows without immediate financial benefits. Conversely, physicians in lower-income segments might view DHTs as potential tools for improving efficiency and patient volume, thereby increasing earnings [].
Furthermore, while no significant differences were observed across professional titles, physicians working in secondary hospitals demonstrated a more positive perception of DHTs, reporting higher perceived benefits and lower barriers to adoption compared with those in tertiary hospitals. This divergence may reflect systemic differences within China’s tiered health care system. Physicians in tertiary hospitals frequently face overwhelming clinical workloads and academic pressures, which may contribute to innovation fatigue despite their greater access to technological resources. In contrast, secondary hospital physicians may perceive DHTs as strategic tools for enhancing institutional competitiveness and addressing resource constraints through telemedicine collaborations with tertiary centers. These findings suggest that implementing targeted DHT strategies in secondary hospitals could be particularly effective for improving service quality and patient satisfaction. For example, the COVID-19 pandemic catalyzed the widespread deployment of teleconsultation platforms to ensure continuity of care [,]. Videoconferencing enables not only remote patient monitoring but also real-time supervision of clinical teams by specialists from tertiary hospitals []. Evidence shows that many DHTs provide affordable platforms for grassroots hospitals to collaborate with advanced medical centers. Through structured initiatives, including clinician exchanges, treatment protocol standardization, and technical assistance, DHTs have significantly improved the quality of care at primary health care institutions and are strongly aligned with China’s tiered health care policy objectives [,]. These technologies help bridge resource gaps and expand access to specialized care, particularly for patients in secondary hospitals. 
The distinct patterns identified in this study, such as the reduced role of physician age and heightened receptivity in secondary hospitals, are shaped by China’s specific health care policy landscape [].
In fact, the national “Healthy China 2030” strategy explicitly prioritizes the integration of the internet, AI, and big data technologies throughout health care delivery []. This top-down mandate has catalyzed widespread institutional adoption of DHTs, creating an environment where exposure to digital tools is becoming universal. The rapid implementation of the health code system and telemedicine platforms during the COVID-19 pandemic, for instance, served as a form of nationwide digital training, which likely enhanced digital literacy among physicians of all demographic backgrounds and may have diminished conventional disparities associated with age []. Furthermore, as secondary hospitals are often direct targets of policy support and funding for digital capacity building, physicians in these settings report more positive perceptions of DHTs, viewing them as tools for professional advancement and better patient care. These findings may be generalizable to other health systems that use strong top-down digital integration policies and tiered care models, though local infrastructure and policy intensity would influence applicability.
Moreover, physicians with higher income levels, those working more than 48 hours per week, and those reporting more favorable doctor-patient relationships were more likely to belong to the Reform-Conservative group (Class 4), which perceived relatively low levels of both benefits and barriers associated with DHTs and maintained a conservative stance toward adoption. The association between more favorable doctor-patient relationships and membership in the Reform-Conservative group presents a theoretically intriguing paradox that merits elaboration. We believe this association arises because physicians with established positive patient relationships may perceive less need for DHTs that could potentially disrupt these carefully maintained interpersonal dynamics.
Within the Chinese health care context, where traditional relationship-centered models of care remain highly valued, physicians with strong patient relationships may view DHTs as potentially undermining the personal connection and trust they have cultivated. These physicians might perceive digital tools as introducing a layer of technological mediation into what they consider to be essentially human interactions, potentially diluting the emotional quality of care. Conversely, physicians experiencing challenges in patient communication might view DHTs as tools to enhance efficiency, standardize interactions, or overcome communication barriers, thus increasing their adoption motivation. This interpretation suggests that doctor-patient relationship quality operates not simply as a demographic variable but as a significant indicator of clinical satisfaction and practice style that consistently influences technology adoption decisions. Alternatively, this preference for traditional health care models may stem from the lack of observed improvements in service quality or efficiency post-DHT implementation in their settings, particularly among more clinically experienced physicians in demanding specialties such as neurosurgery, critical care, and emergency medicine. For these physicians, adapting complex workflows to incorporate DHTs may exacerbate feelings of burnout []. Similarly, in these demanding clinical environments, greater emphasis is placed on physicians’ technical competencies and their ability to deliver patient-centered health care services, which may consequently diminish their perceived need for DHTs [].
In contrast, the Reform-Adaptable group demonstrates a risk-aware yet optimistic approach, recognizing significant benefits despite acknowledging implementation barriers, resulting in consistently high adoption intentions. This group exhibits greater flexibility, often engaging in selective adoption of technologies with clear clinical advantages and actively participating in pilot programs. Policy measures should accordingly diverge: for Reform-Conservative physicians, efforts must demonstrate fundamental value through evidence-based outcomes and success stories, whereas Reform-Adaptable physicians may benefit from targeted support, technical assistance, and roles as digital champions to address specific workflow integration concerns.
In addition, many health care systems have failed to fully operationalize the targeted intervention capabilities of AI and digital solutions []. Across numerous institutions, the fundamental requirements for successful DHT implementation remain challenging, as issues of service accessibility, standardized protocols, safety guarantees, and system reliability are still not adequately addressed []. As technological advancements progress and clinical feedback from various departments informs iterative improvements to DHT systems, emerging technological breakthroughs—alongside evolving patient attitudes toward digital health care—may gradually shift the perspectives of more conservative practitioners and facilitate wider DHT adoption [].
Notably, approximately 31% of the physician cohort expressed significant concerns regarding DHT implementation barriers, particularly related to technological challenges, cybersecurity risks, increased workload, and potential negative impacts on patient experience. Consistent with previous comprehensive reviews [,,,], our study revealed that health care workers, regardless of the level of care or the specific technology involved, face recurring challenges related to infrastructure, technology, training, legal and ethical issues, time constraints, and workload increases. Furthermore, limitations on widespread DHT adoption are often rooted in health care workers’ anxiety about increased workload and disruptions to their established routines. This anxiety can contribute to professional burnout, which, in turn, threatens the long-term sustainability of these technologies [,]. These findings suggest that future development of DHTs should focus on thoughtfully integrating digital solutions with conventional clinical workflows to establish hybrid care delivery models that may help mitigate potential workload increases and burnout risks. To adequately address physicians’ concerns regarding DHT implementation, health care institutions should consider implementing tailored support systems. Specifically, customized training programs and continuing medical education initiatives designed to meet individual physicians’ competency needs and practice contexts could potentially reduce psychological barriers and facilitate more widespread, sustainable DHT adoption. Such personalized approaches may prove particularly valuable in addressing the varied adoption patterns identified in our study while maintaining clinical workflow integrity [].
While this study focuses on Chinese physicians, our findings reveal both parallels and distinctions with international contexts. Consistent with European findings, skepticism regarding the clinical value and workflow impact of DHTs was prevalent [,]. However, unlike US research emphasizing financial incentives, DHT adoption in China was more influenced by institutional support [,]. Comparisons with other Asian settings showed similar hospital-level effects, though these were more pronounced in China’s policy-driven system. This suggests that while core adoption mechanisms may be universal, specific drivers remain culturally and systemically distinct [].
Implications for Policy and Practice
The heterogeneity observed in DHT adoption profiles highlights the limitations of relying solely on efficiency-driven models and underscores the necessity of multidimensional assessment frameworks to guide successful DHT implementation within health care systems. The key distinction between these profiles lies in their 3D evaluation: Perceived Benefits, Adoption Barriers, and Behavioral Intention. The Reform-Adaptable group, despite perceiving high barriers, maintains a high willingness due to strong benefit perception and requires barrier-specific support. In contrast, the Reform-Conservative group shows low willingness driven by limited perceived benefits, necessitating value demonstration interventions. This perceptual divergence calls for tailored implementation strategies rather than uniform policies. Profile-specific recommendations are provided in Section 3 of .
Furthermore, this profiling framework enables the proactive management of systemic risks, such as workload intensification and burnout, particularly among overworked physicians (>48 h/wk) and conservative adopters. To ensure sustainable integration, especially in complex tertiary hospitals, health care systems must prioritize co-designed solutions that address critical implementation determinants such as interoperability, cybersecurity, and equitable workload redistribution. Consequently, policymakers can further support sustainable adoption by institutionalizing holistic adoption metrics that balance efficiency gains with medical workers’ well-being, ensuring that DHTs enhance rather than exacerbate pressures on the health care system. Consistent with the principles of the NASSS (Nonadoption, Abandonment, Scale-up, Spread, and Sustainability) framework, these strategies emphasize the need for context-adaptive implementation across technological, organizational, and professional dimensions, making them practical and scalable for long-term success [].
Strengths and Limitations
The current findings reveal heterogeneity among Chinese physicians, suggesting the potential value of tailored institutional measures and policies for DHT implementation. This study sought to introduce a person-centered analytical approach by using latent profile analysis, which moves beyond exclusive reliance on variable-centered methods to explore distinct typologies of physicians based on their multidimensional perceptions. This exploratory approach identified 5 potential subgroups, offering an alternative perspective for understanding adoption heterogeneity.
We developed and applied a preliminary 3D evaluation framework, encompassing perceived benefits, barriers, and overall willingness, to capture variations in adoption patterns. Furthermore, we examined how individual characteristics and occupational factors were associated with profile membership. The analyses indicated that the organizational context (eg, hospital tier) appeared to play a more prominent role than individual demographics in some profiles. These findings contribute to understanding physician acceptance within China’s policy environment and may offer a transferable methodological approach for examining technology adoption in other health care settings.
The typological framework itself represents a key innovation, offering a nuanced and actionable perspective for developing tailored interventions. For example, physicians in the Reform-Adaptable subgroup might benefit from barrier-reduction support, while those in the Reform-Conservative subgroup may require a clearer demonstration of technology value. The observed patterns around organizational determinants offer insights suggesting that national policy contexts might influence technology adoption pathways. By considering the characteristics of the different physician subgroups, health care administrators could explore ways to improve work environments, adjust workflows, and enhance DHT operational capabilities, potentially supporting physician engagement with DHT implementation.
Our study has several limitations that need to be acknowledged. First, the cross-sectional design of our study limits our ability to establish temporality and causality. While the selected evaluation indicators for DHT include both beneficial and adverse factors, future research must examine how health care professionals’ preferences evolve over time in order to support stronger causal inferences. Second, while this study benefits from a large sample size, its generalizability may be limited by the exclusive focus on physicians from Xi’an, China. Regions with different economic development levels, digital infrastructure, and policy implementation, both within China and globally, may demonstrate different adoption patterns. The digital health landscape varies significantly across health care systems in terms of funding, regulation, and technological readiness. However, the identified latent profiles and organizational influences reflect fundamental mechanisms that may transfer across similar contexts. Future research should validate these findings across diverse socioeconomic and cultural settings, particularly in rural areas and other countries with different health care models. Third, self-reported measures may involve social desirability bias, though anonymity was ensured. Future studies should include objective behavioral data.
Future Research Directions
As noted in previous research, health care professionals’ work environments significantly influence their adoption of DHTs. Consequently, we propose the following specific research directions. First, qualitative approaches such as in-depth interviews and focus groups could elucidate the reasons for resistance, particularly among physician subgroups skeptical of or negative toward DHTs. Second, longitudinal and mixed methods studies are warranted to explore how workplace factors—including job stress and doctor-patient relationships—shape DHT preferences over time, and how such preferences may, in turn, shape perceptions of the work environment. Finally, future research should expand the evaluation of DHT adoption willingness by integrating motivational factors such as incentive structures, professional fulfillment, and opportunities for personal development. This would support the creation of more nuanced typologies of physician engagement and help identify context-dependent barriers and facilitators across varied clinical settings.
Conclusion
This study used latent profile analysis to identify 5 distinct subgroups of Chinese physicians based on their perceptions of DHT adoption, providing a practical framework for designing precision interventions. While the profiles reveal considerable diversity in adoption attitudes, they also highlight unifying concerns about usability and professional autonomy that persist across all profiles. Our findings suggest divergent intervention pathways corresponding to these profiles. Reform-Adaptable physicians appear most likely to benefit from technical support and workflow integration, whereas Reform-Conservative physicians may respond better to compelling evidence of clinical value and peer success stories. These insights provide health care administrators and policymakers with empirically grounded guidance for developing tailored implementation strategies rather than relying on standardized approaches. Future research should validate the longitudinal stability of these profiles and assess tailored interventions through rigorous real-world trials. Ultimately, by embracing this nuanced understanding, health care systems can evolve from uniform implementation to precision enablement, thereby enhancing both the practical impact and responsible scalability of DHTs and addressing shared physician concerns.
The authors would also like to thank the editor and reviewers for their helpful suggestions and valuable comments. Most importantly, we thank all participating physicians for sharing their experiences amid demanding workloads. We confirm that no generative artificial intelligence tools were used in the preparation of this manuscript.
This study received financial support from multiple sources: the Leading Talents Project in Philosophy and Social Sciences, National Social Science Foundation of China (grant no 2022LJRC02), and the National Natural Science Foundation of China (grant nos 72374169 and 72474174).
The datasets generated or analyzed during this study are not publicly available, as they form part of an official health survey administered by the Shaanxi Provincial and Xi’an Municipal Health Commissions. However, the data are available from the corresponding author on reasonable request and with permission from the relevant health authorities.
Conflicts of Interest: None declared.
Edited by Amaryllis Mavragani, Stefano Brini; submitted 20.May.2025; peer-reviewed by Ahmed Tausif Saad, Judy Bowen, Kamel Mouloudj; final revised version received 10.Oct.2025; accepted 10.Oct.2025; published 26.Nov.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Rapid advancement of digital technology in medicine has led to the deployment of numerous digital health tools and solutions, transforming health care delivery. Digital health offers various benefits, including improved access to health care, tailored treatment, and reduced costs [,]. However, people with limited digital health literacy may struggle to utilize these services, creating disparities in the digital era []. Addressing such utilization challenges requires careful development and introduction of technology []. Accordingly, health care professionals need robust methods to assess patients’ knowledge, skills, and attitudes regarding digital health technology []. Well-tested, multifaceted tools are essential to measure digital health literacy, determine user-specific barriers, and identify methods and resources to facilitate technology use.
The concept of digital health literacy, originally proposed by Norman and Skinner in 2006 as “eHealth literacy,” was defined as “the ability to seek, find, understand, and appraise health information from electronic sources and apply the knowledge gained to addressing or solving a health problem” []. Since then, the notion has been refined and expanded [,-]. Norman and Skinner [] developed the eHealth Literacy Scale (eHEALS), which was the first digital health literacy assessment tool. Available in over 20 languages, eHEALS is the most frequently investigated instrument [,]. However, because eHEALS predates smartphones and the widespread use of social networking services, it has limited scope regarding Web 2.0 applications [,]. To bridge this gap, additional digital health literacy assessment tools have been developed and are now available [-].
Super-Aged Japanese Society: Health Care Issues
Population aging is a global phenomenon, with “super-aged society” referring to a society with more than 21% of its population being older adults [,]. In Japan, currently about one-third of the population is ≥65 years old. This trend leads to increasing numbers of patients with both physical and cognitive challenges []. While the average life expectancy is 83.2 years, healthy life expectancy falls short at 73.9 years []. Consequently, older adults in Japan typically rely on health care services during the last decade of their lives. The challenges posed by a shortage of health care resources and the burden on Japan’s universal health care system are evident and require urgent solutions [].
Background and Rationale for Developing a Japanese Version of the eHealth Literacy Questionnaire
Given the challenges of an aging society, implementing digital health services may alleviate some strain on the Japanese health care system. For example, the increasing prevalence of chronic diseases is one of the major challenges in aging societies, and using digital health services is recommended to properly self-manage these conditions [-]. If digital health services are designed to accommodate users’ digital health literacy, more people can benefit, and with targeted support for those facing difficulties, a broader population can be included. However, the status of digital health literacy in Japan, particularly among older adults, remains unclear. To address this, we translated the English version of the eHealth Literacy Questionnaire (eHLQ) into Japanese and applied it to people in Japan. We then evaluated the data to ensure its reliability. The eHLQ was chosen because it is a multifaceted instrument available in over 20 languages with widespread use in North America, Europe, and the Asia-Pacific region [,-]. An instrument that has been used around the world is useful because digital health services are often developed by global companies and used worldwide []. The eHLQ measures not only the ability to use digital technology but also user experience and users’ perceptions of the services []. The multifaceted nature of the eHLQ is useful for evaluating digital health literacy across diverse populations, including those with limited access to digital technology who may recognize such services only indirectly through relatives or the media. In addition, the eHLQ is now used in the international initiative “Health literacy development for the prevention and control of noncommunicable diseases (NCDs)” promoted by the World Health Organization (WHO) [], with NCDs being common reasons for health care visits among elderly individuals.
Objectives and Study Scope
This study detailed the translation process to obtain the Japanese version of the eHLQ and presented a rigorous psychometric analysis based on classical test theory and item response theory (IRT). Additionally, it examined digital health literacy within Japan’s population using comparative analysis based on demographic factors. Although the study did not evaluate implementation, we discussed considerations, based on the findings, for the future development and facilitation of digital health services. This study benefits not only health care workers but also developers and providers of digital health systems. In turn, it is also beneficial for general users and their respective communities.
eHealth Literacy Framework: Conceptual Framework of the eHLQ
This study was based on the eHealth Literacy Framework (eHLF) developed by Norgaard et al []. The eHLF was conceptualized from real-world observations through multiple workshops [], which distinguishes it from other frameworks in the same field [-,]. The eHLQ includes 35 items across the 7 scales of the eHLF []: (1) Using technology to process health information, (2) Understanding of health concepts and language, (3) Ability to actively engage with digital services, (4) Feel safe and in control, (5) Motivated to engage with digital services, (6) Access to digital services that work, and (7) Digital services that suit individual needs. Each item has 4 possible responses: strongly disagree, disagree, agree, and strongly agree, scored from 1 to 4, respectively. Scales 1 and 2 capture individuals’ skills and knowledge, scales 3-5 capture interactions between individuals and systems that influence perceptions, and scales 6 and 7 capture system characteristics that shape user experiences [,] (Figure 1).
Figure 1. 7 Scales of eHealth Literacy Questionnaire (eHLQ).
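The scoring rule described above can be illustrated with a minimal sketch. It assumes that a scale score is the unweighted mean of its item scores, which the text does not state explicitly, and it uses a hypothetical item-to-scale mapping (`DEMO_SCALE_ITEMS`), because the actual assignment of the 35 items to the 7 scales is not given here.

```python
from statistics import mean

# Response scoring as described in the text: 1 to 4.
RESPONSE_SCORES = {
    "strongly disagree": 1,
    "disagree": 2,
    "agree": 3,
    "strongly agree": 4,
}

# Hypothetical mapping for illustration only; NOT the real eHLQ item allocation.
DEMO_SCALE_ITEMS = {"demo scale A": [1, 2], "demo scale B": [3]}


def scale_scores(answers, scale_items):
    """Return the mean item score per scale (assumed scoring rule).

    answers: dict mapping item number -> response label
    scale_items: dict mapping scale name -> list of item numbers
    """
    return {
        scale: mean(RESPONSE_SCORES[answers[item]] for item in items)
        for scale, items in scale_items.items()
    }


demo_answers = {1: "agree", 2: "strongly agree", 3: "disagree"}
demo_result = scale_scores(demo_answers, DEMO_SCALE_ITEMS)
```

Under this assumption, every scale score stays within the 1 to 4 range of the response options, which is consistent with the scale means reported in the Results.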
Definition of “Digital Health”
While Norman and Skinner first used the term “eHealth” to describe their new concept of literacy, the term “digital health” has been used widely in recent years, including by the WHO and European Commission [,]. The term “digital health” encompasses not only eHealth (the use of information and communication technology for health) but also other health-related technologies, serving as an umbrella term []. Accordingly, the WHO defines digital health literacy as “the ability to search, find, understand and evaluate health information from electronic resources and to use the knowledge gained to solve health-related problems” []. In this study, we use the term “digital health” unless “eHealth” is specifically required.
Methods
Study Design
This study employed a sequential exploratory mixed methods design [], containing a qualitative Phase 1 and a quantitative Phase 2a, and was extended with a quantitative Phase 2b. In Phase 1, the English version of the eHLQ was translated into Japanese and culturally adapted. Phase 2a involved its psychometric assessment using classical test theory and IRT approaches. In Phase 2b, snapshots of digital health literacy in Japan were analyzed based on demographic comparisons. Phase 2b used the datasets collected during Phase 2a.
Phase 1: Translation and Cultural Adaptation of the eHLQ
Overview
The Phase 1 study followed the Translation Integrity Procedure (TIP; version 5), which was developed by Hawkins and Osborne []. TIP is intended to ensure that a translation is culturally and linguistically appropriate for the target audience, reads naturally for people with low literacy levels, and demonstrates measurement performance equivalent to that of the original source version []. A bilingual translator (YM) created the initial Japanese draft using a “translation management grid and item intents” provided by Swinburne University of Technology, the licensor of the eHLQ. The item intents described in the translation management grid were carefully considered during translation. The draft was then reviewed by a second bilingual translator (RS) for improvements. After revisions, the translators met to discuss linguistic and cultural equivalency between the English and Japanese versions. Upon agreement, a native English-speaking bilingual translator (MN), who was blinded to the original, back-translated the Japanese version.
Consensus Meetings and Cognitive Interviews
The forward and backward translations were sent to the original eHLQ developers for feedback. An online consensus meeting with the original eHLQ developers, held in February 2022, involved 7 experts in fields including digital health, public health, nursing, physiotherapy, medical education, pharmacy, and sociology. Each item was reviewed to ensure it aligned with the original context and intent. Attention was also paid to maintaining the versatile vocabulary used in the original eHLQ to ensure its longevity. Language consistency and item intent were verified, and revisions were made accordingly.
Cognitive interviews followed, using the consensus-approved version, to assess participant comprehension of the instructions, response format, and item content. Participants were recruited through personal contacts and their extended networks, purposefully selecting individuals diverse in region, degree of urbanization, age, gender, and education to minimize sampling bias. To maintain consistency, the interviews were performed by the first author. Participants received a 1000 JPY (equivalent to approximately US $6.50) gift card and reimbursement for any transportation costs. They completed interviews in person or online, reading each item aloud to ensure all terms and Kanji (Chinese characters used in Japanese writing) could be comprehended. To determine their understanding of cognitive factors underlying responses, participants were asked (in Japanese), “What were you thinking when you answered that question?” The following question was then asked (in Japanese) if needed: “Why did you select that response option?” The interviewer carefully took notes and confirmed participants’ comments on every item before moving to the next one. The interviews were also recorded with participants’ consent. Participants’ responses and comments were grouped by item and tabulated using an online spreadsheet platform, and the data were shared with the coauthors for review. Based on participant responses, if deemed necessary, the translations were adjusted to fit the items as intended. The final version after any adjustments following the interviews was shared with the original developers, and a final consensus meeting was held in April 2023.
Phase 2a: Psychometric Testing
Overview
The Japanese version of the eHLQ was administered to a large demographically representative sample to evaluate its psychometric properties. The Research Electronic Data Capture platform, developed at Vanderbilt University in the United States [,], was used for the survey. The questionnaire consisted of 35 eHLQ items, displayed with a maximum of 10 items per screen, demographic questions, questions about the frequency of information and communication technology (ICT) use, and questions about health status.
Recruitment
Participants were recruited through an online survey panel operated by a Japanese survey company (ASMARQ Co., Ltd., Tokyo, Japan) to ensure broad demographic representation, with screening criteria based on age, sex, location, and education. After screening, participants were directed to the main questionnaire online through Research Electronic Data Capture. Adults aged 18 years and older were eligible. Rather than recruiting only current older adults, we recruited the entire adult age range because aging societies encompass populations of all ages. Furthermore, the following were considered: (1) while the instrument was intended for long-term use, younger adults would age into the target cohort, and (2) older adults often rely on younger family members for help using digital health services, making younger adults relevant as well. The Ministry of Health, Labor, and Welfare, and Statistics Bureau of Japan defined those aged ≥65 years as elderly (comprising 36.1 million, 28.7% of the population) [,]. The participants were intentionally recruited from age groups in proportions mirroring the general population. Participants who completed the survey received points from the survey company. These points could be exchanged for a gift certificate worth approximately US $2. The web-based survey was conducted between July 3 and 17, 2023. At the time, internet users included 82.1%, 56.2%, and 26.4% of those aged in their 60s, 70s, and ≥80s, respectively []. Therefore, an in-person survey was conducted for those aged ≥65 years. Potential respondents were initially approached at hospitals affiliated with the authors’ institution. However, due to COVID-19 restrictions, an insufficient number of participants were available from this source, so participants were also recruited via “Silver Jinzai Center” facilities for older adults, which are nonprofit organizations in multiple regions of Japan that provide work for senior citizens in local communities. 
Participants received 2000 JPY (approximately US $13), including transportation fees, in line with the typical rate paid by the human resource center. The first author conducted in-person, face-to-face interviews with these participants and administered the eHLQ and the same demographic questions given to participants in the online survey. In total, a sample of over 500 participants was targeted, which was considered adequate for the measurement properties analyses [].
Classical Test Theory for Construct Validity
Confirmatory factor analysis (CFA) was performed using Mplus Version 8.1 (Muthén & Muthén, Los Angeles, CA, USA), with 1-factor and 7-factor models for the scales. Since eHLQ item scores are categorical, we used the weighted least squares mean and variance adjusted estimator, which is robust and suitable for categorical data []. The comparative fit index (CFI), standardized root mean square residual (SRMR), and standardized expected parameter change (SEPC) were obtained from Mplus using the MODINDICES (0) output option, with CFI >0.95 [] and SRMR ≤0.08 [] indicating acceptable model fit.
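As an illustration of the fit criteria, the following sketch applies the stated cutoffs (CFI >0.95 and SRMR ≤0.08) and shows the textbook definition of the CFI from model and baseline chi-square statistics. The function names and the numbers are hypothetical; in the study these indices were taken directly from Mplus output, and the chi-square values reported under mean- and variance-adjusted estimators are scaled, so a hand calculation may not reproduce the software value exactly.

```python
def comparative_fit_index(chi2_model, df_model, chi2_baseline, df_baseline):
    """Textbook CFI: 1 - max(chi2_M - df_M, 0) / max(chi2_M - df_M, chi2_B - df_B, 0)."""
    model_excess = max(chi2_model - df_model, 0.0)
    baseline_excess = max(chi2_baseline - df_baseline, model_excess, 0.0)
    if baseline_excess == 0.0:
        return 1.0  # neither model shows any misfit beyond its degrees of freedom
    return 1.0 - model_excess / baseline_excess


def acceptable_fit(cfi, srmr):
    """Apply the cutoffs stated in the text: CFI > 0.95 and SRMR <= 0.08."""
    return cfi > 0.95 and srmr <= 0.08


# Made-up chi-square values, for illustration only.
demo_cfi = comparative_fit_index(50.0, 40, 1040.0, 45)
demo_ok = acceptable_fit(demo_cfi, srmr=0.04)
```

Both conditions must hold for a scale to be judged as fitting acceptably; a CFI of exactly 0.95 does not meet the strict inequality.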
Item Response Theory
IRT, which is a statistical framework for comparing test versions using a standardized metric [], was applied to analyze item location and discrimination using Mplus Version 8.1 (Muthén & Muthén, Los Angeles, CA, USA). Boundary characteristic curves were plotted in Stata SE 18.0 (StataCorp, College Station, TX, USA) to visualize item difficulty, representing the probability of responses at various difficulty levels [].
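To make the idea of boundary characteristic curves concrete, the sketch below assumes a graded response model, a common choice for ordered 4-category items; the text does not name the specific IRT model, so this is an assumption. Each boundary curve gives the probability of responding in category k or higher as a function of the latent trait. The discrimination and threshold values are made up for illustration.

```python
from math import exp


def boundary_probs(theta, discrimination, thresholds):
    """P(response >= category k | theta) for each boundary, under a graded response model.

    thresholds: item location parameters, in increasing order
                (3 boundaries for a 4-category item).
    """
    return [1.0 / (1.0 + exp(-discrimination * (theta - b))) for b in thresholds]


def category_probs(theta, discrimination, thresholds):
    """Probability of each response category, as differences of adjacent boundaries."""
    bounds = [1.0] + boundary_probs(theta, discrimination, thresholds) + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]


# Made-up parameters: discrimination 1.5, thresholds at -1, 0, and 1.
demo_boundaries = boundary_probs(0.0, 1.5, [-1.0, 0.0, 1.0])
demo_categories = category_probs(0.0, 1.5, [-1.0, 0.0, 1.0])
```

A boundary curve crosses 0.5 exactly where the latent trait equals that boundary's location parameter, and a larger discrimination value produces a steeper curve at that point.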
Phase 2b: Descriptive Analysis
Score Analysis by Demographic Characteristics
Group differences in eHLQ scores were analyzed using ANOVA in SPSS (version 29.0; IBM, USA). P values less than .05 indicated statistical significance. Multiple comparisons were conducted using post hoc tests in SPSS with the Bonferroni correction. For 2-group comparisons, independent 2-sample t tests were performed in SPSS using 2-sided P values, with the equal variances assumption determined by the Levene test result (P≥.05: equal variances assumed; P<.05: equal variances not assumed).
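The decision rule for the 2-group comparisons can be sketched as follows. This is a minimal illustration, not the SPSS procedure itself: the Levene P value is taken as an input rather than computed, the function names are hypothetical, and the Bonferroni adjustment is shown in its common form of multiplying each P value by the number of comparisons and capping at 1.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x, y, levene_p, alpha=0.05):
    """Return (t statistic, degrees of freedom, label) for an independent t test.

    levene_p >= alpha -> pooled-variance (Student) t test
    levene_p <  alpha -> Welch t test with Welch-Satterthwaite degrees of freedom
    """
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)  # sample variances (n - 1 denominator)
    diff = mean(x) - mean(y)
    if levene_p >= alpha:
        pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = sqrt(pooled_var * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
        label = "equal variances assumed"
    else:
        se = sqrt(v1 / n1 + v2 / n2)
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )
        label = "equal variances not assumed"
    return diff / se, df, label


def bonferroni(p_values):
    """Bonferroni-adjusted P values: p * number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up data for illustration only.
demo_t, demo_df, demo_label = two_sample_t([2.0, 3.0, 4.0], [1.0, 2.0, 3.0], levene_p=0.60)
demo_adjusted = bonferroni([0.01, 0.04, 0.50])
```

The returned t statistic and degrees of freedom would then be referred to a t distribution to obtain the 2-sided P value, a step left to statistical software here.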
Cohen d was used to quantify effect sizes, calculated as d=(M₁–M₂) ⁄ SDpooled. Effect sizes were interpreted as medium (0.50 ≤ d <0.80) and large (d≥0.80), both of which were considered worth discussing. Effect sizes with Cohen d below 0.50 were considered small.
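A short sketch of this calculation, using the pooled standard deviation and the interpretation thresholds given above, is shown below. The function names and the example data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohen_d(group1, group2):
    """Cohen d = (M1 - M2) / SD_pooled, with SD_pooled from the two sample variances."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


def interpret_effect_size(d):
    """Thresholds from the text: small (<0.50), medium (0.50 to <0.80), large (>=0.80)."""
    size = abs(d)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    return "small"


# Made-up scores for illustration only.
demo_d = cohen_d([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
demo_label = interpret_effect_size(demo_d)
```

The sign of d only reflects which group is listed first, so the interpretation uses its absolute value.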
Permission for Translation
Translation of the eHLQ to other languages requires a translation license. The authors obtained permission from Swinburne University of Technology, which manages the license. The authors also obtained permission from Prof. Lars Kayser, the corresponding author of the original eHLQ manuscript [].
Ethical Considerations
The Institutional Review Board of Juntendo University Faculty of Health Science reviewed and granted approval (Approval No. 22‐015). For the face-to-face version of the survey, including both cognitive interviews and in-person surveys for psychometric analysis, participants received a printed information sheet. Participants gave written consent by signing the form. For the online survey, each prospective participant read an electronic information sheet before beginning the online questionnaire. Participants provided consent by clicking the “I agree” button. The survey could not be accessed without this affirmative action. Participants received modest incentives, which were described earlier. For privacy and confidentiality protection, no direct identifiers were collected. Participants were automatically given a random study ID, and the response file contained only this ID and the survey answers. The deidentified dataset was stored on a password-protected hard drive, which was stored in a locked cabinet in a building requiring a security card for entry.
Results
Phase 1: Translation and Cultural Adaptation of the eHLQ
Initial Translation and Consensus Meeting
The initial translation was conducted with a focus on clarity and naturalness of expression. Consequently, for items containing the term “technology,” the type of technology was sometimes specified, such as “medical digital devices” or “online services,” depending on the context, to reduce vagueness and confusion. Prior to the consensus meeting, the Japanese translation and back translation were sent to the Danish eHLQ developer team for review and feedback.
During the meeting, each translated item was discussed to confirm that it faithfully represented the intended meaning of the original version. The discussion covered the selection of semantically appropriate Japanese vocabulary (eg, correspond vs adapt), level of difficulty (eg, know vs be able to), and item intent (eg, “experience,” not “belief”). The Danish team explained that versatile vocabulary was intentionally chosen for longevity of use; therefore, as long as the wording made sense, the translation should be as close as possible to the original version. Accordingly, “medical digital devices” and “online services” were reverted to “technology.” Cultural adaptation was also considered while maintaining the intended meaning of each item. For example, in the item regarding participant data-sharing method, the Japanese version added the term “mainly” to indicate it does not mean “definitively always,” addressing the tendency of Japanese people to hesitate to clearly state abilities or preferences. In addition, words requiring confirmation during the cognitive interviews were listed. For example, the phrase “measurement about my body,” which was considered somewhat awkward, was flagged for checking in the subsequent cognitive interviews to ensure it would be correctly understood. The revised version was developed after the consensus meeting, shared with the Danish team, and subsequently approved for use in the cognitive interviews.
Cognitive Interviews
A total of 12 people participated in the cognitive interviews, comprising 6 men and 6 women aged 19-77 years with diverse educational backgrounds and from diverse locations. The participants’ ages were 19, 29, 40, 65, 69, and 71 years for men and 23, 38, 40, 56, 62, and 77 years for women. A total of 6 interviews were conducted in person, while the remaining 6 were performed online. Participants were from 9 different prefectures across the 7 regions of Japan, including 2 from remote islands. While most participants pointed out that some words or terms were unclear, 4 terms or phrases were frequently discussed.
eHealth system: The term “eHealth” was relatively new in Japan, so we initially translated it as “digital health system.” However, participants found it difficult to understand. Since “eHealth” was a novel term, participants gave close attention to it when it appeared in the instructions. Based on this feedback, we chose to use “eHealth system” (eヘルス・システム) in the Japanese version.
Technology: Initially, “technology” was translated directly, using the Japanese pronunciation. Many participants associated it with advanced medical technology, such as computed tomography and magnetic resonance imaging scans, rather than everyday digital technology, such as smartphones, internet services, home-use health devices, and so forth. To more accurately convey the concept, we replaced the term with “digital gijutsu” (デジタル技術), which translates to digital technology.
Best for me: In the English version of the eHLQ, this phrase means health care most suitable for the participant. It was first translated as “pittarina” (ぴったりな), a common colloquial term for “fits perfectly.” However, some participants found this vague, so we changed it to “saiteki” (最適), a more formal term meaning “best.”
My individual needs: Some participants questioned the meaning of this phrase, most likely because there is no direct equivalent in Japanese. After extensive discussions among the original eHLQ developers and Japanese translators, “kitai” (期待), meaning “expectations,” was adopted.
In contrast, the literal translation of “measurements about my body (自分の身体の測定値)” was initially thought to be difficult to understand because the combined use of the Japanese words for “measurements” and “my body” sounds somewhat awkward. However, all 12 participants understood it well. Therefore, this was not revised. The final Japanese version of the eHLQ can be found in .
Phase 2a: Psychometric Testing
Demographics and Digital Health Literacy Scores
Of the 785 participants who responded to the online survey, 444 completed the questionnaire. An additional 60 participants completed in-person, face-to-face interviews, yielding a total sample size of 504. Their mean age was 51.6 years (range 18‐88, SD 17.5), with 159 participants (31.5%) aged ≥65 years. Gender distribution included 257 (51.0%) male, 245 (48.6%) female, and 2 (0.4%) other. The participant recruitment flowchart is shown in Figure 2, and participant demographics are summarized in Table 1.
Figure 2. Participant recruitment flow for psychometric testing of the Japanese version of the eHealth Literacy Questionnaire (eHLQ).
Table 1. Interview method and demographic variables of participants (n=504).
Characteristics | Participants, n (%)
Method (age range, y)
  Online (18‐88) | 444 (88.1)
  In person face-to-face (65‐83) | 60 (11.9)
Age (mean 51.6, SD 17.5)
  18‐19 years | 8 (1.6)
  20s | 72 (14.3)
  30s | 75 (14.9)
  40s | 77 (15.3)
  50s | 74 (14.7)
  60s | 96 (19.0)
  70s | 94 (18.7)
  80s | 8 (1.6)
  ≥65 years | 159 (31.5)
Gender
  Male | 257 (51.0)
  Female | 245 (48.6)
  Other (“nonbinary” and “other”) | 2 (0.4)
Region of residence in Japan
  Hokkaido (northernmost island) | 22 (4.4)
  Tohoku (northeast) | 46 (9.1)
  Kanto (includes Tokyo) | 227 (45.0)
  Chubu (central region) | 53 (10.5)
  Kinki (west central region) | 73 (14.5)
  Chugoku and Shikoku (western region) | 52 (10.3)
  Kyushu and Okinawa (southern region) | 31 (6.2)
Degree of urbanization
  Special wards (Tokyo’s 23 wards) | 83 (16.5)
  Ordinance-designated city | 139 (27.6)
  City | 259 (51.4)
  Town or village | 23 (4.6)
Education
  Junior high school (ISCED[a] level 2) | 10 (2.0)
  High school (ISCED level 3) | 153 (30.4)
  Vocational school (ISCED level 5) | 50 (9.9)
  Junior college (ISCED level 5) | 49 (9.7)
  Technical college (ISCED level 5) | 1 (0.2)
  University (bachelor’s degree) (ISCED level 6) | 216 (42.9)
  Graduate school (master’s degree) (ISCED level 7) | 19 (3.8)
  Graduate school (doctoral degree) (ISCED level 8) | 6 (1.2)
Working hours per week
  Unemployed or retired | 151 (30.0)
  Less than 20 hours | 67 (13.3)
  20‐39 hours | 52 (10.3)
  Full-time (40+ h) | 189 (37.5)
  Working hours vary widely from week to week | 20 (4.0)
  Other than above (eg, student) | 22 (4.4)
  Unknown | 3 (0.6)
Type of ICT[b] use (at least once a week)
  Internet services[c] | 471 (93.5)
  Digital health services via internet[d] | 125 (24.8)
  Social network services[e] | 383 (76.0)
  Computer[f] | 390 (77.4)
  Smartphone | 450 (89.3)
  Mobile phone other than smartphone | 39 (7.7)
  Tablet computer device[g] | 112 (22.2)
  Internet access via TV | 112 (22.2)
  Home game consoles[h] | 93 (18.5)
  Other | 43 (8.5)
Self-rated health status
  Very good | 40 (7.9)
  Good | 137 (27.2)
  Normal | 257 (51.0)
  Bad | 65 (12.9)
  Very bad | 5 (1.0)
[a] ISCED: International Standard Classification of Education.
[b] ICT: information and communication technology.
[c] Examples given to the participants: internet browsing, searching, shopping, using email, and so forth.
[d] Examples given to the participants: booking appointments for a clinic, searching for medical information, using health apps on smartphones, and so forth.
[e] Examples given to the participants: Facebook, LINE, Twitter (X), mixi, Instagram, and so forth.
[f] Examples given to the participants: desktop computer, laptop computer.
[g] Examples given to the participants: iPad, E-reader, and so forth.
[h] Examples given to the participants: PlayStation, and so forth.
The mean eHLQ scores of the 7 scales ranged from 2.30 to 2.72. Participants reported the highest scores on item 35 in scale 5 (Motivated to engage with digital services) and the lowest on item 16 in scale 6 (Access to digital services that work). The summary of the results is shown in Table 2.
Table 2. Descriptive and psychometric properties of the 7 scales of the Japanese version of the eHealth Literacy Questionnaire (eHLQ; n=504).
[d] Scale 1: Using technology to process health information.
[e] Scale 2: Understanding of health concepts and language.
[f] Scale 3: Ability to actively engage with digital services.
[g] Scale 4: Feel safe and in control.
[h] Scale 5: Motivated to engage with digital services.
[i] Scale 6: Access to digital services that work.
[j] Scale 7: Digital services that suit individual needs.
Reliability
Internal consistency determined using Cronbach α exceeded 0.80 for all scales except for scale 2 (Understanding of health concepts and language), which scored 0.78, indicating reliability from acceptable to good ().
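For readers unfamiliar with the statistic, Cronbach α for a scale with k items can be computed from the item variances and the variance of the summed score, α = k/(k−1) × (1 − Σ item variances / total score variance). The sketch below implements this standard formula; the function name and the response data are made up for illustration and do not come from the study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach alpha for one scale.

    responses: list of respondents, each a list of k item scores
               (same item order for every respondent).
    """
    k = len(responses[0])
    item_columns = list(zip(*responses))  # one tuple of scores per item
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Made-up responses (4 respondents, 3 items), for illustration only.
demo_responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 3, 4],
]
demo_alpha = cronbach_alpha(demo_responses)
```

Values rise toward 1 as the items covary more strongly, which is why α is read as an index of internal consistency.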
Construct Validity
A 1-factor CFA showed good fit for the Japanese eHLQ across all scales based on CFI (≥0.96) and SRMR (≤0.04) values (). All items had significant factor loadings (≥0.50) (), with SEPC values for all 7 scales being <0.25, and 5 of these scales having SEPC values <0.20 (). Interfactor correlations were analyzed using a 7-factor model, showing a suitable range of 0.26-0.59. The model diagram is shown in .
Item Response Theory
IRT analysis demonstrated that estimated item locations were generally well distributed, except for items 6 and 8 in scale 3 (Ability to actively engage with digital services), and items 24 and 35 in scale 5 (Motivated to engage with digital services). Item discrimination values were all positive, ranging from 1.03 to 3.72, with the narrowest and widest ranges noted for item 14 (0.74‐1.31) and item 31 (2.73‐4.71), respectively (). Boundary characteristic curves indicated difficulty parameters around 0.5 on the latent trait scale, with slope steepness showing good item fit ().
Phase 2b: Descriptive Analysis
Demographic Group Comparisons of the eHLQ Scores
Participants were grouped by demographic characteristics for between-group comparisons. Regional and gender classifications showed differences in 1 scale each; however, post hoc analysis did not identify any specific group differences. Degree of urbanization and education level showed differences in 2 scales; these were also observed in the post hoc analysis, though effect sizes were small. The working hour classification showed differences in 4 scales, with the effect size of scale 6 (Access to digital services that work) being >0.75. The comparison showing a large effect size was between “unemployed or retired” and “other than above (eg, student),” and that showing a medium effect size was between “20‐39 hours per week” and “other than above (eg, student),” with the “other than above (eg, student)” group having higher mean scores. Age groups in 10-year ranges showed differences in all but scale 2 (Understanding of health concepts and language); post hoc analysis confirmed these findings, with 5 scales showing effect sizes considered worth discussing. The most frequently observed comparisons showing medium or large effect sizes were between the 20s and 60s age groups (3 scales) and between the 50s and 70s age groups (2 scales), with the 20s and 70s groups having higher mean scores. Differences in eHLQ scores by self-reported health status were observed across all 7 scales, with medium effect sizes found in the comparisons between “very good or good” and “bad or very bad” in 5 of the 7 scales. The group comparisons of eHLQ scores across demographic variables, the results of post hoc analysis, point estimates with 95% CIs, and effect sizes are presented in .
Two-group comparisons revealed that those aged ≥65 years scored higher on 3 scales than those aged <65 years; however, the effect sizes were all small. Individuals who reported using the internet at least once a week scored higher on scales 1 and 3, both of which relate to information and media literacy, with scale 3 showing a medium effect size. In contrast, people who used digital health services at least once a week scored higher than less frequent users on all 7 scales, with medium effect sizes. Participants with chronic disease(s) scored higher on 3 scales, but the effect sizes were small. The results of the 2-group comparisons are summarized in Table 3.
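The small and medium labels used here can be illustrated with a standardized mean difference. The sketch below assumes Cohen's d with a pooled SD and Cohen's conventional benchmarks (0.2, 0.5, 0.8); this section does not name the estimator or thresholds the authors used, so both are assumptions. The means and group sizes are the scale 1 values for digital health service use in Table 3, whereas the SD of 0.54 is a hypothetical value chosen only for illustration, because group SDs are not reported in this section.

```python
import math


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled SD of two groups."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


def magnitude_label(d):
    """Label |d| using Cohen's conventional benchmarks (an assumption here)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Means and group sizes from Table 3 (scale 1, digital health service use).
# The SD of 0.54 is hypothetical; it is not reported in this section.
d = cohens_d(2.76, 0.54, 125, 2.37, 0.54, 379)
print(round(d, 2), magnitude_label(d))
```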
Table 3. Two-group comparisons of the Japanese version of the eHealth Literacy Questionnaire (eHLQ) scores across various demographics (n=504).

| eHLQ scores | Scale 1^a | Scale 2^b | Scale 3^c | Scale 4^d | Scale 5^e | Scale 6^f | Scale 7^g |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Age (y) | | | | | | | |
| <65 (n=345) | 2.49 | 2.51 | 2.44 | 2.50 | 2.67 | 2.31 | 2.40 |
| ≥65 (n=159) | 2.42 | 2.65 | 2.33 | 2.64 | 2.83 | 2.28 | 2.42 |
| Mean difference | 0.07 | −0.14 | 0.10 | −0.13 | −0.15 | 0.03 | −0.02 |
| 95% CI lower | −0.04 | −0.23 | −0.01 | −0.22 | −0.24 | −0.07 | −0.12 |
| 95% CI upper | 0.18 | −0.01 | 0.21 | −0.05 | −0.06 | 0.12 | 0.08 |
| P value | .20 | <.01 | .07 | <.01 | <.01 | .59 | .68 |
| Levene test (sig.) | .60 | <.01 | .21 | <.01 | <.01 | <.01 | .03 |
| Effect size | 0.12 | 0.30 | 0.18 | 0.26 | 0.30 | 0.05 | 0.04 |
| ICT^h use (internet) | | | | | | | |
| At least once a week (n=471) | 2.48 | 2.57 | 2.42 | 2.55 | 2.73 | 2.30 | 2.41 |
| Less than once a week (n=33) | 2.26 | 2.41 | 2.12 | 2.47 | 2.61 | 2.26 | 2.38 |
| Mean difference | 0.22 | 0.15 | 0.30 | 0.08 | 0.12 | 0.05 | 0.03 |
| 95% CI lower | 0.02 | −0.08 | 0.10 | −0.10 | −0.07 | −0.14 | −0.17 |
| 95% CI upper | 0.42 | 0.39 | 0.51 | 0.27 | 0.30 | 0.24 | 0.23 |
| P value | .03 | .20 | <.01 | .37 | .22 | .63 | .79 |
| Levene test (sig.) | .13 | .02 | .34 | .08 | .19 | .69 | .76 |
| Effect size | 0.39 | 0.31 | 0.53 | 0.16 | 0.22 | 0.09 | 0.05 |
| ICT use (digital health services) | | | | | | | |
| At least once a week (n=125) | 2.76 | 2.79 | 2.64 | 2.78 | 2.95 | 2.57 | 2.69 |
| Less than once a week (n=379) | 2.37 | 2.48 | 2.33 | 2.47 | 2.64 | 2.21 | 2.31 |
| Mean difference | 0.39 | 0.31 | 0.31 | 0.31 | 0.31 | 0.36 | 0.38 |
| 95% CI lower | 0.28 | 0.22 | 0.20 | 0.21 | 0.21 | 0.26 | 0.27 |
| 95% CI upper | 0.51 | 0.41 | 0.43 | 0.41 | 0.41 | 0.47 | 0.49 |
| P value | <.01 | <.01 | <.01 | <.01 | <.01 | <.01 | <.01 |
| Levene test (sig.) | .04 | .17 | .09 | <.01 | .06 | .59 | .16 |
| Effect size | 0.72 | 0.66 | 0.55 | 0.63 | 0.61 | 0.71 | 0.71 |
| Chronic disease | | | | | | | |
| With chronic disease(s) (n=194) | 2.44 | 2.62 | 2.39 | 2.64 | 2.78 | 2.32 | 2.43 |
| No chronic disease (n=310) | 2.48 | 2.51 | 2.41 | 2.48 | 2.68 | 2.29 | 2.39 |
| Mean difference | −0.04 | 0.10 | −0.02 | 0.16 | 0.10 | 0.03 | 0.05 |
| 95% CI lower | −0.14 | 0.02 | −0.12 | 0.07 | 0.01 | −0.07 | −0.05 |
| 95% CI upper | 0.06 | 0.19 | 0.09 | 0.25 | 0.19 | 0.12 | 0.15 |
| P value | .44 | .02 | .78 | <.01 | .03 | .59 | .36 |
| Levene test (sig.) | .54 | .10 | .79 | .02 | .21 | .07 | .63 |
| Effect size | 0.07 | 0.21 | 0.03 | 0.31 | 0.20 | 0.05 | 0.08 |

^a Scale 1: Using technology to process health information.
^b Scale 2: Understanding of health concepts and language.
^c Scale 3: Ability to actively engage with digital services.
^d Scale 4: Feel safe and in control.
^e Scale 5: Motivated to engage with digital services.
^f Scale 6: Access to digital services that work.
^g Scale 7: Digital services that suit individual needs.
^h ICT: information and communication technology.
Discussion
Principal Results
The Japanese eHLQ was translated from the English version, and its reliability was assessed through psychometric analysis. Data collected from a representative sample of the Japanese population aged 18 to 88 years were analyzed using classical test theory, IRT, and comparative statistical methodologies. The results indicated that the Japanese eHLQ has strong-to-acceptable measurement reliability. Comparative analyses of demographic factors revealed that scores across all 7 scales differed among groups classified by self-reported health status and between groups classified by frequency of digital health service use. Age groups showed differences on 6 scales; however, 2-group comparisons (≥65 y vs <65 y) revealed that older adults scored higher on scales 2, 4, and 5, although the effect sizes were small.
Psychometric Analysis
Classical test theory and IRT analyses indicated the instrument was satisfactory or acceptable. All 35 items exhibited standardized loadings above 0.50, indicating that each item strongly represents its respective scale. A potential concern is that the eHLQ has some substantial interfactor correlations [,], which may be related to the high factor loadings. According to Kayser et al [], those correlations are likely caused by the scales sharing the same causal pathway while measuring different constructs. Since content differentiation among the scales has been theoretically supported [,,], this is unlikely to compromise the interpretation of the scale scores.
Regarding IRT analysis, items 6 and 8 on scale 3 were less well distributed compared to other items. These 2 items assess different levels of difficulty regarding the ability to engage with digital services. While item 6 assesses general knowledge of digital technology, item 8 evaluates practical performance ability with the technology. The results indicate that among Japanese participants, those with knowledge of digital technology overlapped with those who could use the technology. Other poorly distributed items included 24 and 35 on scale 5. These items assess motivation to engage with digital services and evaluate expectations of digital technologies, namely one for receiving services and the other for utilizing them. Since scores for scale 5 were generally high, this result may reflect characteristics of Japanese people.
Among the top 3 items with the highest item locations (items 14, 16, and 29), items 16 and 29 focus on digital health services that are either unavailable or have very limited availability in Japan, making them challenging. Item 14 examines the acquisition of advanced understanding sufficient to utilize health data in health care settings, which may have made participants reluctant to respond “agree” or “strongly agree.” Given that the eHLQ contains items with different difficulty levels, item 14 may help distinguish participants in greater detail. Apart from these items, the IRT analysis showed well-distributed responses.
Relationship Between eHLQ Scores and Participant Demographic Factors
Differences in eHLQ scores were examined across demographic variables. Several studies have reported that education level is associated with both ICT use and digital health literacy [-]. In this study, score differences were observed between the education level groups on some scales, particularly scales 1, 2, and 3, which examine participants’ skills and knowledge, and scale 6, which is associated with participants’ experiences with digital health services. However, the differences were minimal as indicated by small effect sizes. Furthermore, scores on scales 4, 5, and 7, which examine participants’ beliefs, motivations, perceptions, and expectations regarding digital health services, showed no differences. This partial effect of education level on eHLQ scores in Japan differs from findings in Taiwan and Serbia, where education level affected all 7 scales [,].
Analysis of internet use frequency and eHLQ scores revealed that individuals who used the internet at least once a week scored higher on 2 scales (), with the effect size for scale 3 (Ability to actively engage with digital services) being medium. However, since 93.5% (471/504) of the participants reported using internet services at least once a week, internet use alone may not be a reliable indicator of digital health literacy. In contrast, differences were observed across all 7 scales with medium effect sizes when comparing participants by their frequency of digital health service use. Since using digital health services can enhance digital health literacy [], these differences may become more pronounced over time.
Age is a known predictor of digital health literacy [,], which was also observed in this study (). However, when dividing participants into 2 age groups, the analysis revealed that differences between those under and those over 65 years old showed small effect sizes across all 7 scales (). These results may have been influenced by the multifaceted nature of the eHLQ assessment tool. Using an instrument that emphasizes internet operating skills and technological knowledge might yield different results. Additionally, older adults in Japan lived through Japan’s period of rapid economic growth, during which they witnessed remarkable technological advancements. As a result, even if their personal digital skills are limited, they may still hold positive attitudes toward digital technology.
Those who rated their health status as “very good” or “good” scored higher on all 7 scales, with 5 scales showing medium effect sizes. This result is consistent with a previous eHLQ study []. Interestingly, the 2-group comparison between participants with and without chronic disease(s) did not show notable differences. This suggests that self-reported health status was more important than actual disease status in relation to eHLQ scores ( and ).
Implications for Practice: Digital Health Services for Japan’s Super-Aged Society
Although implementation was not within the scope of this study, the following considerations may inform health care workers, system developers, and policy makers, as well as future research development.
Neither being over 65 years old nor having chronic disease(s) was linked to low eHLQ scores. Self-management is an established strategy for chronic disease care [], and several digital tools, such as apps, wearable devices, and remote monitoring systems, are available for this purpose []. Since digital health services are recommended for self-management of NCDs, or chronic diseases [,], employing these technologies for patients with chronic disease(s) may help alleviate the burden on the overloaded Japanese health care system [].
There are some potential risks to consider when promoting digital health services in Japan. Comparing 10-year age groups, participants in their 50s and 60s tended to score lower than other age groups (). Despite being relatively familiar with the internet [], these groups may still need support in using digital health services. While older adults often rely on younger family members for assistance in using digital services—a tendency that was also frequently noted during face-to-face interviews—these supporting generations may not always be able to provide adequate help. Another concern is the perception of security and safety in digital technology, assessed using scale 4 (feel safe and in control). Agreeing or strongly agreeing with those items would typically require some ICT knowledge. However, this result warrants caution, especially for people who scored well or average on scale 4 despite limited ICT usage. During face-to-face interviews with elderly people, the author (YM) observed participants often mumbling phrases like “It’s supposed to be” or “I want to believe so” while responding to items regarding security and safety. This could be due to Japanese cultural traits, such as hierarchical and conformist tendencies, which may inhibit critical thinking, so high scores on those items may be unreliable measures of understanding internet security []. Nevertheless, while a high score on scale 4 should not be seen as a barrier to facilitating digital health services, health care workers should be aware that users with high scores on scale 4 may still be vulnerable to internet security risks and not openly express concerns.
This study revealed that people who used digital health services at least once a week had higher digital health literacy. IRT analysis demonstrated that the response scores of individuals who reported technological knowledge overlapped with those of individuals who reported capability in using technology. This factor warrants caution, as current systems may be tailored to users with sufficient digital health literacy and may be unsuitable for those who do not regularly use these services. Developers of digital health services should aim to avoid complexity that requires high digital health literacy. Instead, technology should be designed to accommodate user expectations and compensate for gaps in skills, knowledge, or user experience, all of which can be assessed using the eHLQ. Due to advancements in information and communication technologies, the digital health literacy required of users is changing rapidly. Since we took great care to maintain the versatile vocabulary that the eHLQ uses to ensure its longevity, the instrument is expected to help monitor digital health literacy in Japan in the coming years.
Limitations
The survey excluded income level, since asking about income is considered impolite in Japanese culture []. Income questions might create a barrier between participants and the researcher, particularly in face-to-face interviews. Among the 159 participants aged ≥65 years, 99 (62.3%) participated online. To do so, they had previously registered with the survey platform, meaning they likely had better access to ICT than others in the same age group. For the face-to-face survey, participants recruited through human resource centers for older adults may have had less cognitive impairment and better overall health status compared to the average for their age group. Further investigation of older adults who require physical and cognitive support is necessary for a more comprehensive understanding of the impact of age on digital health literacy in Japan. Although the current university enrollment rate in Japan is 57.7%, the population in this study may have held a higher education level than the general Japanese population, with 47.8% (241/504) of the participants having International Standard Classification of Education levels >6.
Conclusions
Psychometric analysis showed that the Japanese version of the eHLQ is likely a reliable and effective tool for assessing digital health literacy in Japan. There were no notable differences between scores of those aged above and below 65 years, or those with and without chronic disease(s), as indicated by small effect sizes. Service providers should be aware of users’ digital health literacy—including skills, knowledge, expectations, and perceptions—as assessing these aspects is important for effectively promoting such services. The Japanese version of the eHLQ is well suited for assessing digital health literacy and is expected to be used to monitor this literacy and identify additional support needs, thereby potentially contributing to the health care system in Japan.
The authors thank Prof. Lars Kayser and Josefine Christensen for chairing the consensus meetings and providing key guidelines during the process of translating and culturally adapting the Japanese version of the eHealth Literacy Questionnaire, as well as for offering insightful comments on the manuscript. We thank Prof. Richard Osborne for his suggestions on the psychometric analysis. We also thank Kensuke Sato for technical support with Research Electronic Data Capture. Finally, we thank David Price of English Services for Scientists based in Hiroshima for proofreading. Generative artificial intelligence (AI) was used to improve the writing. The manuscript was first drafted by the authors and improved by AI-powered writing assistance, Grammarly (Grammarly, Inc., USA), and DeepL Write (DeepL, Germany). ChatGPT was occasionally used to consider better wordings and smooth sentences. The entire manuscript was then reviewed by a professional English proofreader. The authors have read the final version and approved it. The authors did not use generative AI tools for conceptualization, study design, reference searches, data analysis, tabulation, or figure creation.
This study was financially supported by JSPS KAKENHI Grant Number (23H05361), research fund from Murata Science and Education Foundation, research fund from the Taiyo Life Welfare Foundation, and Juntendo University Faculty of Health Care and Nursing research funds.
Questionnaire License Agreement: Swinburne University of Technology manages licenses to use the Japanese eHLQ. To use the tool, please contact Ms. Kerrie Paulger at kpaulger@swin.edu.au.
This research was funded by the Murata Science and Education Foundation, supported by Murata Manufacturing Co., Ltd., a developer of electronic devices, including digital health products. HD has received research funding from Imasen Electric Industrial Co. Ltd., Fujifilm Corporation, Philips Japan, Inter Reha Co. Ltd., Fukuda Denshi Co. Ltd., Kyocera Corporation, and AMI Co. Ltd.
Edited by Naomi Cahill; submitted 28.Nov.2024; peer-reviewed by Esther Metting, Richard Osborne; final revised version received 20.Oct.2025; accepted 27.Oct.2025; published 26.Nov.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Pain is defined as “an unpleasant sensory and emotional experience associated with, or resembling that associated with, actual or potential tissue damage” []. In pediatric health care, pain is one of the most frequently reported concerns, and when inadequately managed, it may lead to long-term physical, psychological, and developmental consequences [,]. These risks underscore the urgent need for effective and safe pain management strategies tailored for children.
Current clinical recommendations emphasize multimodal approaches that integrate both pharmacological and nonpharmacological strategies to optimize outcomes in the pediatric population [,]. Pharmacologically, ibuprofen is the most extensively studied nonsteroidal anti-inflammatory drug and is widely recognized for its efficacy and safety in acute pediatric pain []. However, best practice not only achieves effective analgesia but also aims to minimize risks by reducing overreliance on pharmacological interventions and incorporating evidence-based nonpharmacological approaches [,].
In this context, socially assistive robots (SARs) have emerged as a promising nonpharmacological intervention for alleviating pain and mitigating emotional distress in pediatric health care settings [-]. Through features such as embodiment, personalization, empathy, and attentional distraction, SARs provide emotionally supportive interactions without requiring physical contact []. Evidence indicates that SARs can reduce procedural pain, anxiety, and distress while promoting positive affect and supporting postoperative recovery [-].
This potential is particularly relevant in hospital environments, where children frequently undergo painful and distressing medical procedures, such as injections, blood draws, surgeries, and cancer treatments [-]. Inadequately managed pain and distress in these settings may contribute to delayed recovery, prolonged hospitalization, long-term psychological sequelae, and reduced treatment adherence []. Compared with outpatients, hospitalized children are more often exposed to repeated and invasive procedures, making effective emotional support and pain management especially critical [].
Despite the growing interest, most existing systematic reviews of SARs have focused on outpatient applications, particularly in mental health or short-term procedural contexts, such as vaccinations and dental visits [,,,]. A few meta-analyses have examined SARs in clinical settings for outcomes such as anxiety [], pain and negative affect during needle-based interventions [], and psychological well-being []. However, these meta-analyses included a blend of observer-rated and self-reported outcome measures. Because emotional responses are inherently subjective experiences [,] that are most accurately captured from the child’s own perspective, this study prioritized children’s self-reports.
Furthermore, research on human-robot interaction highlights that the clinical implementation of SARs requires careful consideration of ethical dimensions, such as safety, privacy, and autonomy [,]. Ethical concerns also include children’s potential emotional overdependence, unintentional attachment, and reduced meaningful human interaction, which are especially salient for younger patients undergoing emotional and social development [,]. However, these dimensions have received limited systematic attention in pediatric care.
To address these gaps, this systematic review with meta-analysis synthesizes findings exclusively from randomized controlled trials (RCTs) that evaluated the effectiveness of SARs in reducing pain and emotional outcomes, including anxiety, fear, and distress, among pediatric patients in hospital settings. In addition, this study provides a comprehensive synthesis of intervention design and contextual factors for future RCTs, ultimately improving clinical outcomes and enhancing children’s hospital experiences.
Methods
Study Design
This review was prospectively registered in PROSPERO (International Prospective Register of Systematic Reviews; CRD420251026751). This study followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 guidelines [] and the PRISMA-S (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Literature Search Extension) guidelines for literature searches (checklist provided in ) []. Before execution, the search strategy was peer reviewed by a senior medical librarian using the PRESS (Peer Review of Electronic Search Strategies) guidelines to ensure transparency, reproducibility, and methodological rigor []. Two reviewers independently conducted the study selection, risk of bias assessment, certainty of evidence appraisal, and data extraction. Discrepancies were resolved through discussions with a third reviewer and the corresponding author.
Eligibility Criteria
This review included RCTs that met the following eligibility criteria according to the PICO framework: (1) population (P): participants were children <19 years of age in hospital settings; studies focusing on children diagnosed with autism spectrum disorder were excluded, as previous research has already established the efficacy of SARs in this population []; (2) intervention (I): studies involved the use of SARs, excluding those focused on rehabilitation, training, or surgical applications; (3) comparison (C): studies included a control or alternative intervention; and (4) outcomes (O): the primary outcome was pain, and secondary outcomes were emotion-related responses.
Information Sources
A total of 8 electronic databases across 5 platforms were searched to identify relevant studies: PubMed (National Library of Medicine), MEDLINE (National Library of Medicine), Embase (Elsevier), Cochrane Library (Wiley), Scopus (Elsevier), IEEE Xplore Digital Library (IEEE Xplore), Health & Medical Collection (ProQuest), and ProQuest Dissertations & Theses A&I (ProQuest). To identify additional gray literature and unpublished studies, we searched the study registry ClinicalTrials.gov and manually screened the Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction. Both cited and citing references of relevant systematic reviews were examined by browsing their reference lists and using Google Scholar’s (Google LLC) citation function to identify additional eligible studies.
Search Strategy
An iterative search strategy was developed following the PRISMA-S extension for the transparent and reproducible reporting of literature searches. The strategy combined Medical Subject Headings, related terms, and free-text keywords using Boolean operators to optimize sensitivity and specificity. Search concepts were informed by the PICO framework and included terms related to “hospitalization,” “child,” “social robot,” “pain,” “distress,” “emotion,” “anxiety,” “fear,” and “well-being.” The search syntax was subsequently adapted to each database’s indexing system. The initial search was conducted on May 6, 2025, and the searches were rerun on October 7, 2025, to update the results. No language or publication date restrictions were applied. The details of the search strategies, including full line-by-line search strings, filters, parameters, search dates, and retrieval counts, are presented in .
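The way Boolean operators combine search concepts can be sketched as follows. This is a minimal illustration, not the authors' search string: the grouping of the listed terms into 4 concept blocks is an assumption, the helper `build_query` is hypothetical, and the sketch omits Medical Subject Headings and database-specific syntax.

```python
def build_query(concept_blocks):
    """Join synonyms with OR inside a concept, then join concepts with AND."""
    groups = []
    for terms in concept_blocks:
        quoted = [f'"{term}"' for term in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Terms are those named in the text; the block structure is assumed.
concepts = [
    ["hospitalization"],
    ["child"],
    ["social robot"],
    ["pain", "distress", "emotion", "anxiety", "fear", "well-being"],
]
query = build_query(concepts)
print(query)
```

Using OR within a concept broadens the search (sensitivity), while AND across concepts narrows it to records matching every concept (specificity).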
Selection Process
All references were imported into EndNote (version 21; Clarivate), and the duplicates were automatically removed. Titles and abstracts were independently screened by 2 reviewers, followed by full-text assessments based on predefined eligibility criteria. The reasons for exclusion are documented in . The overall selection process is illustrated in the PRISMA flow diagram in the Results section.
Quality Assessment
The methodological quality of the included RCTs was evaluated using the short version of the revised Cochrane Risk of Bias tool for randomized trials []. The risk of bias was assessed across 5 domains: randomization process, deviations from intended interventions, missing outcome data, outcome measurement, and selection of reported results. Each domain was rated as “low risk,” “some concerns,” or “high risk” of bias, and an overall judgment was made.
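One way to derive an overall judgment from the 5 domain ratings is sketched below. It uses a simplified worst-domain rule: any "high risk" domain gives an overall "high risk," otherwise any "some concerns" gives "some concerns." This is an assumption for illustration; the text does not state how the reviewers reached the overall judgment, and the tool also lets reviewers rate a trial as high risk when several domains raise some concerns. The function name `overall_judgment` is hypothetical.

```python
DOMAINS = (
    "randomization process",
    "deviations from intended interventions",
    "missing outcome data",
    "outcome measurement",
    "selection of reported results",
)


def overall_judgment(ratings):
    """Return the worst rating across the 5 domains (simplified rule)."""
    missing = [domain for domain in DOMAINS if domain not in ratings]
    if missing:
        raise ValueError(f"missing domain ratings: {missing}")
    values = [ratings[domain] for domain in DOMAINS]
    if "high risk" in values:
        return "high risk"
    if "some concerns" in values:
        return "some concerns"
    return "low risk"


# Hypothetical ratings for a single trial.
example = {domain: "low risk" for domain in DOMAINS}
example["outcome measurement"] = "some concerns"
print(overall_judgment(example))
```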
Certainty of Evidence
The certainty of evidence for each outcome was assessed using the GRADE (Grading of Recommendations, Assessment, Development, and Evaluation) approach []. Five domains were evaluated: risk of bias, inconsistency, indirectness, imprecision, and publication bias. Outcomes were rated as “high,” “moderate,” “low,” or “very low” certainty of evidence. The ratings were generated using the GRADEpro Guideline Development Tool [].
Data Extraction and Synthesis
The data extraction included study characteristics such as authors, year of publication, country, study objectives, sample size, study population, participant age, setting, type of SARs, intervention details, comparator, measurement tools, and main findings. All the included studies contributed to the narrative synthesis. For the meta-analysis, only studies that provided sufficient numerical data were eligible for pooling, regardless of whether the outcome was primary (pain) or secondary (emotional responses). Where such data (eg, means, SDs, and sample sizes) were incomplete, we attempted to contact the original study authors to obtain additional information. Data synthesis was conducted in two parts: (1) narrative synthesis, summarizing key characteristics and findings of all included studies; and (2) meta-analysis, performed for outcomes with adequate quantitative data.
Data Analysis
Meta-analyses were conducted using R version 4.2.1 (R Project for Statistical Computing). Pooled effect sizes were estimated using a random-effects model to account for anticipated heterogeneity []. The outcomes included pain, anxiety, distress, and fear. For each outcome, differences in means with corresponding 95% CIs were calculated to accommodate variability across measurement scales. Subgroup analyses or meta-regression were planned in the presence of substantial heterogeneity. Given the limited number of studies, the Hartung-Knapp-Sidik-Jonkman method was applied to adjust the SEs []. Between-study heterogeneity was quantified using the inconsistency index (I²), the between-study variance (τ²), and its SD (τ). In addition, 95% prediction intervals (PIs) were reported to indicate the expected range of effects in future studies, except for outcomes with very few studies []. Forest plots were generated to visualize the pooled effect sizes. Funnel plots were constructed to assess the small-study effect. As recommended, the Egger test was not performed for outcomes with fewer than 10 studies because of its low statistical power to detect true asymmetry [,].
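The pooling steps described above can be sketched in Python, although the authors' analysis was run in R. The sketch assumes a DerSimonian-Laird estimate of τ², which the text does not specify, and takes the t critical value (k−1 degrees of freedom) as an argument rather than computing it. The function name `random_effects_hksj` and the 3 study effects are hypothetical and are not taken from the included trials.

```python
import math


def random_effects_hksj(effects, variances, t_crit):
    """Random-effects pooling with a Hartung-Knapp-Sidik-Jonkman SE.

    tau^2 uses the DerSimonian-Laird estimator (an assumption here).
    t_crit is the two-sided t critical value for k - 1 degrees of freedom.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0

    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)

    # HKSJ variance: weighted residual variance scaled by (k - 1) * sum of weights.
    resid = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w_re, effects))
    se = math.sqrt(resid / ((k - 1) * sum(w_re)))
    return {
        "pooled": pooled,
        "tau2": tau2,
        "i2": i2,
        "se": se,
        "ci": (pooled - t_crit * se, pooled + t_crit * se),
    }


# Hypothetical mean differences and variances from 3 studies;
# 4.303 is the 97.5th percentile of the t distribution with 2 df.
result = random_effects_hksj([-2.0, -0.5, 1.0], [0.25, 0.25, 0.25], t_crit=4.303)
print(result)
```

With few studies, the t-based interval from this adjustment is typically wider than a normal-based interval, which is the reason for using it when the number of trials per outcome is small.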
Results
Literature Search
As illustrated in , a total of 1229 records were retrieved from 8 electronic databases (), with no additional records retrieved through other methods. After removing 216 duplicates, 1013 records remained for review. Title and abstract screening excluded 933 papers based on the predefined inclusion and exclusion criteria, resulting in 80 papers for full-text reviews. Of these, 67 were excluded because they did not meet the eligibility criteria (). Ultimately, 13 RCTs were included in this review. The details of the search strategies are presented in .
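The counts in this paragraph can be checked with simple arithmetic. The sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# Counts reported in the literature search paragraph.
retrieved = 1229
duplicates = 216
excluded_title_abstract = 933
excluded_full_text = 67

after_deduplication = retrieved - duplicates
full_text_reviewed = after_deduplication - excluded_title_abstract
included = full_text_reviewed - excluded_full_text

print(after_deduplication, full_text_reviewed, included)  # prints: 1013 80 13
```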
Figure 1. PRISMA flow diagram for the literature search. A total of 1229 records were retrieved from 8 databases and 1 record from citation searching. After removing 216 duplicates and screening titles or abstracts, 80 full texts were assessed. After 67 studies were excluded due to not meeting the criteria, 13 studies were included, with 7 studies providing sufficient data for meta-analysis. PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses.
Characteristics of Included Studies
The characteristics of the 13 included RCTs are shown in . All studies were published between 2013 and 2023 and were conducted in 6 countries: Canada, the United States, Italy, Iran, Turkey, and Taiwan. A total of 619 participants were enrolled (intervention group: 301 and control group: 318), with individual study sample sizes ranging from 11 to 103. Participants were aged 2-19 years, most of whom were of school age, and all were in pediatric hospital settings due to acute illness, chronic disease, or surgical procedures. Additionally, the settings in which the interventions were implemented were diverse. Two trials were conducted in emergency departments [,], 2 in surgical wards and operating rooms [,], 2 in oncology units or hematology clinics [,], 3 in pediatric wards [-], 1 in a postanesthesia care unit [], 1 in a radiology department [], 1 in a hospice unit [], and 1 in a hospital-based game room [].
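The participant totals can be reproduced from the per-study group sizes listed in Table 1. The sketch below copies the intervention group and control group sizes from that table; the dictionary keys are shorthand labels only.

```python
# (intervention group, control group) sizes as listed in Table 1.
group_sizes = {
    "Alemi 2016": (6, 5),
    "Ali 2021": (43, 43),
    "Beraldo 2019": (14, 14),
    "Chang 2023": (26, 26),
    "Franconi 2023": (30, 30),
    "Jibb 2018": (19, 21),
    "Lee-Krueger 2021": (45, 58),
    "Logan 2019": (13, 16),
    "Meghdari 2018": (7, 7),
    "Okita 2013": (9, 9),
    "Rossi 2022": (36, 37),
    "Topcu 2023": (42, 42),
    "Trost 2020": (11, 10),
}

total_intervention = sum(ig for ig, _ in group_sizes.values())
total_control = sum(cg for _, cg in group_sizes.values())
study_sizes = [ig + cg for ig, cg in group_sizes.values()]

print(len(group_sizes), total_intervention, total_control)
print(total_intervention + total_control, min(study_sizes), max(study_sizes))
```

The sums agree with the text: 301 and 318 participants in the intervention and control groups, 619 in total, and study sample sizes from 11 to 103.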
Table 1. Characteristics of the included RCTs^a, including author, publication year, country, study objectives, number of participants, participant characteristics, settings, measurements, and main results.

| Author (year), country | Objectives | Number of participants (IG^b/CG^c) | Study population | Age (years) | Setting | Measurements | Main results |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Alemi et al (2016) [], Iran | Exploring the effect of SARs^d as a therapy-assistive tool | 6/5 | Children with cancer receiving active therapy | 7-12 | Oncology unit in the hospital | MASC^e, CDI^f, and CIA^g | Improved anxiety, anger, and depression with emotional support. |
| Ali et al (2021) [], Canada | Effect of SARs during the invasive procedure | 43/43 | Require intravenous insertion | 6-11 | Emergency department | FPS-R^h and OSBD-R^i | Reduced distress; none in pain. |
| Beraldo et al (2019) [], Italy | Potential of SARs during invasive medical procedures | 14/14 | Inpatients prepared for invasive procedures (eg, spinal tap) | 3-19 | Hospice unit in the hospital | Emotion questionnaire | Overall, reduced negative feelings, increased positive emotions. Most rated the experience positively. |
| Chang et al (2023) [], Taiwan | Impact of SARs-assisted digital storytelling of intravenous procedure | 26/26 | Inpatients with intravenous access | 5-10 | Pediatric general ward in the hospital | MYPAS^j | Reduced anxiety and improved therapeutic communication, emotions, and engagement. |
| Franconi et al (2023) [], Italy | Potential of SARs during the preoperative preparation | 30/30 | Preparing to undergo surgery | 2-14 | Pediatric surgical ward and operating room in the hospital | CEMS^k | The intervention group showed significantly lower anxiety levels. |
| Jibb et al (2018) [], Canada | Impact of SARs during subcutaneous port access insertion | 19/21 | Children with cancer and a subcutaneous port underwent active therapy | 4-9 | Hematology clinic in a pediatric hospital | FPS-R, CFS^l, and BAADS^m | SARs were acceptable, but had no effect on pain or distress. |
| Lee-Krueger et al (2021) [], Canada | Effect of SARs support during intravenous induction | 45/58 | Required intravenous insertion before surgery | 4-12 | Operating room in a pediatric hospital | FPS-R and CFS | No significant differences in pain or fear across groups. |
| Logan et al (2019) [], United States | The feasibility and acceptability of SARs technology | 13/16 | Inpatient over 48 hours with cancer or surgery | 3-10 | General and hematology-oncology ward in a hospital | FPS-R, NRS^n, FAS^o, PANAS-C^p, and STAI-C^q | Children exposed to SARs reported more positive emotion. SARs were mostly acceptable. |
| Meghdari et al (2018) [], Iran | Acceptability and involvement of SARs assistance | 7/7 | Children with cancer receiving active therapy | 5-12 | Game room in the hospital | TS-SF^r and SAM^s | Revealed high engagement and interest of pediatric patients with cancer with the SARs. |
| Okita (2013) [], United States | Potential of SARs companions and involvement with family | 9/9 | Hospitalized female children | 6-16 | General ward in a hospital | WBFPRS^t and STAI-C | Significant reduction in pain and anxiety when children and parents engaged with SARs together. |
| Rossi et al (2022) [], Italy | Exploring the impact of SARs on stress before medical procedures | 36/37 | Waiting to access the medical office | 3-10 | Emergency department | Salivary cortisol levels and heart rate | Significant decrease in salivary cortisol levels and heart rate. The effect was stronger in girls. |
| Topçu et al (2023) [], Turkey | Effect of SARs on the postoperative recovery | 42/42 | Underwent day surgery | 5-10 | Postanesthesia care unit in a hospital | CSA^u | Significant group differences in postoperative anxiety and mobilization time. |
| Trost et al (2020) [], United States | Impact of an empathic SARs during intravenous insertion | 11/10 | Required intravenous insertion before MRI^v | 4-14 | Radiology department in a hospital | WBFPRS and CFS | Pain and fear significantly decreased over time. |

^a RCT: randomized controlled trial.
^b IG: intervention group.
^c CG: control group.
^d SAR: socially assistive robot.
^e MASC: Multidimensional Anxiety Children Scale.
^f CDI: Children’s Depression Inventory.
^g CIA: Children’s Inventory of Anger.
^h FPS-R: Faces Pain Scale-Revised.
^i OSBD-R: Observed Scale of Behavioral Distress-Revised.
^j MYPAS: Modified Yale Preoperative Anxiety Scale.
^k CEMS: Children’s Emotional Manifestation Scale.
^l CFS: Child Fear Scale.
^m BAADS: Behavioral Approach-Avoidance Scale.
^n NRS: Numeric Rating Scale.
^o FAS: Facial Affective Scale.
^p PANAS-C: Positive and Negative Affect Scales for Children.
^q STAI-C: State-Trait Anxiety Inventory for Children.
^r TS-SF: Transportation Scale-Short Form.
^s SAM: Self-Assessment Manikin Questionnaire.
^t WBFPRS: Wong-Baker FACES Pain Rating Scale.
^u CSA: children’s state anxiety.
^v MRI: magnetic resonance imaging.
Design of SARs Interventions and Comparators
The included interventions varied in terms of timing, frequency, and technological features (Table 2). Six studies implemented SARs before or during invasive procedures [,,,,,], 4 addressed broader hospital experience contexts [,,,], 2 focused on perioperative care (1 preoperative [] and 1 postoperative []), and 1 was conducted before a noninvasive procedure []. The intervention duration ranged from 3 to 40 minutes; 11 studies used a single session, while 2 adopted repeated sessions [,]. The SARs primarily provided distraction, cognitive behavioral strategies, and emotional companionship. Technical difficulties were reported in 4 studies [,,,], mainly due to connectivity or hardware malfunctions, with reported rates ranging from 9% (4/46) to 60% (26/43).
Table 2. Summary of interventions and comparators, including type of SARs^a, characteristics of intervention design, type of comparators, duration of intervention, and technical difficulties.

| Author (year) | Type of SARs | Interventions | Comparators | Duration | Follow-up | Technical difficulties |
| --- | --- | --- | --- | --- | --- | --- |
| Alemi et al (2016) [] | NAO | The hybrid-operated SARs engaged children through specific dialogue with a psychologist | Alternative intervention (only with a psychologist) | 5 min | 8 sessions | None reported |
| Ali et al (2021) [] | NAO | The SARs were programmed with self-introduction, breathing guidance, and dance during intravenous insertion | Standard care | 5-10 min | No | Occurred in 60% (26/43): connectivity, delays, tablet freezing, volume issues, shutdowns, or falls |
| Beraldo et al (2019) [] | Pepper | The hybrid operative SARs interacted with dialogue, gestures, games, and music during invasive procedures | Alternative intervention (Sanbot robot) | Not reported | No | None reported |
| Chang et al (2023) [] | Kebbi | Preprogrammed with digital storytelling during intravenous insertion | Standard care | 40 min | No | None reported |
| Franconi et al (2023) [] | NAO | Through hybrid operative programs of speech, singing, and play, and distracted attention before surgery | Standard care | Not reported | No | None reported |
| Jibb et al (2018) [] | NAO | SARs were preprogrammed with CBT^b strategies such as deep breathing and encouragement during subcutaneous port insertion | Alternative intervention (active distraction with NAO) | 7-10 min | No | 35% (14/40): connection loss, phrase repetition |
| Lee-Krueger et al (2021) [] | NAO | The SARs were preprogrammed to guide deep breathing exercises before intravenous induction for surgery | Standard care | 5-20 min (mean 10 min) | No | None reported |
| Logan et al (2019) [] | Huggable bear | Teleoperation to interact with children through speech, games, and touch | Alternative intervention (plush teddy bear) | 9-40 min (mean 26 min) | No | 9% (4/46): wireless interference, delays, malfunctions, and speaker failure |
| Meghdari et al (2018) [] | Arash | Telling stories through preprogrammed dialogue, expression, and gesture | Alternative intervention (an audiobook with the same stories) | 3 min | No | None reported |
| Okita (2013) [] | Paro | Accompanied by mom and interacted with autonomous SARs through contact | Alternative intervention (alone with the SARs) | 30 min | No | None reported |
| Rossi et al (2022) [] | NAO | The hybrid SARs engaged children with songs, stories, jokes, and riddles before the medical procedure | Standard care | 15 min | No | Background noise or mispronunciation required teleoperation |
| Topçu et al (2023) [] | Macrobot | In postoperative recovery, autonomous SARs encouraged and accompanied children during mobilization | Alternative intervention (nurses) | 4-10 min | 3 sessions | None reported |
| Trost et al (2020) [] | MAKI | During intravenous insertion, the SARs provided empathetic responses | Standard care | Not reported | No | None reported |

^a SAR: socially assistive robot.
^b CBT: cognitive behavioral therapy.
Across the 13 included RCTs, 6 studies compared the SARs interventions with standard hospital care. The remaining 7 studies used diverse comparators, including psychologist-led therapy [], another robotic platform [], an alternative SARs-based distraction program [], a plush teddy bear [], audiobooks delivering the same narratives [], being alone with the SARs [], and nurse-led postoperative recovery []. These variations in comparator conditions illustrate the heterogeneity of approaches in contextualizing the role of SARs in pediatric care.
Nine types of SARs were used in the included studies (Table 3). Their physical appearances can be broadly categorized as humanoid (eg, NAO by Aldebaran, Pepper by SoftBank, and Arash), animal-like (Huggable, and Paro by the National Institute of Advanced Industrial Science and Technology), or robot-like (Sanbot by Sanbot, Kebbi by Nuwa, MAKI, and Macrobot by Silverlit). Most SARs interacted with children through voice, gestures, and camera-based vision. Humanoid robots typically feature advanced functions, such as facial expression recognition and tactile feedback. The operational modes varied across autonomous, hybrid, and teleoperated systems. Cost information was available in only 2 studies: Arash (US $6000) [] and MAKI (US $2985) []. The price of Macrobot (US $27-$78) [] was obtained from commercial retail websites. For the other SARs, pricing information was obtained from the manufacturer’s specifications. Overall, 6 SARs were commercially available products, whereas Huggable and Arash were developed in research laboratories, and MAKI was custom-fabricated using 3D printing technology.
Table 3. Overview of SARs^a, including cost, appearance, interaction features, technical specifications, and type of operation.

| SARs | Cost (US $) | Appearance | Interaction features | Specifications | Type of operation |
| --- | --- | --- | --- | --- | --- |
| Arash [] | 6000 | Humanoid (134 cm tall and 24 kg) | Voice, vision, facial expression, and gesture | Microphones, sensors, facial expression recognition, voice localization, camera, and screen | Preprogrammed automation |
| Huggable bear [] | Not reported | Bear-like | Voice and gestures | Microphones, a camera, and fluffy | Teleoperated |
| Kebbi [] | 600 | Robot-like (32 cm tall and 2.5 kg) | Voice, vision, and gesture | Microphones, camera, screen, and touch sensor | Preprogrammed automation |
| MAKI [] | 2985 | Robot-like (34 cm tall and 2 kg) | Voice | Microphones, speech recognition, text-to-speech, and lights | Teleoperated |
| Macrobot [] | 27-78 | Robot-like (20 cm tall and 0.25 kg) | Gestures and people following | Obstacle sensor, battery-powered, and wheel | Automation |
| NAO [-] | 7500-13,000 | Humanoid (57 cm tall and 5.5 kg) | Voice, vision, and gestures | Microphones, camera, LED, text-to-speech, and face detection | Hybrid |
| Paro [] | 6000 | Seal-like (57 cm length and 2.7 kg) | Body movements react to stroking and cuddling | Microphones, fluffy, and touch sensor | Automation |
| Pepper [] | 32,000-49,900 | Humanoid (120 cm tall and 28 kg) | Voice, vision, gestures, animations, and people detection | Microphones, cameras, LED, touch sensors, and tablet screen | Hybrid |
| Sanbot [] | 8500 | Robot-like (90 cm tall and 19 kg) | Voice, vision, gestures, people detection and following, and animations | Microphones, cameras, LED, touch sensors, screen, and laser projector | Hybrid |

^a SAR: socially assistive robot.
Risk of Bias and GRADE Assessment
Eight studies were assessed as having some concerns regarding the overall risk of bias [-,,,,,], and 4 were assessed as having a high risk of bias [,,,]. The most frequent high-risk domains were deviations from the intended interventions (domain 2) and measurement of the outcome (domain 4; Figure 2). As the SARs intervention could not be blinded, some concerns were identified particularly in domain 2, where 1 trial [] was rated as high risk because its control group may have had an active role beyond that of a passive control, potentially influencing the comparison with the intervention group. Two other studies were rated as high risk in domain 4 because the individuals assessing the outcomes also participated in the intervention, which may have introduced observer bias [,]. Additionally, 1 trial was rated as having a high risk of bias from missing outcome data because it did not account for 2 missing participants [].
Figure 2. Summary of risk of bias assessments across 13 included RCTs [-]. The risk of bias was evaluated across 5 domains. Most of the studies were identified as having some concerns, with deviations from the intended interventions (domain 2) being the most prevalent source of bias. D: domain; RCT: randomized controlled trial.
According to the GRADE assessment, all outcomes were rated as moderate-certainty evidence (). Pain reduction showed moderate-certainty evidence when compared with both standard and alternative care. Anxiety and fear reduction were also rated as moderate, indicating potential benefits but inconclusive effects. Distress reduction was similarly rated as moderate, supported by a single trial. Overall, these outcomes are considered clinically important; however, the certainty of evidence was limited by the risk of bias and the small number of studies.
Narrative Synthesis
The outcomes of the 13 studies varied by domain (). Of the 6 studies that measured pain as the primary outcome, 1 observed a significant reduction [], whereas the other 5 [,,,,] reported no significant differences, reflecting mixed evidence regarding the analgesic benefits of SARs. Because participant and personnel blinding was not feasible in SARs interventions, 4 trials were rated as having some concerns, and 2 were rated as high risk owing to reporting bias and comparator response bias. Secondary emotion-related outcomes were anxiety, fear, distress, emotional engagement, state positive and negative emotion, and stress level. Stress-related physiological outcomes were assessed in 1 trial, which demonstrated significant decreases in both salivary cortisol and heart rate []. Anxiety outcomes showed clearer benefits, with 6 studies reporting significant reductions [,,,,,], although these studies had some concerns or a high risk of bias due to observer bias. Three studies reported no significant effect on fear [,,]. Of the 2 studies that assessed distress [,], only 1 reported a significant reduction []. For state emotions, SARs enhanced emotional engagement and positive emotions in 2 studies [,]. Additionally, 2 studies documented greater engagement with SARs and narrative immersion [,]. Detailed statistical findings of each study are presented in Table 4.
Table 4. Summary of statistical results across studies, including pain, anxiety, fear, distress, stress, and emotional engagement outcomes.

| Author (year) | Pain | Anxiety | Fear | Distress | Stress | Emotional engagement |
| --- | --- | --- | --- | --- | --- | --- |
| Alemi et al (2016) [] | NA^a | ↓^b (P=.002) | NA | NA | NA | NA |
| Ali et al (2021) [] | NS^c (P=.13) | NA | NA | ↓ (P=.047) | NA | NA |
| Beraldo et al (2019) [] | NA | ↓ (P=.047) | NS (P=.06) | NA | NA | NA |
| Chang et al (2023) [] | NA | ↓ (P<.05) | NA | NA | NA | ↑^d (P<.05) |
| Franconi et al (2023) [] | NA | ↓ (P=.03) | NA | NA | NA | NA |
| Jibb et al (2018) [] | NS (P=.07) | NA | NA | NS (P=.06) | NA | NA |
| Lee-Krueger et al (2021) [] | NS (P=.98) | NA | NS (P=.33) | NA | NA | NA |
| Logan et al (2019) [] | NS^e | NA | NA | NA | NA | NA |
| Meghdari et al (2018) [] | NA | NA | NA | NA | NA | ↑ (P<.03) |
| Okita (2013) [] | ↓ (P<.001) | ↓ (P<.01) | NA | NA | NA | NA |
| Rossi et al (2022) [] | NA | NA | NA | NA | ↓ (P<.01) | NA |
| Topçu et al (2023) [] | NA | ↓ (P=.005) | NA | NA | NA | NA |
| Trost et al (2020) [] | NS (P=.758) | NA | NS (P=.472) | NA | NA | NA |

^a NA: outcome not assessed.
^b ↓: significant decrease.
^c NS: nonsignificant.
^d ↑: significant increase.
^e The exact P value was not reported in the original study.
Meta-Analysis
Among the 13 included studies, 7 met the criteria for this meta-analysis, involving a total of 359 participants. Pain was the primary outcome, whereas anxiety, fear, and distress were secondary emotional responses (Table 5). All pooled estimates were calculated using the Hartung-Knapp-Sidik-Jonkman random-effects method, and PIs were displayed on the forest plots, except for outcomes with very few included studies, such as fear and distress. Funnel plots were generated for pain and anxiety to provide a visual assessment of small-study effects (). As the number of included studies was very limited (pain, n=5; anxiety, n=3; distress, n=2; and fear, n=2), no Egger tests were conducted [].
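For readers less familiar with the Hartung-Knapp-Sidik-Jonkman approach named above, one common formulation of the pooled estimate and its CI is sketched below. The notation is an assumption introduced here for illustration: y_i and v_i denote the mean difference and sampling variance of study i, k is the number of pooled studies, and τ̂² is the estimated between-study variance. The specific τ² estimator and software settings used in this review are not restated in this section, so this should be read as a general sketch rather than the exact procedure applied.

```latex
\begin{aligned}
w_i^{*} &= \frac{1}{v_i + \hat{\tau}^{2}}, \qquad
\hat{\mu} = \frac{\sum_{i=1}^{k} w_i^{*}\, y_i}{\sum_{i=1}^{k} w_i^{*}} \\[4pt]
\widehat{\operatorname{Var}}_{\mathrm{HK}}(\hat{\mu}) &=
\frac{\sum_{i=1}^{k} w_i^{*}\,(y_i - \hat{\mu})^{2}}{(k-1)\sum_{i=1}^{k} w_i^{*}} \\[4pt]
\text{95\% CI:}\quad & \hat{\mu} \pm t_{k-1,\,0.975}\,
\sqrt{\widehat{\operatorname{Var}}_{\mathrm{HK}}(\hat{\mu})}
\end{aligned}
```

Because the critical value comes from a t distribution with k−1 degrees of freedom, the interval widens sharply when only 2 or 3 studies are pooled, which is consistent with the wide CIs reported for fear and distress.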
Table 5. Summary of data extraction as mean (SD) from 7 studies in the meta-analysis, including outcomes: pain, anxiety, fear, and distress.

| Author (year) | Pain, IG^a | Pain, CG^b | Anxiety, IG | Anxiety, CG | Fear, IG | Fear, CG | Distress, IG | Distress, CG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Alemi et al (2016) [], mean (SD) | NA^c | NA | 1.89 (0.20) | 2.38 (0.43) | NA | NA | NA | NA |
| Ali et al (2021) [], mean (SD) | 2.71 (2.96) | 3.74 (3.08) | NA | NA | NA | NA | 0.78 (1.32) | 1.49 (2.36) |
| Jibb et al (2018) [], mean (SD) | 1.00 (2.30) | 1.40 (3.00) | NA | NA | NA | NA | 1.60 (1.30) | 1.40 (0.80) |
| Lee-Krueger et al (2021) [], mean (SD) | 2.74 (2.96) | 2.76 (2.97) | NA | NA | 1.13 (1.02) | 1.16 (1.26) | NA | NA |
| Okita (2013) [], mean (SD) | 2.78 (1.92) | 5.13 (2.30) | 1.64 (0.31) | 2.81 (0.53) | NA | NA | NA | NA |
| Topçu et al (2023) [], mean (SD) | NA | NA | 2.74 (2.6) | 4.5 (2.96) | NA | NA | NA | NA |
| Trost et al (2020) [], mean (SD) | 1.55 (0.30) | 2.47 (0.40) | NA | NA | 1.80 (1.33) | 2.10 (0.76) | NA | NA |

^a IG: intervention group.
^b CG: control group.
^c NA: outcome not assessed.
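The pooled results that follow are reported as differences in means. As a minimal sketch of how a single study's effect size can be derived from the mean (SD) values in Table 5, the snippet below computes the unstandardized mean difference (intervention minus control) and its standard error. The function name `mean_difference` is hypothetical, and the group sizes are assumed to equal the enrolled numbers in Table 1 (9/9 for Okita), which may differ from the numbers actually analyzed.

```python
import math


def mean_difference(mean_ig, sd_ig, n_ig, mean_cg, sd_cg, n_cg):
    """Return (MD, SE) for intervention minus control.

    SE uses the unpooled (Welch-type) variance: sd_ig^2/n_ig + sd_cg^2/n_cg.
    """
    md = mean_ig - mean_cg
    se = math.sqrt(sd_ig ** 2 / n_ig + sd_cg ** 2 / n_cg)
    return md, se


# Pain in Okita (2013): means and SDs from Table 5;
# group sizes ASSUMED from Table 1 (9 per group).
md, se = mean_difference(2.78, 1.92, 9, 5.13, 2.30, 9)
print(f"MD = {md:.2f}, SE = {se:.2f}")
```

A negative mean difference indicates lower pain scores in the intervention group, matching the sign convention of the pooled estimates reported below.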
Pain
A total of 5 studies [,,,,] contributed data to the meta-analysis of pain outcomes, as illustrated in Figure 3. The pooled analysis demonstrated a significant reduction favoring SARs interventions (difference in means=–0.89, 95% CI –1.32 to –0.47; 95% PI –1.29 to –0.49), with low heterogeneity (I²=11.9%, τ²<0.0001, τ<0.01, P=.34). One study [] contributed the largest weight (85.1%), attributable to its smaller variance. The funnel plot showed slight asymmetry ().
Figure 3. Forest plot of the effect on pain outcomes [,,,,]. KH: Knapp-Hartung correction.
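To make the weighting behind the pain estimate concrete, the sketch below pools the 5 pain studies with inverse-variance weights and computes Cochran's Q and I². It is an illustration under stated assumptions, not a reproduction of the authors' analysis: means and SDs come from Table 5, group sizes are assumed to equal the enrolled numbers in Table 1 (the analyzed sizes may differ), and because the reported τ² is close to 0, random-effects weights are approximated by plain inverse-variance weights. The function and variable names are hypothetical.

```python
# (mean_IG, sd_IG, n_IG, mean_CG, sd_CG, n_CG)
# Means and SDs from Table 5; group sizes ASSUMED from Table 1.
PAIN = {
    "Ali 2021": (2.71, 2.96, 43, 3.74, 3.08, 43),
    "Jibb 2018": (1.00, 2.30, 19, 1.40, 3.00, 21),
    "Lee-Krueger 2021": (2.74, 2.96, 45, 2.76, 2.97, 58),
    "Okita 2013": (2.78, 1.92, 9, 5.13, 2.30, 9),
    "Trost 2020": (1.55, 0.30, 11, 2.47, 0.40, 10),
}


def pool_inverse_variance(studies):
    """Inverse-variance pooling of mean differences (tau^2 taken as 0).

    Returns the pooled mean difference, each study's share of the total
    weight, Cochran's Q, and I^2 (as a proportion between 0 and 1).
    """
    effects, weights = {}, {}
    for name, (m1, s1, n1, m2, s2, n2) in studies.items():
        effects[name] = m1 - m2                      # mean difference
        variance = s1 ** 2 / n1 + s2 ** 2 / n2       # sampling variance
        weights[name] = 1.0 / variance               # inverse-variance weight
    total = sum(weights.values())
    pooled = sum(weights[k] * effects[k] for k in studies) / total
    q = sum(weights[k] * (effects[k] - pooled) ** 2 for k in studies)
    df = len(studies) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    shares = {k: weights[k] / total for k in studies}
    return pooled, shares, q, i2


pooled, shares, q, i2 = pool_inverse_variance(PAIN)
print(f"Pooled mean difference: {pooled:.2f}")
for name, share in shares.items():
    print(f"  {name}: {share:.1%} of total weight")
print(f"Q = {q:.2f}, I^2 = {i2:.1%}")
```

Under these assumptions, the pooled mean difference is about –0.89 and one small-variance study carries roughly 84% of the weight, close to the reported 85.1%. The I² from this sketch (about 16%) does not exactly match the reported 11.9%, as expected given the assumed group sizes and the simplified weighting.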
Anxiety
A total of 3 studies [,,] contributed to the meta-analysis of anxiety outcomes, as illustrated in Figure 4. The random-effects model yielded a nonsignificant pooled effect (difference in means=–1.00, 95% CI –2.44 to 0.44; 95% PI –3.45 to 1.45), with substantial heterogeneity (I²=73.8%, τ²=0.2172, τ=0.466, P=.02). The funnel plot appeared symmetrical ().
Figure 4. Forest plot of the effect on anxiety [,,]. KH: Knapp-Hartung correction.
Fear
A total of 2 studies [,] contributed to the meta-analysis of fear outcomes, as illustrated in the forest plot (Figure 5). The pooled analysis showed no significant effect of SARs interventions (difference in means=–0.04, 95% CI –1.72 to 1.64), with no detected heterogeneity (I²=0%, τ²=0, P=.53).
Figure 5. Forest plot of the effect on fear [,]. KH: Knapp-Hartung correction.
Distress
A total of 2 studies [,] contributed to the meta-analysis of distress outcomes, as illustrated in Figure 6. The pooled analysis showed no significant effect of SARs interventions (difference in means=–0.23, 95% CI –6.00 to 5.54), with substantial heterogeneity (I²=65%, τ²=0.2693, τ=0.519, P=.09).
Figure 6. Forest plot of the effect on distress [,]. KH: Knapp-Hartung correction.
In summary, the meta-analysis provides evidence that SARs interventions may effectively reduce pain for children in the hospital. By contrast, the findings for anxiety, fear, and distress remain inconclusive due to nonsignificant pooled effects and considerable heterogeneity across studies.
Discussion
Principal Findings
This systematic review and meta-analysis synthesized evidence from 13 RCTs to evaluate the effectiveness of SARs in reducing pain and emotional outcomes, including anxiety, fear, and distress, among pediatric patients in hospital settings. Beyond the meta-analysis, our review conducted a comprehensive narrative analysis, integrating intervention characteristics and contextual factors to provide an understanding of real-world clinical implementation and future research design. Overall, the pooled analysis suggested that SARs interventions may offer beneficial effects for pain reduction, whereas their impact on emotional outcomes was not statistically significant. However, these findings should be interpreted with caution, given the presence of some concerns and high risks of bias in several domains, as well as the overall moderate certainty of evidence. Importantly, these results have practical relevance for health care providers and researchers, offering insights for future clinical implementation and study design aimed at adopting SARs as child-friendly and effective adjuncts in pediatric hospital care.
Pain
SARs interventions demonstrated a statistically significant reduction in children’s pain, providing moderate-certainty evidence that such interventions may help alleviate pain in hospital settings. Among the 5 studies synthesized, 1 trial [] was rated as high risk due to reporting bias and lack of blinding, while the others were rated as having some concerns. Notably, this high-risk study accounted for a large weight in the meta-analysis, suggesting that the pooled effect for pain may be disproportionately influenced by it and should therefore be interpreted with caution.
The PI was slightly narrower than the CI but consistent with it in direction and magnitude. As noted in prior studies [,], a narrow PI may indicate low between-study heterogeneity; in this study, it could also reflect the large weight of a single trial, which dominates the pooled estimate and reduces the observed variability. This pattern suggests that similar beneficial effects may be observed under comparable conditions, but the limited evidence base warrants a conservative interpretation of these findings.
From a clinical perspective, these results imply that when intervention protocols, implementation settings, and participant characteristics are similar, clinicians may expect consistent and meaningful pain reduction with the use of SARs. In practice, SARs can provide distraction, emotional support, and engagement as adjuncts to standard pain management strategies. The combination of a statistically robust pooled effect and PI offers moderate yet credible evidence that SARs can reduce children’s pain perceptions during hospital-based procedures.
However, the duration of SARs interventions varied considerably across studies, revealing a lack of standardization in exposure time. Due to this variability, a dose-response relationship between intervention length and pain reduction could not be established. While short, single-session interventions may be well-suited for acute procedural pain, current evidence remains insufficient to confirm sustained benefits for children undergoing longer hospital stays. Collectively, these findings position SARs as promising, child-friendly adjuncts within multimodal pediatric pain management, though further methodologically rigorous and well-powered RCTs are needed to consolidate their clinical credibility, optimize implementation protocols, and determine long-term therapeutic potential.
Anxiety, Fear, and Distress
The emotional outcomes revealed a more complex and context-dependent pattern than the primary pain outcomes. Among the studies included in this review, SARs interventions appeared effective in reducing children’s anxiety when both self-reported and observer-rated measures were considered. However, the meta-analysis, which primarily focused on children’s self-reported anxiety scales, did not yield a statistically significant pooled effect. This divergence is likely attributable to differences in outcome measurement. Previous meta-analyses [-] that reported significant reductions in anxiety typically combined observer-rated assessments with children’s self-reports, whereas our analysis distinguished between the two. This distinction reflects the view that anxiety, as an inherently subjective emotional experience, is best captured through the individual’s own perspective [,]. The nonsignificant result observed in our analysis aligns with prior evidence showing discrepancies between observer- and self-reported measures [], underscoring the need for further investigation into how these differing perspectives capture children’s emotional experiences. The overall moderate certainty of evidence reflects methodological limitations identified in the included trials, particularly the risk of bias from nonblinded designs, inadequate statistical power, and reporting bias.
Furthermore, the CI reflects the average effect in this meta-analysis, while the wide PI illustrates the likely variation in true effects in future studies and clinical contexts [,]. The wide PI observed for anxiety suggests that the true effects of SARs may vary substantially across clinical contexts, indicating that while some settings may observe meaningful emotional benefits, others may experience null or even opposite effects. The statistical heterogeneity for anxiety and distress can be attributed to significant methodological and clinical context differences across the included trials. The studies varied widely in their clinical settings, study populations, intervention designs, and the specific features of SARs. Such variability likely reflects differences between included studies, rather than inconsistency in the underlying potential of SARs. This highlights the importance of contextual and implementation factors in shaping the emotional outcomes of SARs interventions. However, due to the limited number of studies, these findings should be interpreted with caution.
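The distinction between the CI and the PI drawn above can be made explicit with one commonly used formulation of the 95% PI. The notation is an assumption introduced for illustration: μ̂ is the pooled mean difference, SE(μ̂) its standard error, τ̂² the estimated between-study variance, and k the number of pooled studies. The degrees of freedom and the standard error used for the PI differ between software implementations, and the settings applied in this review are not restated here, so this sketch may not match the reported intervals exactly.

```latex
\text{95\% PI:}\quad
\hat{\mu} \;\pm\; t_{k-2,\,0.975}\,
\sqrt{\hat{\tau}^{2} + \widehat{\mathrm{SE}}(\hat{\mu})^{2}}
```

The CI depends only on the uncertainty of the average effect, whereas the PI adds the between-study variance τ̂². With substantial heterogeneity, as for anxiety (τ²=0.2172), the PI therefore extends well beyond the CI, while with τ² near 0, as for pain, the two intervals are of similar width.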
These contextual variations suggest that the effectiveness of SARs may be highly specific to a particular population, clinical context, or interaction mode. From a practical perspective, these findings emphasize the need for an approach grounded in real-world clinical contexts to ensure effective and meaningful integration of SARs into patient care. Overall, the evidence of SARs deployment for emotional support in pediatric hospital settings was limited, highlighting the need for more standardized trials to address these methodological and contextual variations.
Clinical and Practical Implications
The evidence from this review indicates that SARs represent an engaging and child-friendly adjunct for pain management in pediatric hospital settings. Our pooled results demonstrated a statistically significant reduction in pain, and the PI suggested that these benefits may be reproducible in similar clinical contexts. However, the current evidence for emotional outcomes remains limited and heterogeneous, emphasizing the need for caution in their implementation for psychosocial support.
The successful integration of SARs into clinical practice necessitates careful consideration of feasibility, ethical implications, and long-term sustainability. Clinically, SARs function primarily as assistants, supporting but not replacing human caregivers. Therefore, effective implementation requires comprehensive staff training in interaction protocols and hygiene management, alongside strong institutional support to ensure appropriate use and maximize clinical benefits. In addition, reliable technical support and regular maintenance are essential to sustain functionality, particularly in hospital settings that may have limited access to specialized technological personnel.
From an institutional perspective, performing a thorough cost-effectiveness analysis is essential. The initial acquisition costs of the SARs varied greatly and need to be considered alongside the ongoing maintenance costs of hardware and software. A staged approach to adopting innovative technologies, beginning with pilot studies to assess clinical feasibility and cost-effectiveness before expanding to broader use, can further facilitate the integration of SARs into health care settings.
Ethical Considerations
Ethical dimensions are critical for the implementation of SARs in pediatric hospital care, particularly regarding safety, privacy, and autonomy [,]. Only 4 of the 13 included studies addressed ethical considerations, primarily focusing on children’s physical and psychological safety [,,,]. The evidence currently offers limited insight into the broader ethical dimensions of human-robot interaction. Therefore, we expand on these ethical considerations here.
Beyond safety, privacy is a crucial issue, requiring secure data storage, parental consent, and adherence to data protection standards [,,]. Psychological considerations and autonomy also warrant attention [,]. While SARs can provide comfort and support, some children may experience fear or discomfort [,,]. These risks intersect with the question of autonomy, particularly as children’s interactions with robots may influence their social and emotional development.
The automation level of SARs varied across included studies; notably, 11 trials used hybrid or operator-guided systems. Such approaches may represent the safest balance between technological novelty and patient safety in current clinical practice [,,,].
Strengths and Limitations
The primary strength of this review lies in its rigorous, systematic approach, coupled with the integration of comprehensive contextual synthesis, cost considerations, and ethical dimensions. The meta-analysis also allowed us to quantify and interpret the effect of SARs statistically. Together, these elements provide a framework for understanding the application of SARs in real clinical practice.
However, several limitations should be acknowledged. The heterogeneity in methodological designs across included studies constrained the comparability of findings. The limited number of eligible trials also precluded subgroup analyses, which would have lacked statistical power. Although funnel plots were generated to visually assess potential asymmetry, the small number of eligible trials prevented a reliable formal assessment of small-study effects (Egger test), as statistical power is limited with few studies []. Last, the moderate certainty of evidence underscores the need for greater methodological rigor in future research. In summary, these factors suggest that while the findings offer meaningful insights, they should be interpreted with appropriate caution and contextual awareness.
Future Research Directions
To address the risk of bias concerns identified in this review, future RCTs should adhere to rigorous methodological and reporting standards. Larger, well-designed, and adequately powered studies are warranted to reduce imprecision and enhance generalizability. As participant and personnel blinding are inherently unfeasible in SARs interventions, alternative strategies are suggested to minimize observer and response bias. These may include the use of blinded outcome assessors, standardized intervention protocols, and integrating objective indicators (eg, physiological parameters, objective behavioral indicators, speech emotion recognition, or facial expression recognition) to mitigate human influence during assessment.
As pain and emotions are inherently subjective experiences, self-reported measures remain the most direct indicators. However, combining validated self-report instruments with objective or observer-based assessments may provide a more comprehensive and balanced understanding. Transparent reporting of contextual and procedural factors will further facilitate comparability and reproducibility.
Moreover, research may expand beyond mitigating negative emotions to explore how SARs promote positive emotional responses and evaluate multisession interventions to determine sustained effects. Technological development is also crucial for improving system robustness, minimizing technical failures, and enhancing the usability of the operation. Notably, integrating ethical considerations, including child autonomy, privacy, and data protection, is essential for responsible future research.
Conclusion
This systematic review and meta-analysis suggests that SARs have potential as a valuable adjunct for pain management in pediatric hospital care. The observed reduction in pain across comparable clinical contexts indicates that SARs may provide consistent and clinically meaningful benefits when appropriately implemented. In contrast, the evidence for their effects on emotional outcomes remains ambiguous. The wide PI observed for anxiety suggests that the effects of SARs may vary substantially across clinical contexts: while some children may experience emotional benefits, others may show null or even opposite effects, highlighting the important role of contextual factors in SARs implementation. The overall concerns about risk of bias underscore the need for methodological rigor in future research to consolidate the evidence base.
At present, SARs can be regarded as a promising nonpharmacological tool for pain management. Their ethical and effective integration into pediatric practice requires adherence to clear principles that prioritize child-friendly care. Moving forward, research should combine technological innovation with psychosocial intervention design to evaluate the cumulative effects of multisession SARs interactions and to explore their potential to enhance positive emotions, engagement, and resilience. Through such evidence-driven and ethically grounded development, SARs may evolve into a vital component of child-centered digital health, fostering more positive and supportive health care experiences for children.
Acknowledgments
The authors gratefully acknowledge the authors of the included studies who provided original data for the meta-analysis, which contributed significantly to the rigor and completeness of this review. This study was partially funded by the Ministry of Science and Technology, Taiwan (NSTC 113-2410-H-182-011-MY2), and Chang Gung Medical Foundation (CMRPD1N0342). We thank Dr Peter Pin-Sung Liu, Population Health Data Center, National Cheng Kung University, Tainan, Taiwan, for his assistance with statistical analyses and for providing valuable comments on the statistical methodology during the revision process. We also thank Ms Yi-hua Liu, Reference and Liaison Librarian for the College of Medicine, for advice on developing a detailed search strategy. We used the GenAI (generative artificial intelligence) tool ChatGPT by OpenAI to assist with English language editing. All outputs were subsequently reviewed and revised by the study team.
Data Availability
All data analyzed in this study are included in the paper. Further details are available from the corresponding author upon reasonable request.
Conflicts of Interest
None declared.
Edited by A Mavragani, S Brini; submitted 07.May.2025; peer-reviewed by D Poddighe, S Ali; comments to author 12.Sep.2025; accepted 24.Oct.2025; published 26.Nov.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Transactions of $10bn or more have hit an all-time record in 2025 after Donald Trump’s deregulatory push unleashed Wall Street’s animal spirits and a blitz of global dealmaking.
Naver’s $10.3bn all-stock acquisition of South Korea’s biggest crypto exchange Upbit on Wednesday took this year’s megadeal total to 63, topping the 2015 record, according to LSEG data on transactions since 1988.
The frenzy comes despite a sluggish start to the year after the US president’s “liberation day” tariffs sparked weeks of market volatility and deep uncertainty about interest rates and the global economic outlook.
“Companies are taking advantage of this window to pursue the larger transactions that they’ve long wanted to do and have been expected by the market,” said Ivan Farman, global co-head of mergers and acquisitions at Bank of America.
“When you see big deals being struck in your industry, you don’t want to be left out when the chess pieces move.”
Deals roared back in the second half of 2025 as CEOs pounced on once-in-a-generation transactions, including Union Pacific’s $85bn bid for Norfolk Southern, the $55bn Saudi-backed take-private of Electronic Arts, Anglo American’s $50bn merger with Teck and Kimberly-Clark’s $49bn takeover of Tylenol maker Kenvue.
Edward Lee, a corporate partner at Kirkland & Ellis, said CEOs and boards now had the “confidence and visibility” to chase “big strategic moves that they postponed for two years because of interest-rate uncertainty, inflation and the election”.
The greater visibility would allow deals that were previously hitting regulatory roadblocks to finally get done, Lee added.
The second-half deal blitz comes after Trump pulled back from a full-blown trade war with China and scaled back some of his most aggressive tariffs, all while doubling down on M&A-friendly measures, including relaxing antitrust rules.
“There’s a feeling right now in the current regulatory environment that there’s a chance to do larger-scale transactions that you may not have the opportunity to do again,” said Krishna Veeraraghavan, co-head of Paul Weiss’s M&A group.
The animal spirits have spread across sectors. Bank M&A surged as deals were approved at the fastest pace in more than three decades, while Big Pharma roared back, acquiring biotech assets to restock its drug pipelines. A boom in artificial intelligence spurred a wave of tech and data centre transactions.
“We’re seeing increased activity not just in tech, driven by a tsunami of money going into AI infrastructure, but also in healthcare, industrials, financial and other sectors,” said Drago Rajkovic, global co-head of M&A at Citigroup.
“Why are there so many large deals? There has been a lot of pent-up demand, a favourable regulatory environment and healthy balance sheets,” he added.
But M&A has been stronger among larger companies than smaller ones, a sign that deal activity remains uneven.
“Small deals are often harder to get done as they’re less interesting to buyers because they don’t move the needle. Fundamentally, smaller deals have lower returns, so there’s a trend towards our clients focusing on large transactions,” said Andrew Woeber, global head of M&A at Barclays.
As financial traders milled around 26 floors up in a tower in the Canary Wharf district of London, there was little sign of nerves ahead of Rachel Reeves’s second budget – until the surprise accidental early release of the government’s official economic analysis started to move markets.
Headline numbers from the Office for Budget Responsibility (OBR) flashed through on banks of computer screens, followed shortly by the detailed analysis itself.
“Boom! There’s your 200-pager,” said Will Marsters, a sales trader at Saxo UK, a trading platform that hosted the Guardian for the announcement. The leak triggered a race across trading desks in the City of London to understand the implications of the leaked forecasts – and laughter at the hapless forecaster.
Traders at Saxo UK gathered for the budget announcement. Photograph: Sean Smith/The Guardian
It was a chaotic start to the budget, but more important for financial investors and the Treasury was the reaction on currency and bond markets. The Labour government was desperate to avoid a repeat of the Liz Truss “mini-budget” debacle, when borrowing costs surged, eventually bringing about the downfall of the Conservative government.
The reaction on Wednesday was choppy, but not dramatic by the standards of the Truss government. The yield on the benchmark 10-year gilt – a measure of the cost of government borrowing – dropped quickly from 4.5% to about 4.42%. A few minutes later it was back up above 4.52%.
By the late afternoon yields had fallen back once more, to 4.4%. The declining borrowing cost over the day will likely be a relief for Reeves – and a sign that markets do not think lending money to the UK has become more risky.
“The tempered growth didn’t seem too optimistic, which eroded some of the risk premium,” said Marsters.
Graph showing dip in cost of borrowing over the day
Neil Wilson, an investor strategist at Saxo UK, said: “There’s no great stinging surprise that has upset markets. That has allowed it to be a bit of a relief.”
However, he wondered about the credibility of the forecasts: governments often promise to tighten budgets in later years in order to make the sums add up. With elections expected around the same time, he said the prospect of welfare cuts or tax rises in four years’ time was remote.
“You’re saying we’re going to buy fiscal restraint by the end of the parliament,” Wilson said. “‘Don’t worry about welfare – we’ll sort it out’.”
‘Everyone was fearing the worst,’ said one trader at Saxo UK. Photograph: Sean Smith/The Guardian
The value of the pound also jumped in initially volatile trading after the OBR leak. It then fell as low as $1.3124, before recovering by late afternoon to $1.3229 – an increase of 0.5% for the day.
Mike Owen, another sales trader, said: “Everyone was fearing the worst, so the price action is, ‘Phew’. It’s such a minefield to try to get through it.”