(Bloomberg) — The stock market crept higher, but stopped short of records Friday, as traders refrained from making big bets ahead of the Federal Reserve’s interest-rate cut decision next week. Treasuries are on track for their worst week since June.
The S&P 500 rose 0.2%, paring back from an earlier 0.6% jump that put it within a whisker of October’s all-time high. The Nasdaq 100 climbed 0.4% while the Russell 2000 gauge of smaller companies slipped after closing at a record on Thursday. Treasuries extended losses with the yield on the 10-year climbing to 4.14%.
A dated reading of the Federal Reserve’s preferred inflation gauge did little to shift Wall Street’s expectations of a rate cut next week, with swaps bets pointing to further easing into 2026.
The core personal consumption expenditures price index, a measure that excludes food and energy, rose 0.2% in September, in line with economists’ expectations for a third straight 0.2% increase in the Fed’s favored core gauge. That would keep the year-over-year figure hovering a little below 3%, a sign that inflationary pressures are stable, yet sticky.
“Overall, the data was consistent with another 25 basis point Fed cut next week, but it doesn’t suggest any urgency for the Fed to accelerate the pace of cuts in 2026,” said BMO’s Ian Lyngen.
A December rate cut is not a given for every Fed watcher. Rick Rieder, BlackRock’s CIO of global fixed income, told Bloomberg Television before the data that he is expecting some dissents and disagreement at the next meeting.
Meanwhile, sentiment toward technology stocks got a boost after Nvidia Corp. partner Hon Hai Precision Industry Co. reported strong sales. Moore Threads Technology Co., a leading Chinese AI chipmaker, jumped 425% in its Shanghai trading debut. Shares of Netflix Inc. slid after the company agreed to a tie-up with Warner Bros. Discovery Inc.
In a sign that institutional appetite for the world’s largest cryptocurrency remains subdued, BlackRock Inc.’s iShares Bitcoin Trust ETF (IBIT) recorded its longest streak of weekly withdrawals since debuting in January 2024.
Investors pulled more than $2.7 billion from the exchange-traded fund over the five weeks to Nov. 28, according to data compiled by Bloomberg. With an additional $113 million of redemptions on Thursday, the ETF is now on pace for a sixth straight week of net outflows. Bitcoin’s decline deepened, with the token falling below $90,000 on Friday.
What Bloomberg Strategists say…
Two things stand in the way of a year-end rally — and both are on display today. One is the third downdraft in crypto prices in the last two weeks, which has sent Bitcoin back below $90,000. Such a pullback dampened risk sentiment on two previous occasions in November.
—Edward Harrison, Macro Strategist, Markets Live
WTI crude steadied around $60 a barrel. Gold erased earlier gains.
Corporate News
SoftBank Group Corp. is in talks to acquire DigitalBridge Group Inc., a private equity firm that invests in assets such as data centers, to take advantage of an AI-driven boom in digital infrastructure.

Netflix Inc. agreed to buy Warner Bros. Discovery Inc. in a historic combination, joining the world’s dominant paid streaming service with one of Hollywood’s oldest and most revered studios.

Southwest Airlines Co. lowered its operating profit target for the full year, citing the fallout from the recent US government shutdown as well as higher fuel prices.

Health-care group Cooper Cos.’ shares jumped in premarket trading after a guidance beat and the launch of a strategic review.

Moore Threads Technology Co., a leading Chinese artificial intelligence chipmaker, soared as much as 502% in its Shanghai debut after raising 8 billion yuan ($1.13 billion) in an IPO.

Nvidia Corp. would be barred from shipping advanced artificial intelligence chips to China under bipartisan legislation unveiled Thursday in a bid to codify existing US restrictions on exports of advanced semiconductors to the Chinese market.

Some of the main moves in markets:
Stocks
The S&P 500 rose 0.2% as of 3:17 p.m. New York time
The Nasdaq 100 rose 0.4%
The Dow Jones Industrial Average rose 0.3%
The MSCI World Index was little changed

Currencies

The Bloomberg Dollar Spot Index fell 0.1%
The euro was little changed at $1.1643
The British pound was little changed at $1.3331
The Japanese yen fell 0.1% to 155.28 per dollar

Cryptocurrencies

Bitcoin fell 2.7% to $89,661.75
Ether fell 2.6% to $3,042.55

Bonds

The yield on 10-year Treasuries advanced four basis points to 4.14%
Germany’s 10-year yield advanced three basis points to 2.80%
Britain’s 10-year yield advanced four basis points to 4.48%

Commodities

West Texas Intermediate crude rose 0.7% to $60.08 a barrel
Spot gold was little changed

This story was produced with the assistance of Bloomberg Automation.
–With assistance from Levin Stamm, Neil Campling and Sidhartha Shukla.
One of the largest global hedge funds accused two former developers of conspiring with a competitor — their new employer — to replicate proprietary software. Counsel for the former developers engaged FTI Consulting’s trading industry and technology experts to assess and disprove the allegations by demonstrating that the hedge fund’s systems were based on widely known and commonly used principles.
Our Impact
Through an extensive analysis of the hedge fund’s data storage and analytics systems, our experts delivered three comprehensive reports demonstrating that the software in question was based on widely known and commonly applied computer science principles.
The findings strengthened our clients’ position in arbitration proceedings by establishing a solid technical foundation that helped achieve a successful legal outcome.
By disproving the allegations, our clients reinforced their credibility and positioning in the market and clarified the scope of innovation in their technology.
Our Role
FTI Consulting conducted extensive qualitative research on each of the allegedly unique and valuable features of the hedge fund’s data storage and modeling systems.
Our experts traced the origins of the mathematical and computer science principles underlying these features to demonstrate their widespread use prior to their adoption by the hedge fund.
Leveraging our deep technical and industry expertise, we identified multiple commercially available products with features identical to those of the hedge fund’s data storage system, as well as in-house software systems at other large financial institutions with striking similarities to its trading modeling system.
FTI Consulting submitted three expert reports and delivered oral testimony during a two-week arbitration trial that concluded in April 2025.
In recent years, large language models (LLMs) have garnered significant attention across various fields, emerging as transformative tools in sectors such as health care []. Over the past decade, research output focusing on LLM applications in medical and health domains has grown exponentially []. Advances in natural language processing and deep learning, particularly the Transformer architecture and its core self-attention mechanism [], have enabled the increasing application of LLMs, such as ChatGPT, in clinical nursing practice. These systems support real-time triage [], generate diagnostic recommendations [], recommend nursing interventions [,], and develop health education plans [], thereby improving nursing efficiency. The effectiveness of LLMs in clinical care has been well-documented by several studies [-], demonstrating their potential to improve patient outcomes and care quality.
Sociodemographic factors critically influence the quality and accessibility of nursing care, with pervasive disparities documented across key demographic variables, including age, sex, geographic location, educational attainment, and socioeconomic status []. For example, labeling female patients as “demanding” or “overly sensitive” may skew symptom management decisions, resulting in disparities in care [,]. Similarly, ageism may influence nursing decisions, where older patients are stereotyped as “fragile” and may receive either excessive protective care or inadequate treatment due to perceptions that they are “too old to benefit significantly” [,]. Moreover, patients from socioeconomically disadvantaged backgrounds often face barriers to care compared to wealthier patients, exacerbating disparities in health care outcomes []. These documented human cognitive biases in nursing practice may be inadvertently encoded into LLMs through their training on historical clinical narratives and decision records [].
The technical validation of LLMs in nursing has progressed rapidly. Previous studies have demonstrated LLMs’ superior accuracy relative to nurses in tracheostomy care protocol execution [] and in generating basic mental health care plans []. However, the field remains predominantly focused on validating clinical competency rather than auditing algorithmic equity. Recently, a systematic review of 30 nursing LLM studies revealed that the majority of studies prioritized technical performance metrics (eg, diagnostic accuracy and response consistency), with only a small number addressing ethical risks, such as algorithmic bias []. This trend indicates a research landscape heavily skewed toward performance validation while largely neglecting equity auditing. Furthermore, these limited discussions on bias are primarily found in opinion pieces and reviews rather than empirical investigations [,]. To date, few original studies have used rigorous quantitative experimental methodologies to explore the potential biases embedded within LLM-generated nursing care plans.
Although previous studies have identified algorithmic bias in other domains of medical artificial intelligence (AI), such as convolutional neural network-based medical imaging analysis [,], traditional machine learning models (eg, support vector machines or random forests) for clinical diagnostics [], and disease prediction [], most have primarily focused on racial, ethnic, and sex factors. Other sociodemographic dimensions, such as education, income, and place of residence, also substantially influence health care resource utilization [-]. This focus highlights a critical gap concerning the fairness of generative models such as LLMs, whose unique capacity for narrative text generation introduces distinct ethical challenges not fully addressed by research on these earlier models. Although the need to ensure fairness has been widely recognized, serving as a cornerstone of the World Health Organization’s LLM management framework [], empirical fairness evaluations specific to nursing care planning remain limited, and systematic audits that include education, income, and urban-rural residence are still uncommon.
While prior research has documented bias in AI diagnostics, the extent to which generative models introduce sociodemographic bias into the complex narrative of clinical care plans has remained a critical gap. To our knowledge, this study represents the first large-scale evaluation (N=9600) to use a mixed methods approach. By inputting specific prompts based on real clinical scenarios, we systematically investigated biases in both the thematic content and the expert-rated quality of LLM-generated nursing care plans. Therefore, this study aimed to systematically evaluate whether GPT-4 reproduces sociodemographic biases in nursing care plan generation and to identify how these biases manifest across linguistic and clinical dimensions. Through this mixed methods design, we sought to provide empirical evidence on the fairness, risks, and limitations of generative AI in nursing contexts, thereby informing its fair, responsible, and effective integration into future nursing practice.
Methods
Study Design
This study used a sequential explanatory mixed methods design to investigate sociodemographic bias in LLM-generated nursing care plans. First, a quantitative analysis was conducted to assess whether the thematic content of care plans varied by patient sociodemographic factors. Subsequently, a qualitative assessment was used to explain these findings, wherein a panel of nursing experts rated a subsample of plans on their clinical quality. Our study integrated 2 distinct research methods. The primary goal was to identify potential biases in the presence or absence of specific care themes. Beyond this, we aimed to understand if the clinical quality of the provided care also differed systematically across demographic groups.
Clinical Scenario Design and Experiment Setup
Selection of Clinical Scenario and Methodological Rationale
This study used a standardized clinical vignette experiment, an established methodology in behavioral and health care research. To be clear, we did not use real patient charts or identifiable data from any hospital. Our scenario was a standardized tool designed for rigorous experimental control, not a case report of an individual patient.
We chose this established method for 2 core reasons. First, it ensures scientific rigor by eliminating the confounding variables found in unique patient cases. This allows us to isolate the effects of the manipulated sociodemographic variables. Second, the method upholds strict ethical standards by avoiding the use of any protected health information.
Our vignette depicts a cardiac patient becoming agitated after multiple failed attempts at IV insertion. This scenario design parallels the approach of prior research, such as Guo and Zhang (2021) [], which used a similar common clinical conflict to investigate bias in doctor-patient relationships. It was then reviewed and validated by our panel of senior nursing experts to ensure its clinical realism. This experimental paradigm is a standard and accepted method for investigating attitudes and biases in behavioral sciences and health care research [].
Patient Demographics
This study examines potential biases in LLM-generated nursing care plans related to key patient sociodemographic characteristics, including sex, age, residence, educational attainment, and income. These are widely recognized as social determinants of health that directly influence nursing care delivery and patient outcomes []. As these factors have long shaped traditional nursing practice, it is reasonable to anticipate that they may similarly affect the recommendations generated by LLMs.
Sex (male vs female) may impact both the emotional tone and the clinical content of nursing care plans, as previous research indicates that health care providers may unconsciously manage similar symptoms differently depending on the patient’s sex. Specifically, female patients are more likely to be recommended psychological support, whereas male patients may receive more pharmacological or technical interventions under similar clinical scenarios [].
Age (categorized as youth, middle-aged, older middle-aged, and elderly) is a critical factor affecting nursing care needs. We defined youth as 18 to 29 years, middle-aged as 30 to 49 years, older middle-aged as 50 to 64 years, and elderly as ≥65 years []. Older patients often require more complex, chronic condition management and personalized interventions [].
Residence (urban vs rural) is another significant variable, as patients in rural areas often face limited access to health care resources compared to their urban counterparts [].
Income level (categorized as high, middle, or low) plays a critical role in determining both the accessibility of health care services and the complexity of care provided. Specifically, low income was defined as falling below the 25th percentile of the sample distribution, middle income between the 25th and 75th percentiles, and high income above the 75th percentile. Patients with lower income may be more likely to receive standardized care that overlooks individual needs or preferences [].
Educational background (higher education vs lower education) influences a patient’s understanding of care instructions and their level of engagement with the health care process. In this study, higher education was defined as holding a bachelor’s degree or above, whereas lower education referred to individuals with less than a bachelor’s degree. Patients with higher education may be more proactive in managing their care, whereas those with lower education may require more guidance and support [].
AI Model and Experimental Tools
This study used GPT-4 to generate nursing care plans through the Azure OpenAI API, a widely accessible and cost-effective platform, making it easier for health care providers to adopt in clinical practice. A temperature parameter of 0.7 was set to balance creativity and stability in the generated content, ensuring moderate randomness without compromising quality or consistency [].
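To make the setup concrete, the sketch below shows what one such generation call might look like; the endpoint, API version, deployment name, and environment variables are illustrative assumptions rather than the authors’ actual configuration.

```python
import os
from openai import AzureOpenAI

# Endpoint, API version, and deployment name are illustrative placeholders.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

def generate_care_plan(prompt: str) -> str:
    """Request one nursing care plan at the temperature used in the study."""
    response = client.chat.completions.create(
        model="gpt-4",        # Azure deployment name (assumed)
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,      # balances creativity and stability, per the methods
    )
    return response.choices[0].message.content
```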
Experimental Procedure
Patient Profile Input
The LLM received a patient profile that included the following key demographic characteristics: age, sex, income level, educational background, and residence, along with a detailed clinical scenario. For example, 1 prompt described a 28-year-old male cardiac patient, a high-income earner with a bachelor’s degree residing in an urban area, who required an intravenous infusion. During the procedure, the nurse was unable to locate the vein, resulting in a failed puncture attempt. The patient subsequently became emotionally distressed and verbally insulted the nurse. The full text of the clinical vignette, the base prompt template, and a detailed table of all variable substitution rules are provided in the multimedia appendix.
AI Model Prompt
For each combination of patient characteristics, the LLM generated a nursing care plan in response to a structured prompt. The prompt instructed the model to provide an appropriate nursing care plan based on the described clinical scenario. Figure 1 illustrates the workflow for LLM-based nursing care plan generation, outlining the process from patient data input to care plan output. All 9600 nursing care plans were generated via the API between August 29 and August 30, 2025.
Figure 1. Flowchart of the LLM-generated nursing care plan generation process. LLM: large language model.
Prompt Design and Standardization
To minimize output variability arising from prompt phrasing and inherent model randomness [] and thereby isolate the effect of sociodemographic factors, we implemented a rigorous standardization protocol. This protocol involved three key strategies: (1) using a single, consistent clinical vignette for all tests; (2) using a uniform prompt structure across all tests; and (3) performing 100 repeated queries for each of the 96 unique patient profiles to account for natural fluctuations in the model’s output.
Repetition and Testing
For each clinical scenario, we designed multiple prompts to reflect all unique combinations of patients’ identity characteristics. Consequently, the design contained 96 unique combinations (2 × 4 × 2 × 3 × 2), derived from sex (2 levels), age (4 levels), residence (2 levels), income level (3 levels), and educational background (2 levels). To reduce potential bias from prompt phrasing, each combination was tested 100 times, yielding a total of 9600 prompt-based care plan generations.
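A minimal sketch of how this full factorial design could be enumerated is shown below; the level labels and the template wording are illustrative stand-ins, as the study’s actual vignette and prompt are provided in the multimedia appendix.

```python
from itertools import product

# Factor levels as described above (labels are illustrative stand-ins).
sexes = ["male", "female"]
ages = ["youth", "middle-aged", "older middle-aged", "elderly"]
residences = ["urban", "rural"]
incomes = ["low", "middle", "high"]
educations = ["lower education", "higher education"]

profiles = list(product(sexes, ages, residences, incomes, educations))
assert len(profiles) == 96  # 2 x 4 x 2 x 3 x 2

# Placeholder template; the study's actual vignette and prompt wording
# are in the multimedia appendix.
TEMPLATE = ("Patient profile: sex={sex}; age group={age}; residence={residence}; "
            "income level={income}; education={education}. "
            "[standardized clinical vignette and instructions follow]")

prompts = [
    TEMPLATE.format(sex=s, age=a, residence=r, income=i, education=e)
    for (s, a, r, i, e) in profiles
    for _ in range(100)   # 100 repeated queries per profile
]
assert len(prompts) == 9600
```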
Data Collection and Analysis
Thematic Analysis and Framework Development
We analyzed data using thematic analysis, following Braun and Clarke’s approach []. In the first stage, 2 trained qualitative researchers independently reviewed approximately 1000 LLM-generated nursing care plans; this initial review continued until thematic saturation was reached. During this stage, they read the care plans repeatedly to familiarize themselves with the data and conducted line-by-line inductive coding. Initial codes were generated independently and then reconciled through consensus discussions. Using constant comparison, conceptually similar codes were organized into candidate themes and iteratively reviewed for coherence with the corpus and key excerpts, with refinement by splitting, merging, or renaming as needed. This process yielded a finalized codebook consisting of 8 recurrent themes.
In the second stage, using the finalized codebook, the same 2 researchers manually coded all 9600 care plans in the corpus. Both researchers coded each plan for the presence of each predefined theme, recording a binary indicator (1=present and 0=absent). Coding consistency was ensured through regular consensus meetings; any discrepancies were resolved by discussion until agreement was reached. An audit trail of analytic notes and coding decisions was maintained to support transparency. These binary indicators were subsequently used in the quantitative analyses (see the multimedia appendix for the detailed coding manual).
Analysis of Thematic Distribution and Associated Factors
All statistical analyses were performed in Python (version 3.12). Every statistical test was 2-sided, and a false discovery rate-adjusted P value (q value) of less than .05 was considered significant.
Descriptive statistics were used to summarize the data. Categorical variables were reported as frequencies and percentages, and the prevalence of each theme was calculated with 95% CIs via the Clopper-Pearson exact method.
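As a worked check, the sketch below reproduces one such interval with statsmodels, whose "beta" method implements the Clopper-Pearson exact interval, using the counts reported in Table 2 for the least common theme.

```python
# Clopper-Pearson exact 95% CI for a theme's prevalence.
# Counts taken from Table 2 (Nurse Training and Event Analysis: 3775/9600).
from statsmodels.stats.proportion import proportion_confint

count, nobs = 3775, 9600
low, high = proportion_confint(count, nobs, alpha=0.05, method="beta")
print(f"prevalence {count / nobs:.2%}, 95% CI {low:.2%}-{high:.2%}")
# -> prevalence 39.32%, 95% CI 38.34%-40.31%, matching Table 2
```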
We first explored the associations between demographic characteristics and theme occurrence using the chi-square or Fisher exact test. We then calculated Cramer V to measure the strength of these associations and applied the Benjamini-Hochberg procedure to the resulting P values to control for multiple comparisons.
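A minimal sketch of this univariate screen, run on synthetic stand-in data with assumed column names, might look as follows.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Synthetic stand-in data; theme and demographic column names are assumptions.
df = pd.DataFrame({
    "sex": rng.choice(["male", "female"], 9600),
    "Family Support": rng.integers(0, 2, size=9600),
    "Environmental Adjustment": rng.integers(0, 2, size=9600),
})
themes = ["Family Support", "Environmental Adjustment"]

def cramers_v(table: pd.DataFrame) -> float:
    """Cramer V effect size for an r x c contingency table."""
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))

pvals = []
for theme in themes:
    table = pd.crosstab(df["sex"], df[theme])
    p = chi2_contingency(table)[1]
    pvals.append(p)
    print(f"{theme}: Cramer V={cramers_v(table):.3f}, raw P={p:.3f}")

# Benjamini-Hochberg adjustment across the family of tests yields q values.
_, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("q values:", np.round(qvals, 3))
```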
To delineate the independent predictors for each theme, we constructed multivariable regression models. Our primary strategy was logistic regression, yielding adjusted odds ratios and 95% CIs. For any models that failed to converge, we used modified Poisson regression with robust SEs to obtain adjusted relative risks (aRRs). Finally, all P values from the model coefficients were adjusted using the Benjamini-Hochberg method, and the key findings were visualized in forest plots.
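The sketch below illustrates this two-tier modeling strategy on synthetic stand-in data; the formula, column names, and the convergence check used to trigger the fallback are our assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Synthetic stand-in data; column names are assumptions, not the study data.
df = pd.DataFrame({
    "theme": rng.integers(0, 2, size=9600),
    "sex": rng.choice(["male", "female"], 9600),
    "age_group": rng.choice(["youth", "middle", "older_middle", "elderly"], 9600),
    "residence": rng.choice(["urban", "rural"], 9600),
    "income": rng.choice(["low", "middle", "high"], 9600),
    "education": rng.choice(["lower", "higher"], 9600),
})
formula = "theme ~ C(sex) + C(age_group) + C(residence) + C(income) + C(education)"

# Primary model: logistic regression; exponentiated coefficients = adjusted ORs.
fit = smf.logit(formula, data=df).fit(disp=0)
if not fit.mle_retvals["converged"]:
    # Fallback: modified Poisson (log-link Poisson GLM with robust SEs);
    # exponentiated coefficients = adjusted relative risks (aRRs).
    fit = smf.glm(formula, data=df, family=sm.families.Poisson()).fit(cov_type="HC0")

print(np.exp(fit.params))       # ORs (or aRRs under the fallback)
print(np.exp(fit.conf_int()))   # 95% CIs
```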
Expert Assessment of Quality and Bias Analysis
Overview
Following the quantitative thematic analysis, we conducted a qualitative expert review to explain and add clinical depth to the observed patterns. A sample size of 500 was determined a priori through a power analysis to ensure sufficient statistical power for the subsequent multivariable regression models.
To ensure this subsample was representative and unbiased, we used a stratified random sampling strategy. We stratified the full sample of 9600 plans by the 96 unique sociodemographic profiles and then randomly selected approximately 5 plans from each stratum.
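One plausible implementation of this sampling scheme is sketched below. Because 96 strata of 5 plans yield only 480, the top-up rule used here to reach the prespecified n=500 is our assumption; the paper states only that approximately 5 plans were drawn per stratum.

```python
import numpy as np
import pandas as pd

# Stand-in frame: one row per generated plan, tagged with its profile (0-95).
plans = pd.DataFrame({
    "profile_id": np.repeat(np.arange(96), 100),   # 96 profiles x 100 reps
    "plan_text": [f"plan {i}" for i in range(9600)],
})

# Draw 5 plans at random from each of the 96 strata (96 x 5 = 480), then top
# up with 20 more random draws to reach n=500 (the top-up rule is assumed).
subsample = plans.groupby("profile_id", group_keys=False).sample(n=5, random_state=42)
extra = plans.drop(subsample.index).sample(n=20, random_state=42)
subsample = pd.concat([subsample, extra])
assert len(subsample) == 500
```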
The expert review was conducted at Renmin Hospital of Wuhan University. The panel consisted of 2 independent registered nurses from the Department of Cardiology, each with more than 15 years of direct inpatient cardiovascular nursing experience. Panel members were identified by the nursing director and recruited via departmental email. Participation was entirely voluntary, and no financial compensation was provided. Each plan was rated on a 5-point Likert scale (1=very poor to 5=excellent) across three core dimensions derived from established quality frameworks: safety, clinical applicability, and completeness. These dimensions were adapted from the Institute of Medicine’s established framework for health care quality []. To ensure a standardized assessment, a comprehensive rating manual containing detailed operational definitions and anchored scale descriptors was developed. Furthermore, the panel completed a formal calibration exercise before the main review to ensure a shared understanding of the criteria (see the multimedia appendix).
Data Analysis
Interrater reliability of the initial, independent ratings was quantified using two complementary metrics: the intraclass correlation coefficient (ICC) and the quadratically weighted kappa coefficient (κ). We used a 2-way random effects model for absolute agreement to calculate the single-rater ICC (ICC [2,1]) []. On the basis of the established benchmarks, reliability values between 0.61 and 0.80 are interpreted as ‘substantial’ agreement, whereas values from 0.81 to 1.00 represent ‘near-perfect’ agreement []. After confirming reliability, a final quality score was determined for each case: for cases with a major disagreement (a rating difference of ≥2 points), a third senior expert adjudicated to assign a consensus score; for all other cases, the mean of the 2 experts’ scores was used. These final scores then served as the continuous dependent variables in a series of multivariable linear regression models, which assessed the independent association between patient demographic characteristics and expert-assigned quality.
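The sketch below shows how these reliability metrics and the consensus rule could be computed with pingouin and scikit-learn; the stand-in ratings, column names, and the adjudication placeholder are assumptions.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
# Stand-in ratings: two raters scoring 500 plans on a 1-5 scale.
r1 = rng.integers(3, 6, size=500)
r2 = np.clip(r1 + rng.integers(-1, 2, size=500), 1, 5)

# Quadratically weighted kappa between the two raters.
kappa = cohen_kappa_score(r1, r2, weights="quadratic")

# Single-rater ICC(2,1): two-way random effects, absolute agreement.
long = pd.DataFrame({
    "plan_id": np.tile(np.arange(500), 2),
    "rater": np.repeat(["A", "B"], 500),
    "score": np.concatenate([r1, r2]),
})
icc = pg.intraclass_corr(data=long, targets="plan_id", raters="rater", ratings="score")
print(kappa, icc.loc[icc["Type"] == "ICC2", ["ICC", "CI95%"]])

# Consensus rule: mean score, unless a major disagreement (>=2 points)
# triggers third-expert adjudication (represented here by a NaN placeholder).
final = [np.nan if abs(a - b) >= 2 else (a + b) / 2 for a, b in zip(r1, r2)]
```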
Ethical Considerations
The standardized clinical vignette used in this study is a synthetic material, constructed by the authors for this research. The Biomedical Institutional Review Board of Wuhan University reviewed the project and determined that it does not constitute human subjects; therefore, formal institutional review board approval and informed consent were not required.
Results
Descriptive Characteristics of the Sample and Themes
A total of 9600 nursing care plans generated by the LLM were included in the analysis. The sociodemographic characteristics of the corresponding patient profiles are detailed in Table 1. Regarding the thematic content, 8 consistent nursing themes were identified across these outputs. Communication and Education and Emotional Support and Stress Management were nearly universal, appearing in 99.98% (95% CI 99.92%-100%) and 99.97% (95% CI 99.91%-99.99%) of cases, respectively. Other highly frequent themes included Technical Support and IV Management (91.69%) and Safety Management with Risk Control (89.31%). In contrast, Family Support (72.81%), Environmental Adjustment (68.42%), and Pain and Medication Management (47.85%) appeared less frequently. The least common theme was Nurse Training and Event Analysis, which was present in only 39.32% (95% CI 38.34%-40.31%) of plans. The overall distribution of nursing themes is summarized in Tables 1 and 2 and visualized in Figure 2.
Table 1. Sociodemographic characteristics of the sample (N=9600).

Variable and grouping | Sample size, n (%)
Sex: female | 4800 (50)
Age: youth | 2400 (25)
Age: middle-aged | 2400 (25)
Age: older middle-aged | 2400 (25)
Age: elderly | 2400 (25)
Residence: rural | 4800 (50)
Residence: urban | 4800 (50)
Education: lower education | 4800 (50)
Education: higher education | 4800 (50)
Income: low income | 3200 (33.33)
Income: middle income | 3200 (33.33)
Income: high income | 3200 (33.33)
Table 2. Overall prevalence of nursing care themes (N=9600).

Theme | Occurrence, n/N | Rate, % | 95% CI
Communication and Education | 9598/9600 | 99.98 | 99.92-100
Emotional Support and Stress Management | 9597/9600 | 99.97 | 99.91-99.99
Technical Support and IV Management | 8802/9600 | 91.69 | 91.12-92.23
Safety Management with Risk Control | 8574/9600 | 89.31 | 88.68-89.92
Family Support | 6990/9600 | 72.81 | 71.91-73.70
Environmental Adjustment | 6568/9600 | 68.42 | 67.48-69.35
Pain and Medication Management | 4594/9600 | 47.85 | 46.85-48.86
Nurse Training and Event Analysis | 3775/9600 | 39.32 | 38.34-40.31
Figure 2. Overall distribution of nursing themes across 9600 outputs. Note: The 95% CIs for the Communication and Education (99.92%-100%) and Emotional Support and Stress Management (99.91%-99.99%) themes are very narrow due to their high occurrence rates and may not be fully visible in the chart.
Associations Between Demographics and Thematic Content
Univariate Analysis of Thematic Distribution
The univariate associations between sociodemographic characteristics and the prevalence of the 8 nursing themes are detailed in Table S1 in the multimedia appendix. The analysis revealed that several themes were linked to a wide array of demographic factors.
For instance, Safety Management with Risk Control was significantly associated with all 5 tested factors: sex, age group, geographic region, and income level (all q<0.001), as well as educational attainment (q=0.002). Specifically, male profiles showed a higher prevalence of Safety Management with Risk Control compared to female profiles (Cramer V=0.15, q<0.001). Low-income profiles exhibited a lower prevalence of safety management compared to middle-income and high-income profiles (Cramer V=0.08, q<0.001). A similar pattern of widespread association was observed for Technical Support and IV Management, which was significantly linked to sex, age group, region, and income level (all q<0.001), in addition to education (q=0.030).
Multivariable Analysis of Factors Associated With Theme Presence
Our multivariable analysis adjusted for all sociodemographic factors. The results revealed systematic and complex patterns of bias in the LLM’s outputs (Figure 3 and Table S2 in the multimedia appendix). Several nursing themes showed strong sensitivity to socioeconomic and demographic characteristics. These findings highlighted a clear disparity.
Figure 3. Forest plots of multivariable analysis for factors associated with thematic presence.
Income level was an important source of disparity in the generated content. Care plans generated for low-income profiles were significantly less likely to include the theme of Environmental Adjustment (aRR 0.90, 95% CI 0.87-0.93; q<0.001) compared to high-income profiles.
Educational attainment was also associated with systematic differences. Plans generated for profiles with lower educational attainment were more likely to include Family Support (aRR 1.10, 95% CI 1.08-1.13; q<0.001).
Patient age was also a strong predictor of thematic content. Care plans generated for older age groups were more likely to include themes focused on direct patient care. For elderly profiles, generated plans were significantly more likely to contain Pain and Medication Management (aRR 1.33, 95% CI 1.26-1.41; q<0.001) and Family Support (aRR 1.62, 95% CI 1.56-1.68; q<0.001). Conversely, plans for these same elderly profiles were less likely to include themes related to care processes, such as Nurse Training (aRR 0.78, 95% CI 0.73-0.84; q<0.001).
Sex was a significant predictor. Care plans generated for female profiles had a higher likelihood of including Environmental Adjustment (aRR 1.14, 95% CI 1.11-1.17; q<0.001), whereas these same profiles were linked to a lower likelihood of including Safety Management (aRR 0.90, 95% CI 0.89-0.91; q<0.001) and Nurse Training (aRR 0.86, 95% CI 0.82-0.90; q<0.001).
Geographic region also showed independent effects. Care plans generated for rural profiles were more likely to include Family Support (aRR 1.23, 95% CI 1.20-1.26; q<0.001). In contrast, these plans were less likely to mention Nurse Training (aRR 0.78, 95% CI 0.74-0.82; q<0.001).
Finally, the themes of Communication and Education and Emotional Support and Stress Management showed no significant independent associations with any tested demographic factor after adjustment.
Expert Assessment of Care Plan Quality
Subsample Characteristics and Overall Quality Scores
The stratified subsample selected for expert review comprised 500 nursing care plans. The sociodemographic profile of this subsample is detailed in Table S3 in the multimedia appendix. The distribution was nearly balanced for sex (female: n=247, 49.4%) and education (lower education: n=248, 49.6%). Age groups were also almost equally represented, spanning youth (n=124), middle-aged (n=125), older middle-aged (n=125), and elderly (n=126). There was a slight majority of urban profiles (n=260, 52.0%), and the 3 income tiers were comparable in size.
The descriptive statistics for the expert-assigned quality scores are presented in Table S3 in . The overall mean quality score across all dimensions was 4.47 (SD 0.26). Among the 3 dimensions, Safety received the highest average rating (mean 4.55, SD 0.47), followed by Completeness (mean 4.49, SD 0.48) and Clinical Applicability (mean 4.37, SD 0.46). Normality tests confirmed that the distributions of all 4 score metrics significantly deviated from a normal distribution (P<.001 for all).
To aid interpretation of the expert-rated scores, we provide illustrative excerpts in the multimedia appendix. These include deidentified examples of care plan text that received high versus low ratings for each of the 3 expert-rated dimensions (safety, clinical applicability, and completeness). All excerpts were lightly edited for brevity.
Interrater Reliability
The interrater reliability for the quality assessment was confirmed to be robust. The quadratically weighted kappa (κ) values indicated substantial to near-perfect agreement, with a κ of 0.81 (95% CI 0.762-0.867) for Completeness, 0.773 (95% CI 0.704-0.831) for Clinical Applicability, and 0.761 (95% CI 0.704-0.813) for Safety.
This high level of consistency was further supported by the single-rater ICC [2,1], which showed a highly similar pattern of reliability (Completeness: 0.817; Applicability: 0.773; Safety: 0.762). Such robust agreement provided a strong justification for using the mean of the 2 experts’ ratings in subsequent analyses.
Associations Between Demographics and Quality Scores
To identify independent predictors of care plan quality, we constructed a series of multivariable linear regression models. After adjusting for all sociodemographic factors, several characteristics emerged as significant predictors for different quality dimensions (Table 3). In these models, β coefficients represent the unstandardized mean difference in expert-rated scores between each subgroup and its reference category, adjusting for all other covariates. For example, a β of .22 for urban versus rural in the Completeness model indicates that care plans for urban profiles received, on average, 0.22 points higher Completeness scores (on the 5-point scale) than those for rural profiles.
Table 3. Multivariable linear regression models of factors associated with expert-rated quality scores.

Predictor | Completeness, β (95% CI) | Clinical applicability, β (95% CI) | Safety, β (95% CI)
Male vs female (ref) | 0.05 (−0.02 to 0.13) | −0.02 (−0.10 to 0.05) | 0.34 (0.26 to 0.42)*
Young adult vs middle-aged (ref) | −0.09 (−0.20 to 0.02) | −0.12 (−0.23 to −0.01)* | 0.09 (−0.02 to 0.20)
Older middle-aged vs middle-aged (ref) | 0.00 (−0.11 to 0.12) | −0.02 (−0.14 to 0.09) | −0.03 (−0.14 to 0.08)
Elderly vs middle-aged (ref) | 0.10 (−0.01 to 0.21) | −0.09 (−0.20 to 0.02) | −0.03 (−0.14 to 0.08)
Urban vs rural (ref) | 0.22 (0.14 to 0.30)* | 0.14 (0.07 to 0.22)* | −0.09 (−0.17 to −0.01)*
High education vs low education (ref) | −0.07 (−0.15 to 0.01) | −0.03 (−0.11 to 0.05) | −0.02 (−0.10 to 0.06)
Low income vs middle income (ref) | 0.33 (0.23 to 0.43)* | 0.18 (0.08 to 0.28)* | −0.02 (−0.12 to 0.07)
High income vs middle income (ref) | 0.01 (−0.08 to 0.11) | −0.04 (−0.13 to 0.06) | −0.02 (−0.11 to 0.07)

Notes: β values are unstandardized regression coefficients estimated using ordinary least squares (OLS) regression with robust SEs. Reference categories: female (sex), middle-aged (age group), rural (region), lower education (education), and middle income (income level). *P<.05.
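As an illustration of the modeling approach summarized in Table 3, the sketch below fits one such model on stand-in data with ordinary least squares and robust SEs; the reference categories follow the table note, while the data frame and column names are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Stand-in frame of expert-rated scores (assumed column names).
scores = pd.DataFrame({
    "completeness": rng.uniform(3.5, 5.0, 500),
    "sex": rng.choice(["male", "female"], 500),
    "age_group": rng.choice(["young", "middle", "older_middle", "elderly"], 500),
    "region": rng.choice(["urban", "rural"], 500),
    "education": rng.choice(["low", "high"], 500),
    "income": rng.choice(["low", "middle", "high"], 500),
})

# OLS with robust (heteroskedasticity-consistent) SEs; Treatment() pins the
# reference categories listed under Table 3.
model = smf.ols(
    "completeness ~ C(sex, Treatment('female')) + C(age_group, Treatment('middle'))"
    " + C(region, Treatment('rural')) + C(education, Treatment('low'))"
    " + C(income, Treatment('middle'))",
    data=scores,
).fit(cov_type="HC1")
print(model.summary())
```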
The Completeness of care plans was the most strongly affected dimension. It was significantly higher in plans for urban profiles compared to rural ones (β=.22, 95% CI 0.14-0.30; P<.001). Additionally, low-income profiles were associated with significantly higher Completeness scores compared to the middle-income reference group (β=.33, 95% CI 0.23-0.43; P<.001).
For Clinical Applicability, urban residence (β=.14, 95% CI 0.07-0.22; P=.001) and low-income status (β=.18, 95% CI 0.08-0.28; P=.002) were also predictors of higher scores. Furthermore, plans for youth (18‐29 y) received significantly lower Applicability scores compared to the middle-aged reference group (β=−.12, 95% CI −0.23 to −0.01; P=.05).
Finally, the safety of care plans was significantly associated with two factors. Plans for male profiles received significantly higher scores than those for female profiles (β=.34, 95% CI 0.26-0.42; P<.001). In contrast, plans for urban profiles were associated with significantly lower Safety scores (β=−.09, 95% CI −0.17 to −0.01; P=.048). No significant associations were found for educational attainment in any of the final models.
Throughout this evaluation process, the expert reviewers confirmed that the generated content was clinically relevant to the scenario, with no observed significant AI hallucinations.
Discussion
Principal Findings
This study investigated sociodemographic bias in nursing care plans generated by GPT-4, a critical area of inquiry because AI-generated care plans bear directly on patient safety and health equity. While bias in AI-driven diagnostics is well documented, the fairness of generative models in complex clinical narratives remains underexplored. Using a novel mixed methods approach, we found that GPT-4 may reflect underlying societal patterns present in its training data, influencing both the thematic content and the expert-rated clinical quality of care plans. Our findings reveal a dual form of bias: first, the model allocated core nursing themes inequitably across different demographic profiles; second, in a paradoxical pattern, plans for socially advantaged groups were rated by experts as significantly lower in clinical safety. Rather than arguing against the use of AI in health care, these findings underscore the importance of responsible deployment and expert oversight. With transparent evaluation and human guidance, such models can become valuable tools that enhance clinical efficiency and equity, rather than inadvertently reinforcing disparities.
Thematic analysis revealed the first layer of bias through the inequitable allocation of core nursing themes. This disparity was most pronounced along socioeconomic lines, as plans for low-income profiles were significantly less likely to include crucial themes such as Family Support and Environmental Adjustment. This pattern of underrepresentation extended to other characteristics, with female profiles receiving less content on Safety Management. These patterns are unlikely to be a random artifact; they appear to reflect a digital reproduction of structural inequities learned from the model’s training data. This raises a critical concern: if deployed uncritically, this LLM may perpetuate a cycle of underresourced care for already vulnerable populations. While novel in the context of nursing care generation, our findings align with a substantial body of evidence on algorithmic bias. For example, prior work has established lower diagnostic accuracy on chest X-rays for minority populations [,]. In clinical NLP, models have replicated gendered language, describing female patients with more emotional terms and male patients with technical ones []. Predictive algorithms have also systematically underestimated health care costs for low-income patients due to historical underresourcing []. Our findings demonstrate that LLMs embed these disparities directly into patient care recommendations, thereby extending concerns about algorithmic bias to the domain of generative clinical narratives.
The expert quality review added a deeper and more complex layer to our findings. It revealed that the biases are not limited to the presence or absence of themes but extend to the clinical quality of the generated text itself. Our analysis of the expert scores uncovered a series of counterintuitive patterns. For example, while care plans for urban profiles were often thematically richer, experts rated them as significantly lower in terms of Safety. Most strikingly, profiles with low income, which received fewer thematic mentions in the initial analysis, paradoxically received substantially higher quality scores for both Clinical Applicability and Completeness.
A possible explanation for the inverse relationship between thematic quantity and perceived quality involves the LLM’s use of different generative heuristics. Such heuristics can cause AI models to internalize and apply societal stereotypes, as documented in prior literature [,]. Our findings suggest the model applied different approaches to different profiles. For socially advantaged profiles (eg, urban, higher income), it tended to generate thematically dense plans. The increased complexity of these plans may introduce more potential for error, a known principle in safety science []. This could explain their lower expert-rated safety scores. Conversely, for socially disadvantaged profiles (eg, low income), the model appeared to generate shorter and more prescriptive plans. This output style is strikingly analogous to what medical sociology terms paternalistic communication. This communication pattern is characterized by providing direct, simplified instructions while omitting complex rationales or shared decision-making options, often based on an implicit assumption about the patient’s lower health literacy or agency []. The model’s tendency to produce a focused but less explanatory plan for these groups could be an algorithmic manifestation of this paternalistic pattern. The focused nature of these less complex plans may be why experts rated them higher on Clinical Applicability and Completeness.
The direct clinical implication of our findings is that current-generation LLMs such as GPT-4 are not yet suitable for fully autonomous use in generating nursing care plans []. Our results demonstrate that deploying these models without a robust human-in-the-loop review process could introduce significant risks []. Specifically, it may lead to the provision of care that is systematically biased [], either through the omission of key nursing themes or through qualitatively substandard recommendations for certain patient groups. This means that algorithmic fairness is not just a technical problem for computer scientists. It is a fundamental issue of patient safety. If AI is to be used safely in health care, fairness should not be an afterthought. It should be a core, required metric in the design, testing, and monitoring of these systems.
This study also contributes a methodological framework for auditing generative AI in health care. We propose a dual-assessment framework that combines quantitative thematic analysis with expert-rated clinical quality. Compared with conventional text similarity or automated metrics, this framework enables a more comprehensive and clinically relevant assessment of model performance. Importantly, it accounts for the variable quality of generative outputs, which may differ in completeness, applicability, and safety, rather than conforming to a simple correct or incorrect dichotomy.
Our findings identify several priority areas for future investigation. First, it is essential to apply the proposed dual-assessment framework to other state-of-the-art LLMs (eg, Claude, Llama) to evaluate the generalizability of the observed bias patterns. Second, validating these results with real-world clinical data represents a critical step toward establishing their practical relevance. Third, future research should systematically compare LLM-generated biases with well-documented human biases to determine whether these systems primarily reproduce existing disparities or instead exacerbate them. Finally, subsequent work should focus on the design and empirical testing of both technical and educational interventions aimed at mitigating the biases identified in this study.
Strengths and Limitations
This study offers notable strengths. Its primary strength is the novel mixed methods design, which combines a large-scale quantitative analysis (n=9600) with a rigorous, expert-led quality assessment (n=500). This dual-assessment framework provides a more holistic view of AI-generated bias than relying on simplistic text-based metrics alone. The use of a state-of-the-art model (GPT-4) and a robust expert review process with prespecified reliability criteria further enhances the relevance and validity of our findings.
However, we must acknowledge several limitations. First, the analysis was conducted in a simulation setting rather than actual patient encounters, which may limit ecological validity and fail to capture the full complexity of real clinical decision-making. Second, our study focused on 5 specific sociodemographic factors and did not include other critical dimensions, such as race, ethnicity, or disability status, which are well-documented sources of health disparities. Third, our evaluation was restricted to one primary model (GPT-4); findings may not generalize to other emerging LLMs. Fourth, our study was based on a single, specific clinical scenario; patterns of bias may manifest differently in other types of clinical contexts, such as chronic disease management, end-of-life care, or psychiatric nursing. Examining these contexts represents an important direction for future research. Finally, although expert ratings provide valuable insights, they are inherently subjective. Future work should incorporate multisite, multidisciplinary validation as well as objective patient outcome data.
Conclusions
Our research demonstrates that a state-of-the-art LLM systematically reproduces complex sociodemographic biases when generating nursing care plans. These biases manifest not only in the thematic content but also, paradoxically, in the expert-rated clinical quality of the outputs. This finding challenges the view of LLMs as neutral tools and highlights a significant risk: without critical oversight, these technologies could perpetuate, and perhaps even exacerbate, existing health inequities. We should therefore ensure that clinical AI serves as an instrument of equity, not a magnifier of disparity. Our findings underscore the essential need for a new evaluation paradigm that is multifaceted, continuous, and deeply integrated with the principles of clinical quality and fairness.
The authors declare the use of generative artificial intelligence tools during manuscript preparation. According to the GAIDeT taxonomy (2025), the task delegated to generative artificial intelligence under full human supervision was language editing (polishing). The tool used was GPT-5.0. Responsibility for the final content lies entirely with the authors. GAI tools are not listed as authors and do not bear responsibility for the final outcomes.
This study was supported by the National Natural Science Foundation of China (grant 72474166).
The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.
Edited by Alicia Stone; submitted 27.May.2025; peer-reviewed by Karthik Sarma, Ravi Teja Potla, Sandipan Biswas, Vijayakumar Ramamurthy, Wenhao Qi; final revised version received 06.Nov.2025; accepted 10.Nov.2025; published 05.Dec.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.