A study by a team of researchers at teaching hospitals in New York State has found that images generated by widely used AI models to depict dermatologic conditions lack diagnostic accuracy and overwhelmingly underrepresent skin of color, raising concerns about the technology’s clinical readiness and potential to perpetuate health disparities.
First author Lucie Joerg, an MD candidate at Albany Medical College, and colleagues evaluated 4 generative AI models (Adobe Firefly, ChatGPT-4o, Midjourney, and Stable Diffusion) across 20 common dermatologic conditions and found that 89.8% of the 4,000 images generated featured light skin, with only 10.2% depicting dark skin.
Adobe Firefly was the only model to produce images aligned with US demographic data: 38.1% of its images depicted dark skin tones, a distribution that did not significantly differ from national census demographics (χ²(1) = 0.320, P = .572). In contrast, ChatGPT-4o, Midjourney, and Stable Diffusion showed statistically significant underrepresentation of darker skin (6.0%, 3.9%, and 8.7%, respectively; all P < .001).
The study also assessed whether the images could be correctly identified as the intended dermatologic diagnosis. Only 15% of images across all platforms were rated as diagnostically accurate by two blinded dermatology residents. Adobe Firefly had the lowest accuracy (0.94%), while ChatGPT-4o, Midjourney, and Stable Diffusion performed modestly better at 22%, 12.2%, and 22.5%, respectively.
“Given the rapid adoption of AI technologies in dermatology and the harm of homogeneous, erroneous outputs, this study addresses this knowledge gap by assessing how well popular text-to-image AI programs reflect skin color diversity and accurately depict skin conditions,” Joerg et al wrote.
The 4,000 study images were generated between June and July 2024 using a standardized prompt: “Generate a photo of a person with [skin condition].” Skin tones were evaluated using the Fitzpatrick scale and compared to US Census distributions using chi-square analysis. Two independent raters assessed the AI platforms’ diagnostic accuracy in a randomized 200-image subset, with inter-rater agreement calculated using the kappa statistic.
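The two statistics named in the methods can be sketched briefly. The snippet below is an illustrative reconstruction, not the authors' analysis code: it runs a chi-square goodness-of-fit test on an observed dark/light split against an expected census proportion (the proportion shown here is a placeholder, not the study's value), and computes Cohen's kappa for two raters by hand.

```python
from scipy.stats import chisquare

def skin_tone_chi2(n_dark, n_total, expected_dark_prop):
    """Goodness-of-fit test: observed dark/light image counts vs. an
    expected population split (1 degree of freedom)."""
    observed = [n_dark, n_total - n_dark]
    expected = [n_total * expected_dark_prop,
                n_total * (1 - expected_dark_prop)]
    return chisquare(observed, f_exp=expected)  # (statistic, pvalue)

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labeling the same items
    (e.g. accurate = 1, inaccurate = 0)."""
    n = len(ratings_a)
    # observed agreement
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # chance agreement from each rater's marginal label frequencies
    labels = set(ratings_a) | set(ratings_b)
    p_e = sum((ratings_a.count(l) / n) * (ratings_b.count(l) / n)
              for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical numbers for illustration only:
result = skin_tone_chi2(n_dark=381, n_total=1000, expected_dark_prop=0.4)
kappa = cohen_kappa([1, 0, 1, 0], [1, 0, 0, 0])
```

Kappa corrects raw percent agreement for the agreement two raters would reach by chance, which is why it is the conventional choice for inter-rater reliability here.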
Technical and Ethical Concerns
The findings highlight both technical and ethical concerns, the authors cautioned. They note that AI-generated dermatologic content may reinforce clinician bias and limit diagnostic accuracy in patients with skin of color—populations already underserved in dermatology. They argue that the clinical utility of AI-generated images depends on diverse training datasets and stronger regulatory oversight to ensure equity in medical AI.
“AI research in dermatology is still in its infancy and encounters a myriad of challenges,” authors of a recent review on the future of the technology wrote in Frontiers in Medicine. “Robust, transparent, and equitable AI algorithms are needed in order to truly enhance patient care without introducing new problems.”
Joerg and fellow researchers echo the concern, pointing out that while generative AI has undeniable potential in medicine, “without prompt action to ensure inclusive and accurate datasets, these technologies risk failing the communities they aim to serve,” they wrote. “As AI shapes the future of healthcare, there is a responsibility to uphold fairness and equitable representation. Only through deliberate and diverse design can AI fulfil its promise as a tool for universal health equity.”
References
- Joerg L, Kabakova M, Wang JY, et al. AI-generated dermatologic images show deficient skin tone diversity and poor diagnostic accuracy: An experimental study. J Eur Acad Dermatol Venereol. Published online July 16, 2025. doi: 10.1111/jdv.20849
- Omiye J, Gui H, Daneshjou R, Cai ZR, Muralidharan V. Principles, applications, and future of artificial intelligence in dermatology. Front Med (Lausanne). 2023;10:1278232. doi: 10.3389/fmed.2023.1278232