September 10, 2025
In a new paper, University of Washington researchers argue that a key standard for deploying medical AI is transparency — that is, using various methods to clarify how a medical AI system arrives at its diagnoses and outputs.
While debate rumbles about how generative artificial intelligence will change jobs, AI is already altering health care. AI systems are being used for everything from drug discovery to diagnostic tasks in radiology and clinical note-taking. A recent survey of 2,206 clinicians found that most are optimistic about AI’s potential to make health care more efficient and accurate, and nearly half of respondents have used AI tools for work.
Yet AI remains plagued with bugs, hallucinations, privacy concerns and other ethical quandaries, so deploying it for sensitive and consequential work comes with major risks. In a review article published Sept. 9 in Nature Reviews Bioengineering, University of Washington researchers argue that a key standard for deploying medical AI is transparency — that is, using various methods to clarify how a medical AI system arrives at its diagnoses and outputs.
UW News spoke with the paper’s three authors about what transparency means for medical AI: co-lead authors Chanwoo Kim and Soham Gadgil, both UW doctoral students in the Paul G. Allen School of Computer Science & Engineering, and senior author Su-In Lee, a professor in the Allen School.
What makes discussions of ethics in medical AI distinct from the broader discussions around AI ethics?
Chanwoo Kim: The biases built into AI systems and the risk of incorrect outputs are critical problems, especially in medicine, because they can directly impact people’s health and even determine life-altering outcomes.
The foundation for addressing those concerns is transparency: being open about the data, training, and testing that went into building a model. Knowing if an AI model is biased starts with understanding the data it was trained on. And the insights gained from such transparency can illuminate sources of bias and pathways for systematically mitigating these risks.
Su-In Lee: A study from our lab is a good example. During the height of the COVID-19 pandemic, there was a surge of AI models that took chest X-rays and predicted whether a patient had COVID-19. In our study, we showed that hundreds of these models were wrong: They claimed accuracy close to 99% or 100% on some data sets, but on external hospital data sets that accuracy dropped sharply, which indicates that these models fail to generalize in real-world clinical settings. We used a technique that revealed the models were relying on shortcuts: The corners of X-ray images sometimes contain different kinds of text marks, and we showed that the models were using these marks, which led them to inaccurate results. Ideally, we'd want the models to look at the medically relevant parts of the X-ray images themselves.
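To give a concrete sense of how that kind of shortcut can be caught, here is a minimal sketch of gradient-based saliency, one common Explainable AI technique, applied to a placeholder image classifier. It is not the lab's code: the untrained ResNet, the random stand-in image, and the corner and lung regions are all illustrative assumptions. The idea is simply that if the pixels a model treats as most important cluster in an image corner rather than in the lung fields, the model is probably keying on annotation marks instead of pathology.

```python
# Minimal sketch (not the paper's code): gradient-based saliency on a
# placeholder classifier, to check whether predictions depend on image
# corners (where text marks appear) or on the central lung region.
import torch
import torchvision

# Illustrative stand-ins: an untrained ResNet and a random "X-ray".
model = torchvision.models.resnet18(weights=None, num_classes=2).eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)

score = model(image)[0, 1]   # logit for the hypothetical "COVID-19" class
score.backward()             # gradients of that score w.r.t. input pixels

# Per-pixel importance: largest absolute gradient across color channels.
saliency = image.grad.abs().max(dim=1).values[0]

# Compare importance in a corner patch vs. the central (lung) region.
corner = saliency[:20, :20].mean().item()
center = saliency[60:160, 60:160].mean().item()
print(f"corner importance {corner:.4f} vs. central region {center:.4f}")
```

In practice, the same comparison would be run on real chest X-rays with a trained model and attribution methods more robust than raw gradients.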
Your paper brings up “Explainable AI” as a route to transparency. Can you describe what that is?
SL: Explainable AI as a field started about a decade ago, when people were trying to interpret the outputs from the new generation of complex, “black box” machine learning models.
Here’s an example: Imagine that a bank customer wants to know if they can get a loan. The bank will then use lots of data about that person, including age and occupation and credit score and so on. They’ll feed that data to a model, which will make a prediction about whether this person is going to pay off the loan. A “black box” model would let you see only the result. But if this bank’s model lets you see the factors that led to its decision, you can better understand the reasoning process. That’s the core idea of Explainable AI: to help people better understand AI’s process.
There are a variety of methods, which we explain in our review paper. What I described in the bank example is called a “feature attribution” method: it attributes a model’s output back to its input features.
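To make that concrete, here is a minimal sketch of feature attribution on a toy loan-approval model. The synthetic data, feature names and model choice are assumptions made for this example, not anything from the paper; it uses the open-source shap library to attribute one applicant’s prediction back to inputs such as age, income and credit score.

```python
# Minimal sketch: feature attribution for a toy loan-approval model.
# The data and model are synthetic stand-ins for a bank's real system.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic applicants: age, income, credit score, years in current job.
X = pd.DataFrame({
    "age": rng.integers(21, 70, 500),
    "income": rng.normal(60_000, 20_000, 500),
    "credit_score": rng.integers(300, 850, 500),
    "years_employed": rng.integers(0, 30, 500),
})
# Synthetic label: repayment loosely tied to credit score and income.
y = ((X["credit_score"] > 600) & (X["income"] > 40_000)).astype(int)

model = GradientBoostingClassifier().fit(X, y)

# Attribute one applicant's predicted outcome back to the input features.
explainer = shap.Explainer(model, X)
attribution = explainer(X.iloc[[0]])
print(dict(zip(X.columns, attribution.values[0])))
```

Each number says how much that feature pushed this applicant’s prediction up or down relative to a typical applicant, which is the kind of per-decision accounting a “black box” model on its own does not provide.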
How can regulation help with some of the risks of medical AI?
CK: In the United States, the FDA regulates medical AI under the Software as a Medical Device, or SaMD, framework. Recently, regulators have focused on coming up with a framework to enforce transparency. This includes making clear what an AI system is designed to do: stating its specific use cases and its standards for accuracy and limitations in real clinical settings, all of which depend on knowing how the model works. Also, medical AI is used in clinical settings, where conditions change dynamically and AI performance can fluctuate, so recent regulations also try to ensure that medical AI models are monitored continuously during deployment.
Soham Gadgil: New medical devices or drugs go through rigorous testing and clinical trials to be FDA approved. Having regulations that hold AI systems to similarly rigorous testing and standards is important. Our lab has shown that these models, even those that seem accurate in tests, don’t always generalize in the real world.
In my opinion, many of the organizations developing these models do not have incentives to focus on transparency. Right now, the paradigm is that if your model performs better on certain benchmarks — these sets of specific, standardized, public tests that AI organizations use to compare or rank their models — then it’s good enough to use, and it will probably get good adoption. However, this paradigm is incomplete, since these models can still hallucinate and generate false information. Regulation can help incentivize focusing on transparency along with model performance.
What role do you see clinicians playing in the adoption of AI transparency?
CK: Clinicians are critical in achieving transparency in medical AI. If a clinician uses an AI model to help with a diagnosis or treatment, then they are responsible for explaining the rationale behind the model’s predictions, because they are ultimately responsible for the patient’s health. So clinicians need to be familiar with the techniques behind AI models, and even with basic Explainable AI methods, so that they can understand how the models work: not perfectly, but well enough to explain the mechanism to patients.
SG: We collaborate with clinicians for most of our lab’s biomedical research projects. They give us insight into what we should be trying to explain. They tell us whether our Explainable AI solutions are correct, whether they’re applicable in health care, and ultimately whether these explanations will be useful for patients and clinicians.
What do you want the public to know about AI transparency?
SL: We should not just blindly trust what AI is doing. Chatbots hallucinate sometimes, and medical AI models make mistakes. Last year, in another paper, we audited five dermatology AI systems that you can easily get through an app store. When you see something strange on your skin, you take a picture, and the apps tell you whether it’s melanoma or not. Our work showed that the results were frequently inaccurate, much like the COVID-19 AI systems. We used a new type of Explainable AI technique to show why these systems failed in certain ways and what’s behind those mistakes.
SG: The first step toward using AI critically can be simple. For example, if someone uses a generative model to get preliminary medical information for some minor ailment, they could just ask the model itself to give an explanation. While the explanation might sound plausible, it should not be taken at face value. If the explanation points to sources, the user should verify that those sources are trustworthy and confirm that the information is accurate. For anything potentially consequential, clinicians need to be involved. You should not be asking ChatGPT whether you’re having a heart attack.
For more information, contact Lee at suinlee@cs.washington.edu.