Providing clinicians with artificial intelligence (AI) predictions along with model explanations can boost diagnostic accuracy, but that accuracy drops when using a biased AI model -- and explanations don't mitigate those negative effects, according to a randomized clinical vignette survey study.
Clinicians who were asked to differentiate among pneumonia, heart failure, and chronic obstructive pulmonary disease (COPD) had a baseline accuracy of 73% (95% CI 68.3-77.8), which rose to about 76% when using AI without explanations. Adding an explanation of the model's reasoning -- intended to help mitigate any errors it may make -- brought accuracy up to about 78%, according to Michael Sjoding, MD, of University of Michigan Health in Ann Arbor, and colleagues.
However, when using a systematically biased AI model, diagnostic accuracy fell to about 62% -- a decline that wasn't fixed by adding model explanations, which only restored accuracy to about 64%, they reported in JAMA.
"AI is being developed at an extraordinary rate, and our study shows that it has the potential to improve clinical decision making," co-author Sarah Jabbour, MSE, also of the University of Michigan, told 51˶ in an email. "But we must be thoughtful about how to carefully integrate AI into clinical workflows with the goal of improving clinical care while not introducing systematic errors or harming patients."
Jabbour and colleagues explained that recent regulatory guidance has called for AI models to include explanations that may help mitigate errors made by models, but the effectiveness of this strategy hasn't been established.
To dig deeper into the issue, they designed a randomized clinical vignette survey study in which clinicians analyzed vignettes of patients hospitalized with acute respiratory failure and diagnosed pneumonia, heart failure, or COPD as the underlying cause.
To establish baseline diagnostic accuracy, clinicians -- doctors, nurse practitioners, and physician assistants -- were shown two vignettes without AI input. They were then randomized to see six vignettes with AI input, with or without model explanations. Three of these vignettes included standard-model predictions, while three included systematically biased predictions.
A final vignette with support from an "expert clinician" consult established an upper bound for diagnostic accuracy, which in this case was about 81%.
Overall, 457 clinicians from 13 states participated between April 2022 and January 2023, with 231 randomized to model predictions without explanations, and 226 randomized to also get explanations. Median participant age was 34 and about 58% were female.
The vignettes included presenting symptoms, physical examination findings, laboratory results, and chest radiographs. The biased AI models were deliberately skewed to base their predictions on age, weight, or radiograph adjustments, and the explanations accompanying those biased models were written to try to reveal the potential biases, the researchers said.
"In our study, explanations were presented in a way that was considered to be obvious, where the AI model was completely focused on areas of the chest X-rays unrelated to the clinical condition," Jabbour told 51˶. "We hypothesized that, if presented with such explanations, the participants in our study would notice that the model was behaving incorrectly and not rely on its predictions."
"This was surprisingly not the case," she said, "and the explanations when presented alongside biased AI predictions had seemingly no effect in mitigating clinicians' over-reliance on biased AI."
AI models can reflect biases that already exist in the healthcare system and can lead to worse overall clinical decision making, Jabbour said. For example, a model trained on a dataset including women who were underdiagnosed for heart failure might identify female patients as being at lower risk for heart failure, especially if clinicians rely heavily on the predictive models without further explanation, she said.
One limitation of the study was its reliance on a web-based survey interface, which is inherently different from the clinical setting. The study also did not include radiologists, who have more training in reading diagnostic imaging, the researchers noted.
In an accompanying editorial, Rohan Khera, MD, MS, of the Yale School of Medicine in New Haven, Connecticut, and colleagues said the results suggest that "a more careful approach to evaluating AI tools is warranted before their rapid adoption, even when AI is used as assistive technology."
"Even in controlled settings, without the usual pressures on time, clinicians favored automated decision-making systems, relying on the AI-based tool, despite the presence of contradictory or clinically nonsensical information," Khera and colleagues wrote.
"If a model performs well for certain patients or in certain care scenarios, such automation bias may result in patient benefit in those settings," they wrote. "However, in other settings where the model is inaccurate -- either systematically biased or due to imperfect performance -- patients may be harmed as clinicians defer to the AI model over their own judgment."
Disclosures
The study was funded by the National Heart, Lung, and Blood Institute (NHLBI).
The study authors reported financial relationships with the U.S. Department of Energy, Toyota Research Institute, the National Science Foundation, the Alfred P. Sloan Foundation, Machine Learning for Healthcare, and Airstrip.
Editorialists reported financial relationships with the Doris Duke Charitable Foundation, Bristol Myers Squibb, Novo Nordisk, Evidence2Health, Ensight-AI, Johnson & Johnson, the Medical Devices Innovation Consortium, and Arnold Ventures.
Primary Source
JAMA
Jabbour S, et al "Measuring the impact of AI in the diagnosis of hospitalized patients: A randomized clinical vignette survey study" JAMA 2023; DOI: 10.1001/jama.2023.22295.
Secondary Source
JAMA
Khera R, et al "Automation bias and assistive AI: Risk of harm from AI-driven clinical decision support" JAMA 2023; DOI: 10.1001/jama.2023.22557.