AI Not Ready to Replace Radiologists Interpreting Chest X-Rays

— Commercial tools show low sensitivity in some cases

MedpageToday

Commercially available artificial intelligence (AI) tools were accurate to varying degrees in flagging chest x-ray abnormalities but turned up more false positives than radiology reports, a Danish study found.

Testing four CE-marked AI tools on real-world radiographs from the Copenhagen region, investigators reported areas under the receiver operating characteristic curve ranging from 0.83 to 0.88 for airspace disease, 0.89 to 0.97 for pneumothorax, and 0.94 to 0.97 for pleural effusion, using radiology reports as the reference.

Louis Plesner, MD, of the University of Copenhagen, Denmark, and coauthors found a wide range of sensitivity and specificity values for the AI tools:

  • Annalise Enterprise CXR (version 2.2): sensitivity 72% and specificity 86% for airspace disease; 90% and 98% for pneumothorax; 95% and 83% for pleural effusion
  • SmartUrgences (version 1.24 with high sensitivity threshold): sensitivity 91% and specificity 62% for airspace disease; 73% and 99% for pneumothorax; 78% and 92% for pleural effusion
  • ChestEye (version 2.6): sensitivity 80% and specificity 76% for airspace disease; 78% and 98% for pneumothorax; 68% and 97% for pleural effusion
  • AI-Rad Companion (version 10): sensitivity 79% and specificity 72% for airspace disease; 71% and 98% for pneumothorax; 80% and 92% for pleural effusion
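As a reminder of what these figures measure: sensitivity is the share of radiographs with a finding (per the reference standard) that a tool flags, and specificity is the share of radiographs without the finding that it correctly leaves unflagged. A minimal Python sketch follows; the counts are hypothetical, not taken from the study, and were chosen only so the result matches the 72% and 86% reported for Annalise Enterprise CXR in airspace disease.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of reference-positive cases that the tool flagged."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of reference-negative cases that the tool left unflagged."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for illustration only (not study data):
# 100 radiographs with airspace disease, 100 without.
sens = sensitivity(true_pos=72, false_neg=28)
spec = specificity(true_neg=86, false_pos=14)
print(f"sensitivity {sens:.0%}, specificity {spec:.0%}")
```

A tool can trade one measure for the other by moving its decision threshold, which is why the four products land at such different points on the same finding.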

"Among the AI tools examined in this study, we observed an acknowledgeable difference in the balance between sensitivity and specificity for the individual tools, which seems unpredictable. Therefore, when implementing an AI tool, it seems crucial to understand the disease prevalence and severity of the site and that changing the AI tool threshold after implementation may be needed for the system to have the desired diagnostic ability," the group wrote in Radiology.

"Furthermore, the low sensitivity observed for several AI tools in our study suggests that, like clinical radiologists, the performance of AI tools decreases for more subtle findings on chest radiographs," the study authors noted.
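The authors' point about disease prevalence can be made concrete with Bayes' rule: at fixed sensitivity and specificity, the share of flagged radiographs that truly have the finding (the positive predictive value) falls as the finding becomes rarer. The sketch below plugs in the SmartUrgences airspace disease figures quoted above (91% sensitivity, 62% specificity); the prevalence values are hypothetical and chosen only for illustration, not drawn from the study.

```python
def positive_predictive_value(sens: float, spec: float, prevalence: float) -> float:
    """Probability that a flagged case truly has the finding (Bayes' rule)."""
    true_pos = sens * prevalence                    # flagged and diseased
    false_pos = (1.0 - spec) * (1.0 - prevalence)   # flagged but not diseased
    return true_pos / (true_pos + false_pos)


# Sensitivity and specificity as reported for SmartUrgences (airspace disease).
# The prevalence values below are assumptions for illustration only.
for prevalence in (0.05, 0.20, 0.50):
    ppv = positive_predictive_value(0.91, 0.62, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}")
```

The same tool at the same threshold therefore produces a very different mix of true and false alarms at a low-prevalence site than at a high-prevalence one, which is one reason a site might retune the threshold after deployment.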

According to the American College of Radiology, in radiology.

"While AI tools are increasingly being approved for use in radiological departments, there is an unmet need to further test them in real-life clinical scenarios. AI tools can assist radiologists in interpreting chest x-rays, but their real-life diagnostic accuracy remains unclear," said Plesner in a press release.

In their study, the authors reported that in chest radiographs with four or more findings, AI's specificity dipped to the 27-69% range for airspace disease, 96-99% for pneumothorax, and 65-92% for pleural effusion.

Ultimately, Plesner stated that while the tools are useful, potentially providing a confidence boost for radiologists, they should not make diagnoses for patients autonomously.

Masahiro Yanagawa, MD, PhD, and Noriyuki Tomiyama, MD, PhD, both of the Osaka University Graduate School of Medicine in Japan, agreed, emphasizing the limits of AI in this setting.

"Given that anteroposterior chest radiographs and chest radiographs with multiple findings reduced the specificity of AI tools, radiologists should be aware of the limitations of the tools with respect to both sensitivity and specificity. Care must be taken not to overestimate the results of AI tools in such challenging cases," Yanagawa and Tomiyama wrote in an accompanying editorial.

For their retrospective study, Plesner and colleagues invited AI vendors to test their algorithms on real-world chest x-rays from four hospitals within the Copenhagen region. The radiographs had come from 2,040 consecutive adult patients (50.6% women; average age 72 years).

Four out of seven invited AI vendors agreed to participate and have their AI tools compared to radiologist-made clinical radiology reports for reference.

All four AI tools produced significantly more false positives than the radiology reports. For example, in identifying airspace disease, false positive rates ranged from 13.7% with the Annalise algorithm to 36.9% with SmartUrgences. For comparison, radiologists had a false positive rate of 11.6%.

Only the SmartUrgences algorithm, when tuned to high specificity, did not produce more false positives than radiologists in flagging pneumothorax and pleural effusion.

False negative rates varied widely depending on the finding and the AI tool.

One limitation was that radiologists had access to clinical information, lateral chest radiographs, and prior imaging that the AI tools did not, potentially giving them an "unfair advantage," the authors said. Other possible limitations of the study included the lack of AI evaluation of lateral chest radiographs and that the findings may not be applicable to non-hospital settings.

    Elizabeth Short is a staff writer for MedPage Today. She often covers pulmonology and allergy & immunology.

Disclosures

This study was supported by research grants from the Danish government.

Plesner reported a relationship with Siemens Healthineers. Coauthors disclosed relationships with Siemens Healthineers, Innovation Fund Denmark, Roche, Orion, Pharmacosmos, Novartis, Bavarian Nordic, Merck, Philips Healthcare, and Boehringer Ingelheim, and one coauthor is employed by Novo Nordisk.

Yanagawa disclosed grant support from the Japan Society for the Promotion of Science and the Japan Agency for Medical Research and Development. He is also associate editor for Radiology: Artificial Intelligence. Tomiyama had no disclosures.

Primary Source

Radiology

Plesner LL, et al "Commercially available chest radiograph AI tools for detecting airspace disease, pneumothorax, and pleural effusion" Radiology 2023; DOI: 10.1148/radiol.231236.

Secondary Source

Radiology

Yanagawa M, Tomiyama N "Clinical performance of current-generation AI tools for chest radiographs" Radiology 2023; DOI: 10.1148/radiol.232139.