Research published October 30 in Clinical Imaging found that ChatGPT demonstrates modest accuracy in assigning BI-RADS scores for mammograms and breast ultrasound exams.
A research team led by Marc Succi, MD, of Mass General Brigham in Boston found that two iterations of the large language model (LLM) assigned correct BI-RADS scores in roughly two out of three cases, performing best on high-risk BI-RADS 5 cases. The models struggled, however, when assigning lower BI-RADS categories, underscoring an area in need of further development.
“These findings provide breast radiologists with a valuable foundation for understanding the current capabilities and limitations of off-the-shelf LLMs in image interpretation,” Succi told AuntMinnie.com.
Previous studies suggest that LLMs can accurately recommend suitable imaging modalities based on a patient’s clinical presentation, and an earlier 2024 study showed they can determine BI-RADS categories from text-only imaging reports.
Succi and colleagues conducted a pilot study to explore whether ChatGPT-4 and ChatGPT-4o—the latter with multimodal processing—can generate accurate BI-RADS scores from mammographic and breast ultrasound images.
The team tested both models on 77 breast cancer images obtained from Radiopaedia.org, analyzing the images in separate sessions to mitigate bias in the assessment.
Across all BI-RADS cases, both ChatGPT-4 and ChatGPT-4o achieved an accuracy of 66.2%, but performance was not consistent across BI-RADS categories. The models did best on BI-RADS 5 cases, with accuracy rates of 84.4% for GPT-4 and 88.9% for GPT-4o. Conversely, both models scored 0% accuracy on BI-RADS 3 cases and also struggled with the lower BI-RADS 1 and 2 categories.
“The models were able to handle high-risk cases effectively but tended to overestimate the severity of lower-risk cases,” Succi explained.
When the grading errors for BI-RADS 1 to 3 were assessed, 64.2% of GPT-4’s misclassifications and 76.4% of GPT-4o’s were two grades higher than the correct score. Interrater agreement relative to the ground truth was 0.72 for GPT-4 and 0.68 for GPT-4o.
Both models also performed better with mammograms, achieving 67.6% accuracy, compared with 55.6% for ultrasound images.
Succi noted that the subtle differences among lower-risk cases may be difficult for LLMs to distinguish. “Additionally, the models might have been trained on datasets that contain more high-risk cases, potentially influencing their accuracy,” he added.
The research team continues to explore ways LLMs can support clinicians, with ongoing projects examining applications both within and outside radiology. “We’re particularly interested in applications of AI for patient triage and patient education,” Succi told AuntMinnie.com.
The complete study is accessible here.
**Interview with Dr. Marc Succi on the Use of ChatGPT for BI-RADS Assessment**
**Editor:** Thank you, Dr. Succi, for joining us today. Your recent research published in *Clinical Imaging* reports promising results on the use of ChatGPT for assigning BI-RADS scores. Can you summarize the main findings for our audience?
**Dr. Succi:** Thank you for having me. Our study examined the ability of ChatGPT-4 and its enhanced version, ChatGPT-4o, to assign BI-RADS scores to mammograms and breast ultrasounds. The models demonstrated a modest overall accuracy of 66.2% across BI-RADS categories. Notably, they performed considerably better on high-risk BI-RADS 5 cases, with accuracy rates of 84.4% for GPT-4 and 88.9% for GPT-4o.
**Editor:** That’s impressive! However, you mentioned challenges with lower BI-RADS categories. Can you elaborate on that?
**Dr. Succi:** Certainly. While the models excelled at identifying high-risk cases, they struggled significantly with lower BI-RADS categories—particularly BI-RADS 3, where they recorded 0% accuracy. This highlights the limitations of current AI models in certain contexts and suggests that more work is needed to improve their capabilities in this area.
**Editor:** What do you think this means for radiologists and the future of AI in medical imaging?
**Dr. Succi:** Our findings offer insight into both the potential and the limitations of using large language models for image interpretation. As AI continues to advance, these tools could assist radiologists by generating preliminary BI-RADS scores or recommending imaging modalities. However, they should be viewed as complementary tools that require oversight from trained professionals.
**Editor:** You reference previous studies suggesting the models can recommend imaging modalities based on clinical presentations. How do these findings fit into the larger picture of AI in healthcare?
**Dr. Succi:** These studies reinforce the idea that AI can contribute meaningfully to healthcare by processing vast amounts of data and supporting decision-making. However, as our study shows, these models still have limitations that must be addressed before they can be safely and effectively integrated into clinical practice.
**Editor:** Thank you, Dr. Succi, for your insights. It sounds like an exciting time for AI in radiology, albeit with challenges ahead.
**Dr. Succi:** Thank you! It’s vital we continue investigating these tools to optimize patient care while ensuring accuracy and safety in interpretations.