Research published October 30 in Clinical Imaging found that ChatGPT demonstrates modest accuracy in assigning BI-RADS scores for mammograms and breast ultrasound exams.
A research team led by Marc Succi, MD, of Mass General Brigham in Boston found that two iterations of the large language model (LLM) assigned correct BI-RADS scores in roughly two out of three cases, performing best on high-risk BI-RADS 5 cases. The models struggled, however, when assigning lower BI-RADS categories, underscoring an area in need of further development.
“These findings provide breast radiologists with a valuable foundation for understanding the current capabilities and limitations of off-the-shelf LLMs in image interpretation,” Succi told AuntMinnie.com.
Previous studies suggest that LLMs can accurately recommend suitable imaging modalities based on a patient’s clinical presentation, and an earlier 2024 study showed they can determine BI-RADS categories from text-only imaging reports.
Succi and colleagues conducted a pilot study to explore whether ChatGPT-4 and ChatGPT-4o—the latter with multimodal processing—can generate accurate BI-RADS scores from mammographic and breast ultrasound images.
The team tested both models on 77 breast cancer images obtained from Radiopaedia.org, analyzing the images in separate sessions to mitigate bias in the assessment.
Across all BI-RADS cases, both ChatGPT-4 and ChatGPT-4o achieved an accuracy of 66.2%, but performance was not consistent across BI-RADS categories. The models did best on BI-RADS 5 cases, with accuracy rates of 84.4% for GPT-4 and 88.9% for GPT-4o. Conversely, both models scored 0% accuracy on BI-RADS 3 cases and also struggled with the lower BI-RADS 1 and 2 categories.
“The models were able to handle high-risk cases effectively but tended to overestimate the severity of lower-risk cases,” Succi explained.
When the grading errors for BI-RADS 1 to 3 were assessed, 64.2% of GPT-4’s misclassifications and 76.4% of GPT-4o’s were two grades higher than the correct score. Interrater agreement relative to the ground truth was 0.72 for GPT-4 and 0.68 for GPT-4o.
Both models also performed better with mammograms, achieving 67.6% accuracy, compared with 55.6% for ultrasound images.
Succi noted that the subtle differences among lower-risk cases may be difficult for LLMs to distinguish. “Additionally, the models might have been trained on datasets that contain more high-risk cases, potentially influencing their accuracy,” he added.
The research team continues to explore ways LLMs can support clinicians, with ongoing projects examining applications both within and outside radiology. “We’re particularly interested in applications of AI for patient triage and patient education,” Succi told AuntMinnie.com.
The complete study is accessible here.
**Interview with Dr. Marc Succi on the Use of ChatGPT for BI-RADS Assessment**
**Editor:** Thank you, Dr. Succi, for joining us today. Your recent research published in *Clinical Imaging* reports promising results on the use of ChatGPT for assigning BI-RADS scores. Can you summarize the main findings for our audience?
**Dr. Succi:** Thank you for having me. Our study examined the ability of ChatGPT-4 and its enhanced version, ChatGPT-4o, to assign BI-RADS scores to mammograms and breast ultrasounds. The models demonstrated a modest overall accuracy of 66.2% across BI-RADS categories. Notably, they performed considerably better on high-risk BI-RADS 5 cases, with accuracy rates of 84.4% for GPT-4 and 88.9% for GPT-4o.
**Editor:** That’s impressive! However, you mentioned challenges with lower BI-RADS categories. Can you elaborate on that?
**Dr. Succi:** Certainly. While the models excelled at identifying high-risk cases, they struggled significantly with lower BI-RADS categories—particularly BI-RADS 3, where they recorded 0% accuracy. This highlights the limitations of current AI models in certain contexts and suggests that more work is needed to improve their capabilities in this area.
**Editor:** What do you think this means for radiologists and the future of AI in medical imaging?
**Dr. Succi:** Our findings offer insight into both the potential and the limitations of using large language models for image interpretation. As AI continues to advance, these tools could assist radiologists by generating preliminary BI-RADS scores or recommending imaging modalities. However, they should be viewed as complementary tools that require oversight from trained professionals.
**Editor:** You reference previous studies suggesting the models can recommend imaging modalities based on clinical presentations. How do these findings fit into the larger picture of AI in healthcare?
**Dr. Succi:** These studies reinforce the idea that AI can contribute meaningfully to healthcare by processing vast amounts of data and supporting decision-making. However, as our study shows, these models still have limitations that must be addressed before they can be safely and effectively integrated into clinical practice.
**Editor:** Thank you, Dr. Succi, for your insights. It sounds like an exciting time for AI in radiology, albeit with challenges ahead.
**Dr. Succi:** Thank you! It’s vital we continue investigating these tools to optimize patient care while ensuring accuracy and safety in interpretations.