Accuracy and Readability of Large Language Models' Responses to Frequently Asked Questions About 5 Common Interventional Oncological Procedures


Yarol R. C., Sarıoğlu O., Cantürk A., Gülcü A.

Türk Girişimsel Radyoloji Derneği 2025 Yıllık Kongresi & EVIS Uluslararası Ortak Toplantısı, Antalya, Turkey, 4-8 April 2025, pp. 69-71 (Full Text Paper)

  • Publication Type: Conference Paper / Full Text
  • City of Publication: Antalya
  • Country of Publication: Turkey
  • Page Numbers: pp. 69-71
  • Dokuz Eylül University Affiliated: Yes

Abstract

AIM: To evaluate the accuracy and readability of three large language models (ChatGPT-4.0, Google Gemini 1.5 Pro, and Claude 3.5 Sonnet) in responding to frequently asked questions about five interventional oncological procedures.

METHODS: Fifty frequently asked questions related to five commonly performed interventional oncology procedures, namely image-guided biopsy, radiofrequency ablation (RFA), transarterial chemoembolization (TACE), port placement, and transarterial radioembolization (TARE), were presented to the three large language models (ChatGPT-4.0, Google Gemini 1.5 Pro, and Claude 3.5 Sonnet). Responses were collected from their respective platforms and compiled into a comprehensive list. The question-and-answer compilations were then anonymized and provided to three interventional radiologists, who evaluated each answer for appropriateness and accuracy. Performance was quantified on a 7-point Likert scale (Figure 1), and interrater reliability was assessed with the intraclass correlation coefficient. Word and sentence counts, along with readability metrics including the Flesch Reading Ease score and the Flesch–Kincaid Grade Level, were calculated, as were the average score and standard deviation for each chatbot. Because the data were nonparametric, they were analyzed with the Kruskal-Wallis test.

RESULTS: On the 7-point Likert scale for appropriateness and accuracy, the responses generated by ChatGPT-4.0 had the highest average score (5.84), followed by Gemini 1.5 Pro (5.67) and Claude 3.5 Sonnet (5.65). The intraclass correlation coefficient between reviewers was 0.38, reflecting moderate interrater reliability. ChatGPT-4.0 had the highest average word count (280.12), followed by Google Gemini 1.5 Pro (254.4) and Claude 3.5 Sonnet (182.16); sentence counts averaged 21.34, 15.04, and 9.44, respectively. The Flesch Reading Ease scores were 30.61, 41.7, and 33.35, and the Flesch–Kincaid Grade Levels were 12.59, 11.84, and 13.66, respectively, indicating that all three models' responses are written at a college-level reading difficulty. Gemini had the highest readability, while ChatGPT-4.0 had the lowest.

CONCLUSIONS: Large language models generally provide accurate responses to frequently asked questions about interventional oncology procedures, but these responses often require a high reading level to comprehend fully. ChatGPT-4.0 provided the most appropriate and accurate responses, while Google Gemini 1.5 Pro delivered the most readable answers.
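The two readability metrics reported above follow standard published formulas. Below is a minimal Python sketch of how such scores can be computed; the abstract does not state which tool the authors used, and the syllable counter here is a rough vowel-group heuristic rather than the dictionary-based counting that dedicated readability software performs, so exact scores will differ slightly.

```python
import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels; drop one for a trailing silent 'e'.
    # This is a heuristic, not an exact phonetic count.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_metrics(text: str) -> tuple[float, float]:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    # Standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas.
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level

# Illustrative input only, not a response from the study.
response = ("Transarterial chemoembolization delivers chemotherapy "
            "directly into the artery that feeds a liver tumor.")
ease, grade = flesch_metrics(response)
print(f"Flesch Reading Ease: {ease:.2f}, Flesch-Kincaid Grade: {grade:.2f}")
```

Lower Reading Ease and higher Grade Level both indicate harder text; the scores reported in the abstract (Reading Ease 30.61-41.7, Grade Level 11.84-13.66) fall in the college-difficulty band on these scales.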
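The group comparison described in METHODS, a Kruskal-Wallis test across the three chatbots' Likert ratings, can be run as follows. This is a sketch using scipy; the rating lists are illustrative placeholders, not the study's data.

```python
from scipy.stats import kruskal

# Hypothetical 7-point Likert ratings for each chatbot's answers
# (placeholders for illustration only).
chatgpt_scores = [6, 5, 7, 6, 5, 6, 7, 5]
gemini_scores  = [5, 6, 6, 5, 6, 5, 7, 5]
claude_scores  = [5, 5, 6, 6, 5, 6, 5, 6]

# Kruskal-Wallis compares the rank distributions of the three groups
# without assuming normality; a small p-value would indicate that at
# least one chatbot's ratings differ from the others.
h_stat, p_value = kruskal(chatgpt_scores, gemini_scores, claude_scores)
print(f"Kruskal-Wallis H = {h_stat:.3f}, p = {p_value:.3f}")
```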