ChatGPT-4 (OpenAI) had an overall “fair” performance when answering multiple-choice ophthalmic questions related to multimodal imaging.
A Canadian study, led by first author Andrew Mihalache, MD, of the Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada, reported that ChatGPT-4 (OpenAI) had an overall “fair” performance when answering multiple-choice ophthalmic questions related to multimodal imaging.1
Correct interpretation of clinical images is at the heart of ophthalmology and is essential to ensure appropriate treatment. With the rapid development of artificial intelligence (AI) worldwide, the accuracy of technologies such as chatbots is imperative.
The authors commented on the importance of this technology: “Ophthalmology is reliant on effective interpretation of multimodal imaging to ensure diagnostic accuracy. Multimodal imaging enhances patient outcomes through earlier and more precise diagnoses, and more effective follow-up visits and treatments.2,3 The new release of the chatbot holds great potential in enhancing the efficiency of ophthalmic image interpretation, which may reduce the workload on clinicians, mitigate variability in interpretations and errors, and ultimately, lead to improved patient outcomes.”
They conducted a cross-sectional study to evaluate how ChatGPT-4 performed when processing imaging data. A publicly available dataset of ophthalmic cases, OCTCases, a medical education platform from the Department of Ophthalmology and Vision Sciences at the University of Toronto, was used. A total of 137 cases were available, 99% of which had multiple-choice questions, the authors explained. The study’s primary outcome was the accuracy of the chatbot in answering these questions related to image recognition.
Among the 136 cases that contained multiple-choice questions, the chatbot was tasked with fielding 429 multiple-choice questions; 448 images also were included in the analysis.
“The chatbot answered 299 multiple-choice questions correctly across all cases (70%). The chatbot’s performance was better on retina questions than neuro-ophthalmology questions (77% vs 58%; difference = 18%; 95% confidence interval [CI], 7.5%-29.4%; χ²₁ = 11.4; P < 0.001),” Dr. Mihalache and colleagues reported.
They also found that the chatbot did better answering non–image-based questions compared with image-based questions (82% vs 65%; difference = 17%; 95% CI, 7.8%-25.1%; χ²₁ = 12.2; P < 0.001).
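For readers curious how comparisons of this kind are computed, the short sketch below runs a standard 2 × 2 chi-square test of two proportions in Python using scipy. Because the article reports only percentages, the per-category question counts in the code are hypothetical stand-ins chosen to roughly approximate the reported 77% versus 58% split for retina versus neuro-ophthalmology questions; they are not the study’s data.

# Illustrative sketch only: the counts below are hypothetical stand-ins,
# chosen to roughly match the reported 77% vs 58% accuracy split.
from scipy.stats import chi2_contingency

retina_correct, retina_total = 100, 130   # hypothetical counts
neuro_correct, neuro_total = 38, 65       # hypothetical counts

# 2 x 2 table: rows = question category, columns = correct / incorrect
table = [
    [retina_correct, retina_total - retina_correct],
    [neuro_correct, neuro_total - neuro_correct],
]

chi2, p, dof, _ = chi2_contingency(table, correction=False)
diff = retina_correct / retina_total - neuro_correct / neuro_total
print(f"difference = {diff:.1%}, chi-square({dof} df) = {chi2:.1f}, p = {p:.4f}")

The same approach applies to the image-based versus non–image-based comparison; only the cell counts change.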
Finally, the chatbot showed intermediate performance on questions in ocular oncology (72% correct), pediatric ophthalmology (68% correct), uveitis (67% correct), and glaucoma (61% correct).
The authors concluded, “In this study, the recent version of the chatbot accurately responded to most multiple-choice questions pertaining to ophthalmic cases requiring multimodal input from OCTCases, albeit performing better on questions that did not rely on ophthalmic imaging interpretation. As multimodal large language models become increasingly widespread, it remains imperative to continuously stress their appropriate use in medicine and highlight concerns surrounding confidentiality and bioethics. Future studies should continue investigating the chatbot’s ability to interpret different ophthalmic imaging modalities to gauge whether it can eventually become as accurate as specific machine learning systems in ophthalmology. Future work should also evaluate the chatbot’s ability to interpret ophthalmic images that are not publicly accessible.”