A large language model chatbot outperformed glaucoma specialists in diagnostic and treatment accuracy for glaucoma cases.
A large language model (LLM) chatbot outperformed glaucoma specialists and matched retina specialists in accuracy when presented with deidentified glaucoma and retina cases and questions, according to a study published in JAMA Ophthalmology. The finding suggests that such chatbots could serve as diagnostic tools in the future.
LLM chatbots, a form of artificial intelligence, have previously demonstrated strong performance on Ophthalmic Knowledge Assessment Program examinations, and research has begun to examine how they can be used in specific areas of ophthalmology. This study aimed to assess the chatbot's broader capabilities by comparing its accuracy with that of attending-level ophthalmologists: fellowship-trained glaucoma and retina specialists were compared with the LLM.
The cross-sectional study took place at a single center; all case data were drawn from the Department of Ophthalmology at the Icahn School of Medicine at Mount Sinai in New York, New York, and all specialists were practicing physicians at the same center. The researchers selected 10 glaucoma questions and 10 retina questions from the Commonly Asked Questions of the American Academy of Ophthalmology to test knowledge of clinical questions. To test case management knowledge, 10 retina cases and 10 glaucoma cases were selected from patients in the department. Questions and patients were selected at random.
The study used the May 12, 2023, version of the GPT-4 chatbot. The accuracy of all answers was measured on a 10-point Likert scale, with 1 and 2 representing poor or unacceptable accuracy and 9 and 10 representing very good accuracy without any inaccuracies. A 6-point scale was used to evaluate the medical completeness of the responses.
The retina and glaucoma specialists answered the clinical questions and the case management questions, and their answers were compared with those generated by GPT-4 as the primary end point.
A total of 1271 ratings for accuracy and 1267 ratings for completeness were collected for this study. Twelve specialists were included, 8 glaucoma specialists and 4 retina specialists, along with 3 ophthalmology trainees. Participants had practiced for a mean (SD) of 11.7 (13.5) years.
The LLM chatbot had a mean combined question-case accuracy rank of 506.2 vs 403.4 for the glaucoma specialists. The pattern held for completeness, with mean ranks of 528.3 for the LLM chatbot and 398.7 for the glaucoma specialists. The gap in combined accuracy was narrower between the LLM chatbot and the retina specialists, at 235.3 and 216.1, respectively, and the mean rank for completeness was 258.3 for the chatbot vs 208.7 for the retina specialists.
“Both trainees and specialists rated the chatbot’s accuracy and completeness more favorably than those of their specialist counterparts,” the authors wrote, with the specialists rating the chatbot’s responses significantly higher than those of their human counterparts on both measures.
This study had some limitations. It took place at a single center with only 1 group of attendings, which may limit its generalizability to other populations. Chatbots also have limitations in decision-making, particularly for complex decisions, which should be taken into account.
Overall, this assessment found that the LLM chatbot displayed accuracy comparable to that of retina and glaucoma specialists on both clinical questions and clinical cases, indicating its potential as a diagnostic tool.
Reference
Huang AS, Hirabayashi K, Barna L, Parikh D, Pasquale LR. Assessment of a large language model’s response to questions and cases about glaucoma and retina management. JAMA Ophthalmol. Published online February 22, 2024. doi:10.1001/jamaophthalmol.2023.6917