When presented with a set of multiple-choice questions used to help trainees prepare for the ophthalmology board certification examinations, the artificial intelligence (AI) chatbot ChatGPT answered only about half correctly.
Preparing for ophthalmic board certification through the Ophthalmic Knowledge Assessment Program (OKAP) and Written Qualifying Examination (WQE) will not be made easier with artificial intelligence (AI), according to a new study published in JAMA Ophthalmology. Researchers found that ChatGPT answered approximately half of the presented multiple-choice questions correctly when prompted.
ChatGPT is an AI chatbot developed by OpenAI that can interact with users conversationally and, when used appropriately, act as an educational tool. Using ChatGPT responsibly in medical education and clinical practice will be vital going forward, the authors of the current study noted.
Although a past study found that ChatGPT has knowledge equivalent to that of a third-year medical student when answering questions related to the United States Medical Licensing Examination, its performance in other disciplines is unclear. The current study aimed to assess ChatGPT's knowledge against practice questions used for ophthalmology board certification examinations.
All questions were collected from the free trial of OphthoQuestions, which provides practice questions for the OKAP and WQE. Questions that required image or video input were excluded, whereas text-based questions were retained.
The researchers’ primary outcome was the performance of ChatGPT in answering the questions; secondary outcomes included whether ChatGPT provided explanations, the mean length of questions and responses, performance in answering questions without multiple-choice options, and changes in performance over time.
All conversations in ChatGPT were cleared before asking each question to avoid responses being influenced by past conversations. A new account was also used to avoid any past history influencing the answers. The primary analysis used the January 9 version of ChatGPT, whereas the secondary analysis used the February 13 version. All answers were manually reviewed by the authors.
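The authors entered questions into the ChatGPT interface manually, but the same fresh-context protocol could, in principle, be reproduced programmatically. The minimal Python sketch below is illustrative only; it assumes OpenAI API access, a placeholder model name, and a hypothetical list of question strings, none of which come from the study itself.

# Minimal sketch of a fresh-context protocol, assuming OpenAI API access.
# Each question is sent in its own independent request so no prior
# conversation can influence the response, analogous to clearing the chat
# before every question. The question list and model name are illustrative.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

questions = [
    "Which of the following is the most likely diagnosis? A) ... B) ... C) ... D) ...",
    # ...remaining text-based practice questions...
]

for i, question in enumerate(questions, start=1):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder; the study used the ChatGPT web interface
        messages=[{"role": "user", "content": question}],  # no prior turns carried over
    )
    answer = response.choices[0].message.content
    print(f"Q{i}: {answer}\n")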
ChatGPT answered questions from January 9 to 16, 2023, in the primary analysis and on February 17, 2023, in the secondary analysis. Of the 166 available questions, 125 text-based questions were presented to ChatGPT and analyzed. All included questions were considered high yield for board certification examinations per OphthoQuestions.
ChatGPT was experiencing high demand while responding to 44 questions (35%), and its mean (SD) response time was 17.8 (14.4) seconds. ChatGPT answered 58 of 125 questions (46.4%) correctly in January 2023. General medicine questions had the best results, with 11 of 14 (79%) answered correctly. Retina and vitreous questions had the worst results, with ChatGPT answering all of them incorrectly.
Additional insight or explanations were provided for 79 of 125 questions (63%); the proportion of questions given explanations or insights was similar between questions answered incorrectly and correctly (difference, 5.8%; 95% CI, –11.0% to 22.0%). Length of questions was similar between questions answered correctly and incorrectly (difference, 21.4 characters; SE, 36.8; 95% CI, –51.5 to 94.3), and length of responses was also similar regardless of accuracy (difference, –80.0 characters; SE, 65.4; 95% CI, –209.5 to 49.5).
In its multiple-choice responses, ChatGPT selected the answer most commonly chosen by ophthalmology trainees on OphthoQuestions 44% of the time, the second most popular answer 22% of the time, the second least popular answer 18% of the time, and the least popular answer 11% of the time.
ChatGPT improved in the February 2023 analysis, answering 73 of the 125 questions (58%) correctly. It performed similarly on stand-alone questions without multiple-choice options, answering 42 of 78 (54%) correctly (difference, 4.6%; 95% CI, –9.2% to 18.3%).
Internet speed, online traffic, and delays in response time could have biased certain parameters, the authors wrote in discussing the study’s limitations. A separate study could yield different results because ChatGPT provides unique answers. Questions that were not text based were excluded from the study. Questions may have been answered more broadly when multiple-choice options were not provided, which could have led to incorrect responses.
The researchers concluded that ChatGPT was not able to “answer sufficient multiple-choice questions correctly for it to provide substantial assistance in preparing for board certification at this time.” However, they acknowledged that future studies should evaluate the progression of AI chatbots’ performance.
Reference
Mihalache A, Popovic MM, Muni RH. Performance of an artificial intelligence chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. Published online April 27, 2023. doi:10.1001/jamaophthalmol.2023.1144