2024 Volume 91 Issue 2 Pages 155-161
Background: Emergency physicians need a broad range of knowledge and skills to address critical medical, traumatic, and environmental conditions. Artificial intelligence (AI), including large language models (LLMs), has potential applications in healthcare settings; however, the performance of LLMs in emergency medicine remains unclear. Methods: To evaluate the reliability of information provided by ChatGPT, the LLM was given the questions set by the Japanese Association of Acute Medicine in its board certification examinations over a 5-year period (2018-2022) and prompted to answer each question twice. Statistical analysis was used to assess agreement between the two responses. Results: The LLM produced an answer for 465 of the 475 text-based questions, and its overall correct response rate was 62.3%. For questions without images, the rate of correct answers was 65.9%. For questions with images that were not explained to the LLM, the rate of correct answers was only 52.0%. The annual rates of correct answers to questions without images ranged from 56.3% to 78.8%. Accuracy was higher for scenario-based questions (69.1%) than for stand-alone questions (62.1%). Agreement between the two responses was substantial (kappa = 0.70). Factual errors accounted for 82% of the incorrect answers. Conclusion: The LLM performed satisfactorily on an emergency medicine board certification examination administered in Japanese when the questions did not include images. However, factual errors in the responses highlight the need for physician oversight when using LLMs.
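The agreement statistic reported above can be illustrated with a minimal sketch. The abstract does not state which kappa variant was used or how responses were coded, so this example assumes Cohen's kappa computed on a binary correct/incorrect outcome for each run. The function name `cohen_kappa` and the two short response lists are hypothetical and are not data from the study. Under the commonly used Landis and Koch bands, values from 0.61 to 0.80 are labeled "substantial", which matches the wording used for kappa = 0.70.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    # Observed agreement: share of items labelled identically in both runs.
    p_o = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement expected from each run's marginal label frequencies.
    c1, c2 = Counter(first), Counter(second)
    p_e = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if p_e == 1.0:
        # Both runs used one identical label throughout; treated here as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical outcomes, NOT from the study: 1 = correct, 0 = incorrect.
run1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
run2 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
kappa = cohen_kappa(run1, run2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the two runs agree on 8 of 10 questions (observed agreement 0.80), while the marginals give a chance agreement of 0.52, so kappa is (0.80 - 0.52) / (1 - 0.52), or about 0.58. Kappa is lower than raw agreement because it discounts the matches that two runs with these correct-answer rates would produce by chance.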