In recent years, artificial intelligence (AI) technologies such as image recognition and natural language processing have been increasingly used in the medical field, and their range of application is expanding with technological innovation. However, limitations to their capabilities in certain fields have been acknowledged, so an understanding of their use and associated risks is necessary for safe and effective medical application. ChatGPT, a natural language processing technology released in 2022, is trained on text from the Internet and other sources; it generates responses to input instructions (prompts), and its accuracy can be further improved depending on how the prompts are written. Although ChatGPT has been reported to attain passing scores on bar examinations and medical licensing examinations in the United States, few studies have examined its effectiveness in the medical field in Japanese, a non-English language. In this study, we evaluated the performance of ChatGPT on the 2022 Otolaryngology Specialist Examination and discussed its usefulness and the challenges of using AI in Japanese-language otolaryngology.
In this study, 48 multiple-choice questions from the 2022 Otolaryngology Specialist Examination, excluding those with diagrams, were used for the evaluation. The question statements were input as prompts using four different methods: the original questions, Japanese prompts, English translations of the Japanese prompts, and English prompts. Since two versions of ChatGPT (GPT-3.5 and GPT-4) are currently available, eight combinations were validated, each at five different times. The responses were compared with the correct answers, and their accuracy was evaluated.
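The abstract does not describe the implementation, but an evaluation of this kind could be scripted along the following lines. This is a minimal sketch, not the authors' protocol: it assumes the OpenAI Python SDK (openai>=1.0), the model identifiers "gpt-3.5-turbo" and "gpt-4", a hypothetical question bank, and a simplified answer-matching rule.

```python
# Minimal sketch of the evaluation loop: 48 questions x 4 prompt styles
# x 2 model versions x 5 runs, scored against an answer key.
# Model names, prompt wording, and answer matching are assumptions.
from statistics import mean
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical question bank: one prompt per input method plus an answer key.
# The real study used 48 questions; only the structure is shown here.
questions = [
    {
        "prompts": {
            "original":   "(original Japanese question text)",
            "japanese":   "(question rewritten as a Japanese prompt)",
            "translated": "(English translation of the Japanese prompt)",
            "english":    "(question written directly as an English prompt)",
        },
        "answer": "b",  # correct option letter
    },
    # ... remaining questions of the 48-question set
]

MODELS = ["gpt-3.5-turbo", "gpt-4"]  # assumed API names for GPT-3.5/GPT-4
PROMPT_STYLES = ["original", "japanese", "translated", "english"]
N_RUNS = 5

def ask(model: str, prompt: str) -> str:
    """Send one prompt and return the model's reply as lowercase text."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower()

for model in MODELS:
    for style in PROMPT_STYLES:
        run_accuracies = []
        for _ in range(N_RUNS):
            # Simplified scoring: the reply is expected to begin with the
            # option letter (the real prompts would need to enforce this).
            correct = sum(
                ask(model, q["prompts"][style]).startswith(q["answer"])
                for q in questions
            )
            run_accuracies.append(correct / len(questions))
        print(f"{model} / {style}: mean accuracy {mean(run_accuracies):.2%}")
```

Repeating each condition five times, as in the study, captures the run-to-run variability of the model's answers before averaging.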
When the original question statements were used as prompts, the average accuracies were 31.67% for GPT-3.5 and 45.42% for GPT-4. With Japanese prompts, the accuracies were 35.00% and 43.75%, respectively; with English translations of the Japanese prompts, 39.58% and 52.08%; and with direct English prompts, 50.42% and 65.00%. The percentage of correct answers thus improved with the newer GPT version and with English prompts. GPT-3.5 and GPT-4 showed similar trends in the types of questions they answered correctly, and the questions with low percentages of correct answers were likewise similar between the two versions. The questions most often answered incorrectly tended to concern otology, vertigo and equilibrium, voice, and institutions.
ChatGPT achieved a maximum correct response rate of 65% on the Otolaryngology Specialist Examination. In the future, higher correct response rates may be achieved by improving the accuracy of ChatGPT and developing new prompts. Our findings show that a certain level of accuracy can be achieved in the field of otolaryngology even in Japanese, a non-English language, which improves our understanding of the usefulness and challenges of AI in clinical otolaryngology. ChatGPT did not always provide correct answers, and its correct response rate varied with the type of prompt. Therefore, at this time, the final judgment on the answers provided by ChatGPT must be made by a person. In the field of otolaryngology, physicians will need to be judicious in their use of artificial intelligence in clinical practice, for example, by limiting it to low-risk medical applications.