2025 Volume 32 Issue 5 Pages 560-562
See article vol. 32: 567-579
Large language models (LLMs) are used in medicine. LLMs combine natural-language capabilities with generative AI: they function as probabilistic models that select the next appropriate word or phrase in response to user input, based on its context and content. This mechanism generates the most probable (i.e., contextually appropriate) response, without any specific long-term goals or objectives. LLMs learn from large amounts of textual data; this does not mean that they understand the content, but rather that they acquire the ability to generate responses based on the training data. LLMs have shown the potential to automatically summarize information in electronic medical records, thereby reducing the burden of medical practice and allowing doctors to concentrate more on patient care1).
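To illustrate this next-word selection mechanism, the following minimal sketch (our own illustrative example, assuming the open-source Hugging Face transformers library and the publicly available GPT-2 model, neither of which is part of the studies discussed here) shows how a causal language model assigns probabilities to candidate next tokens and simply favors the most probable ones.

# Minimal sketch of next-token selection in a causal language model.
# Assumes the Hugging Face `transformers` library and the GPT-2 model;
# an illustrative example only, not the mechanism of any specific product.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Statins are prescribed to lower"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, sequence_length, vocab_size)

next_token_logits = logits[0, -1]        # scores for the next position only
probs = torch.softmax(next_token_logits, dim=-1)

# The "most probable (contextually appropriate)" continuation is simply the
# highest-probability token; no long-term goal or plan is involved.
top_probs, top_ids = torch.topk(probs, k=5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(i))!r}: {p.item():.3f}")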
ChatGPT is an LLM. This study evaluated the accuracy and reproducibility of ChatGPT-3.5 (OpenAI) responses to clinical questions (CQs) regarding the Japanese Atherosclerosis Society’s Guidelines for the Prevention of Atherosclerotic Cardiovascular Disease, 2022 Edition (JAS Guidelines 2022). The accuracy of responses to background questions (BQs) and foreground questions (FQs) used in clinical practice decision-making was independently assessed by three researchers using a six-point Likert scale. The results showed that the ChatGPT responses were highly accurate and reproducible, particularly for the FQs; the responses to FQs are speculated to be more accurate than those to BQs because FQs are based on evidence from randomized controlled trials2, 3). Kusunose et al. likewise evaluated and reported the accuracy of ChatGPT responses to the CQs in the Japanese Society of Hypertension (JSH) 2019 guidelines4).
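As an illustration of how such ratings might be tabulated, the short sketch below uses hypothetical scores and variable names (not the data of the cited study) to average six-point Likert ratings from three raters and to check reproducibility between two repeated runs on the same question.

# Hypothetical illustration of scoring guideline responses:
# three raters score each answer on a 1-6 Likert scale (accuracy),
# and reproducibility is judged by agreement between two repeated runs.
from statistics import mean

# Likert ratings per question: {question_id: [rater1, rater2, rater3]}
ratings_run1 = {"FQ1": [6, 5, 6], "FQ2": [5, 5, 6], "BQ1": [4, 3, 4]}
ratings_run2 = {"FQ1": [6, 6, 6], "FQ2": [5, 4, 5], "BQ1": [3, 3, 4]}

for qid in ratings_run1:
    acc1 = mean(ratings_run1[qid])
    acc2 = mean(ratings_run2[qid])
    # Treat the answer as reproducible if the mean rating differs by
    # less than one Likert point between runs (an arbitrary threshold).
    reproducible = abs(acc1 - acc2) < 1.0
    print(f"{qid}: run1={acc1:.1f}, run2={acc2:.1f}, reproducible={reproducible}")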
While these studies suggest that LLMs such as ChatGPT can effectively assist healthcare professionals in interpreting guidelines, some limitations must also be considered.
First, the quality and range of the training data: although ChatGPT is trained on a large amount of textual data, the source and range of these data are not disclosed and may not be specific to the medical field. As such, the model may lack specialized medical information or may not reflect updated guidelines5). The ChatGPT-3.5 training data extend only to September 2021 and therefore do not include information beyond that date. In addition, ChatGPT does not have a real-time search function and relies on the information available when the model was trained, which prevents the latest research findings and guidelines from being reflected immediately. New treatments and procedures are frequently updated in the medical field; therefore, there is a risk of providing answers based on outdated information. Zakka et al. used external resources (search engines and medical databases) for questions about medical guidelines and treatments to augment the information in their language model (Almanac). They reported that Almanac showed higher accuracy, comprehensiveness, and user favorability than other LLMs (ChatGPT-4, Bing, Bard), particularly in terms of adversarial safety6). Another report found that LLMs fine-tuned on prescribing data annotated by 1,000 experts significantly reduced the incidence of near-miss events (errors detected before reaching the patient) compared with conventional LLMs7). Training on specialist medical information in this manner may improve accuracy.
Second, there is the issue of ‘hallucination’, the generation of false information: ChatGPT can sometimes generate misinformation that appears factual (so-called hallucinations). This is more likely to occur when there is no reliable answer to a particular question or when the training data do not contain the relevant information. Because ChatGPT generates answers from a general knowledge base rather than one specific to a particular medical field, additional information may be needed for questions that require specific expertise8).
Third, the use of AI in healthcare is accompanied by ethical and legal issues related to patient privacy and responsibility; privacy protection and data management must be addressed carefully.
Considering the above, it is essential to be aware of these limitations and to use LLMs such as ChatGPT with caution when assisting healthcare professionals in interpreting and applying guidelines.
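As a rough illustration of the retrieval-augmented approach taken by systems such as Almanac, the sketch below (a simplified example of our own; the tiny in-memory corpus and the ask_llm function are placeholders, not part of any cited system) retrieves the guideline passage most relevant to a question and prepends it to the prompt, so that answers draw on current, curated text rather than on the model's training data alone.

# Simplified retrieval-augmented generation (RAG) sketch: look up the most
# relevant guideline passage and include it in the prompt. The corpus and
# ask_llm() are placeholders, not part of Almanac or any cited system.
GUIDELINE_PASSAGES = [
    "LDL-C management targets should be set according to the absolute risk "
    "of atherosclerotic cardiovascular disease.",
    "Lifestyle modification, including diet and exercise, is recommended "
    "before and alongside pharmacotherapy.",
]

def retrieve(question: str, passages: list[str]) -> str:
    """Return the passage sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(passages, key=lambda p: len(q_words & set(p.lower().split())))

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API (e.g., a chat-completion endpoint)."""
    return "(model response would appear here)"

question = "What is the recommended LDL-C management strategy?"
context = retrieve(question, GUIDELINE_PASSAGES)
prompt = (
    "Answer using only the guideline excerpt below.\n"
    f"Guideline excerpt: {context}\n"
    f"Question: {question}"
)
print(ask_llm(prompt))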
When doctors’ and ChatGPT’s responses to health questions were anonymized, randomly ordered, and rated by a team of healthcare professionals, the ChatGPT responses were rated higher than the doctors’ responses for information quality and empathy9). However, medical advice disclosed as AI-generated has been reported to receive lower public acceptance and trust and to be perceived as less empathetic than advice from a doctor10). Measures are also needed to explain how AI can assist doctors in providing personalized patient care and to reduce such bias. Therefore, we must understand the properties of LLMs and find ways to utilize them rationally (Fig. 1). We must always consider whether a balance can be achieved between the benefits of valuable LLMs and the cost of building and maintaining them.
Fig. 1. AI-Driven Feedback Loop for Medical Diagnosis, Treatment Recommendations, and Patient Support
None.