Endocrine Journal
Online ISSN : 1348-4540
Print ISSN : 0918-8959
ISSN-L : 0918-8959
Performance of GPT-4o combined with retrieval-augmented generation on nutritionist licensing exam questions
Yu Ishikawa, Akitaka Higashi, Nozomu Arai, Daisuke Ozo, Wataru Hasegawa, Tetsuya Imamura, Zenbei Matsumoto, Hidetaka Nambo, Shigehiro Karashima
JOURNAL OPEN ACCESS Advance online publication

Article ID: EJ25-0201

Abstract

GPT-4o, a general-purpose large language model, has a retrieval-augmented variant (GPT-4o-RAG) that can assist in dietary counseling. However, research on its application in this field remains limited. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized benchmark for evaluation. Three language models (GPT-4o, GPT-4o-mini, and GPT-4o-RAG) were assessed using 599 publicly available multiple-choice questions from the 2022–2024 national examinations. Each model answered each question five times, and the evaluation was based on these repeated outputs to assess response variability and robustness. A custom pipeline was implemented for GPT-4o-RAG to retrieve guideline-based documents for integration with GPT-generated responses. Accuracy rates, variance, and response consistency were evaluated. Term Frequency–Inverse Document Frequency analysis was conducted to compare word characteristics in correctly and incorrectly answered questions. All three models achieved accuracy rates above the 60% passing threshold. GPT-4o-RAG demonstrated the highest accuracy (83.5% ± 0.3%), followed by GPT-4o (82.1% ± 1.0%) and GPT-4o-mini (70.0% ± 1.4%). Although the accuracy improvement of GPT-4o-RAG over GPT-4o was not statistically significant (p = 0.12), GPT-4o-RAG exhibited significantly lower variance and higher response consistency (97.3% vs. 91.2–95.2%, p < 0.001). GPT-4o-RAG outperformed the other models in the applied and clinical nutrition categories but showed limited performance on numerical questions. Term Frequency–Inverse Document Frequency analysis suggested that incorrect answers were more frequently associated with numerical terms. GPT-4o-RAG improved response consistency and domain-specific performance, suggesting utility in clinical nutrition. However, limitations in numerical reasoning and individualized guidance warrant further development and validation.
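
The repeated-run evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy answers, and in particular the definition of response consistency (the share of questions on which all five runs returned the same answer) are assumptions made for illustration, since the abstract does not state how consistency was computed.

```python
from statistics import mean, stdev

def per_run_accuracy(runs, answer_key):
    """Accuracy of each run; `runs` holds one list of answers per run."""
    return [
        sum(given == correct for given, correct in zip(run, answer_key)) / len(answer_key)
        for run in runs
    ]

def response_consistency(runs):
    """Share of questions answered identically in every run (assumed definition)."""
    per_question = list(zip(*runs))  # regroup answers by question
    return sum(len(set(answers)) == 1 for answers in per_question) / len(per_question)

# Toy data (hypothetical): 4 questions, 5 runs of one model.
answer_key = ["a", "c", "b", "d"]
runs = [
    ["a", "c", "b", "d"],
    ["a", "c", "b", "a"],
    ["a", "c", "d", "d"],
    ["a", "c", "b", "d"],
    ["a", "b", "b", "d"],
]

accuracies = per_run_accuracy(runs, answer_key)
mean_accuracy = mean(accuracies)   # reported as the central value
sd_accuracy = stdev(accuracies)    # sample standard deviation across runs
consistency = response_consistency(runs)

print(f"accuracy: {mean_accuracy:.1%} ± {sd_accuracy:.1%}")
print(f"consistency: {consistency:.1%}")
```

Under this assumed definition, a model can have high mean accuracy yet low consistency if its errors move between questions from run to run, which is the distinction the abstract draws between GPT-4o and GPT-4o-RAG.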

© The Japan Endocrine Society

This article is licensed under a Creative Commons [Attribution-NonCommercial-NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nc-nd/4.0/