Article ID: EJ25-0201
GPT-4o, a general-purpose large language model, can be combined with retrieval-augmented generation (RAG) to assist in dietary counseling; however, research on its application in this field remains scarce. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized benchmark. Three model configurations—GPT-4o, GPT-4o-mini, and GPT-4o-RAG—were assessed on 599 publicly available multiple-choice questions from the 2022–2024 national examinations. Each model answered every question five times, and the evaluation was based on these repeated outputs to assess response variability and robustness. For GPT-4o-RAG, a custom pipeline was implemented to retrieve guideline-based documents and integrate them with GPT-generated responses. Accuracy, variance, and response consistency were evaluated, and Term Frequency–Inverse Document Frequency (TF-IDF) analysis was conducted to compare word characteristics between correctly and incorrectly answered questions. All three models achieved accuracy above the 60% passing threshold. GPT-4o-RAG demonstrated the highest accuracy (83.5% ± 0.3%), followed by GPT-4o (82.1% ± 1.0%) and GPT-4o-mini (70.0% ± 1.4%). Although the accuracy improvement of GPT-4o-RAG over GPT-4o was not statistically significant (p = 0.12), GPT-4o-RAG exhibited significantly lower variance and higher response consistency (97.3% vs. 91.2–95.2%, p < 0.001). It outperformed the other models in the applied and clinical nutrition categories but showed limited performance on numerical questions, and the TF-IDF analysis indicated that incorrect answers were more frequently associated with numerical terms. GPT-4o-RAG thus improved response consistency and domain-specific performance, suggesting its utility in clinical nutrition; however, its limitations in numerical reasoning and individualized guidance warrant further development and validation.
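The abstract reports mean accuracy over five repeated generations and a response-consistency percentage, but does not define the consistency metric precisely. A minimal sketch, assuming "consistency" means that all five sampled answers for a question agree (the function names and toy data below are hypothetical):

```python
from collections import Counter

def consistency(runs):
    """Fraction of questions whose repeated sampled answers all agree.

    `runs` is a list with one inner list per question, holding the
    answer choice produced in each of the repeated generations,
    e.g. ["B", "B", "B", "B", "A"] for five runs.
    """
    agree = sum(1 for answers in runs if len(set(answers)) == 1)
    return agree / len(runs)

def accuracy(runs, key):
    """Mean accuracy across repeated generations, scoring each
    generation round against the official answer key."""
    n_runs = len(runs[0])
    per_run = []
    for r in range(n_runs):
        correct = sum(1 for answers, k in zip(runs, key) if answers[r] == k)
        per_run.append(correct / len(runs))
    return sum(per_run) / n_runs

# Toy example: 3 questions, 5 generations each (illustrative only).
runs = [["B"] * 5, ["A", "A", "A", "A", "C"], ["D"] * 5]
key = ["B", "A", "D"]
print(consistency(runs))  # 2 of the 3 questions are fully consistent
print(accuracy(runs, key))
```

On the real benchmark, `runs` would hold the five parsed answer choices per exam question and `key` the published answer key; variance across the five per-run accuracies yields the reported ± values.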
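The TF-IDF comparison of correctly versus incorrectly answered questions can be sketched with a plain-Python implementation; the corpus, tokenization, and variable names below are hypothetical, and the paper's actual preprocessing of Japanese exam text is not specified:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return per-document TF-IDF weights for tokenized documents.

    TF is the term's relative frequency within the document; IDF is
    log(N / document frequency) over the whole collection.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Hypothetical mini-corpus: question texts pooled by answer outcome.
correct = "recommended protein intake for adults according to guidelines".split()
incorrect = "calculate the energy requirement in kcal per kg body weight".split()
w_correct, w_incorrect = tf_idf([correct, incorrect])

# Highest-weighted terms in the incorrectly answered pool.
top = sorted(w_incorrect, key=w_incorrect.get, reverse=True)[:3]
```

In the study's setting, the two pooled documents would be built from the actual question texts, and terms with high weight in the incorrect pool (here, numeric and unit-like tokens) are the ones the abstract flags as associated with wrong answers.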