Advancing Geriatric Diabetes Care: Performance Comparison of Artificial Intelligence (AI) Models and Health Policy Implications

Ken Tanaka; Tatsuya Motoki; Hiroki Okazaki; Yosuke Matsui; Hirotaka Nakashima

doi:10.6011/apj.2025.02

Abstract

   This study assesses the effectiveness and variability of responses generated by various AI models when providing insulin injection guidance. By examining the capacity of AI to facilitate diabetes self-management, the research seeks to inform and advance the integration of AI technologies into healthcare practice, particularly from a health policy perspective.
   To evaluate the performance of AI systems in delivering insulin injection guidance, we compared four AI models: the Diabetes Self-Management GPTs Support System (DSM-GPTs), a customized AI developed with ChatGPT's GPTs, and three general-purpose AI models (GPT-4 Omni, Gemini 2.0 Flash, and Claude 3.7 Sonnet). Standardized prompts tailored to two patient profiles, standard (non-geriatric) and older patients with diabetes, were used to assess the models. The outputs were analyzed by word count, adherence to established injection protocols, and scores on a 100-point scale generated by a customized Scoring-GPTs system.
   Eighty responses (10 per model per patient profile) were evaluated. All models achieved high median quality scores (range 90–96/100). Claude 3.7 Sonnet obtained the highest mean score (95.7 ± 3.4), followed by GPT-4 Omni (94.1 ± 3.2), Gemini 2.0 Flash (92.4 ± 6.7), and DSM-GPTs (90.6 ± 3.4). GPT-4 Omni exhibited the lowest score variability, whereas Gemini 2.0 Flash showed the widest dispersion, reflecting less predictable performance. Response length differed markedly across models: Gemini 2.0 Flash produced the longest explanations (median ≈ 700 words) and Claude 3.7 Sonnet the briefest (median ≈ 310 words), while DSM-GPTs and GPT-4 Omni provided intermediate-length, reader-friendly answers. Procedural analysis of 20 key injection checkpoints showed that Gemini 2.0 Flash fully covered 68% of items, GPT-4 Omni 66%, Claude 3.7 Sonnet 55%, and DSM-GPTs 44%. GPT-4 Omni and DSM-GPTs were particularly consistent in hygiene and safety steps, whereas Gemini 2.0 Flash omitted basic preparation steps more frequently. DSM-GPTs uniquely incorporated geriatric-specific considerations (e.g., tremor, visual impairment) in 7/10 geriatric scenarios, exceeding the coverage of the general-purpose models.
   Large language model (LLM) systems can generate high-quality insulin injection guidance, but substantial differences exist in completeness, coherence, and brevity. GPT-4 Omni balanced accuracy with concise delivery, whereas DSM-GPTs provided the most tailored geriatric advice. These findings highlight the need for benchmark frameworks and policy oversight to ensure the safe, equitable deployment of AI-driven self-management tools for older adults.
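   The three output metrics named above (word count, checkpoint coverage, and mean ± SD of scores) could be computed along the following lines. This is only a minimal sketch: the abstract does not state how checkpoint adherence was actually judged, so the keyword matching, the function names, the checkpoint labels, and all example values here are hypothetical assumptions, not the authors' method or data.

```python
from statistics import mean, median, stdev


def word_count(text: str) -> int:
    """Count whitespace-separated tokens in a response."""
    return len(text.split())


def checkpoint_coverage(text: str, checkpoints: dict[str, list[str]]) -> float:
    """Fraction of checkpoints with at least one keyword present in the text.

    Keyword matching is an assumed stand-in for whatever rating
    procedure was actually used to judge protocol adherence.
    """
    lowered = text.lower()
    hits = sum(
        any(keyword in lowered for keyword in keywords)
        for keywords in checkpoints.values()
    )
    return hits / len(checkpoints)


def summarize(scores: list[float]) -> dict[str, float]:
    """Mean, sample standard deviation, and median of a list of scores."""
    return {"mean": mean(scores), "sd": stdev(scores), "median": median(scores)}


# Hypothetical checkpoints and response text, for illustration only.
example_checkpoints = {
    "hand_hygiene": ["wash your hands", "wash hands"],
    "site_rotation": ["rotate"],
    "needle_disposal": ["sharps"],
}
example_response = "Wash your hands, then rotate the injection site."

n_words = word_count(example_response)
coverage = checkpoint_coverage(example_response, example_checkpoints)
stats = summarize([90, 95, 100])  # made-up scores, not study data
```

   With these placeholder inputs, the response has 8 words, covers 2 of the 3 hypothetical checkpoints, and the made-up scores summarize to 95 ± 5.0. The sample standard deviation is used here; whether the reported ± values are sample or population standard deviations is not stated in the abstract.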


This article is licensed under a Creative Commons [Attribution-NonCommercial-NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nc-nd/4.0/


Funder information

1. Fund name: the Quad Fellowship
