Abstract
This study assesses the effectiveness and variability of responses generated by various AI models in providing
insulin injection guidance. By examining the capacity of AI to facilitate diabetes self-management,
the research seeks to inform and advance the integration of AI technologies into healthcare practices, particularly
from a health policy perspective.
To evaluate the performance of AI systems in delivering insulin injection guidance, we compared four AI
models: the Diabetes Self-Management GPTs Support System (DSM-GPTs), a customized AI developed with
ChatGPT's GPTs, and three general-purpose AI models (GPT-4 Omni, Gemini 2.0 Flash, and Claude 3.7 Sonnet).
Standardized prompts tailored to both general adult and older adult patients with diabetes were used to assess the models.
The outputs were analyzed using metrics such as word count, adherence to established injection protocols, and
scores generated by a customized Scoring-GPTs system, rated on a 100-point scale.
Eighty responses (10 per model per patient profile) were evaluated. All models achieved high median quality
scores (range 90–96/100). Claude 3.7 Sonnet obtained the highest mean score (95.7 ± 3.4), followed by GPT-4
Omni (94.1 ± 3.2), Gemini 2.0 Flash (92.4 ± 6.7), and DSM-GPTs (90.6 ± 3.4). GPT-4 Omni exhibited the lowest
score variability, whereas Gemini 2.0 Flash showed the widest dispersion, reflecting less predictable performance.
Response length differed markedly across models: Gemini produced the longest explanations (median ≈ 700
words) and Claude the briefest (median ≈ 310 words), while DSM-GPTs and GPT-4 Omni provided intermediate-length, reader-friendly answers. Procedural analysis of 20 key injection checkpoints revealed that GPT-4 Omni
fully covered 66% of items, DSM-GPTs 44%, Claude 3.7 Sonnet 55%, and Gemini 2.0 Flash 68%. GPT-4 Omni
and DSM-GPTs were particularly consistent in hygiene and safety steps, whereas Gemini 2.0 Flash omitted basic
preparation steps more frequently. DSM-GPTs uniquely incorporated geriatric-specific considerations (e.g.,
tremor, visual impairment) in 7/10 geriatric scenarios, exceeding the coverage of general-purpose models.
Large language model (LLM) systems can generate high-quality insulin injection guidance, but substantial
differences exist in completeness, coherence, and brevity. GPT-4 Omni balanced accuracy with concise delivery,
whereas DSM-GPTs provided the most tailored geriatric advice. These findings highlight the need for benchmark
frameworks and policy oversight to ensure the safe, equitable deployment of AI-driven self-management tools
for older adults.

Advancing Geriatric Diabetes Care: Performance Comparison of Artificial Intelligence (AI) Models and Health Policy Implications