2025 Volume 70 Issue 3 Pages 90-97
This study develops a RAG-based chatbot that uses our institution’s homepage as its knowledge source and conducts comparative experiments with different language models (LMs). We compared combinations of two embedding LMs (OpenAI text-embedding-3-large and Snowflake arctic-embed2) and five response LMs (GPT-4o mini and the Gemma3 series at 1B, 4B, 12B, and 27B parameters). Each combination was evaluated on 30 questions, 10 from each of three categories: (i) open-ended question answering, (ii) true/false judgments on closed questions, and (iii) refusal to answer irrelevant questions. Performance improved as the parameter count of the response LM increased, with Gemma3 27B achieving results nearly equivalent to GPT-4o mini. Models with 4B parameters or more appropriately refused to answer irrelevant questions, whereas models with 12B parameters or fewer failed to adequately handle categories (i) and (ii). These results suggest that response models at the Gemma3 27B scale or larger are needed to build a practical RAG chatbot.