Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
There are many methods for building dialogue systems, but evaluating dialogues remains challenging. Metrics such as dialogue quality are difficult to quantify and are therefore often assessed by human judgment. Recently, methods that use LLMs to evaluate dialogue data have been proposed. LLM evaluations resemble human evaluations to some extent, but the agreement is not sufficient. The Elo rating system, which evaluates data through pairwise comparisons, is assumed not to require accounting for differences in standards among evaluators, and is therefore expected to improve evaluation accuracy. In some cases, however, such as when the distribution of evaluation values is biased, the Elo rating system may not improve accuracy. In this study, we examine whether the Elo rating system improves the accuracy of evaluation under various distributions of evaluation values.
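For reference, below is a minimal sketch of how Elo ratings can rank items from pairwise comparisons, as the abstract describes. It is an illustration only: the dialogue names, the hidden quality values, and the noisy_judge function are hypothetical stand-ins for an evaluator (human or LLM), not the study's actual data or method.

```python
import random

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score (win probability) of A against B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one pairwise comparison.

    score_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * (e_a - score_a)

# Hypothetical "true" quality values, standing in for the evaluation
# values whose distribution the study varies.
true_quality = {"dialogue_1": 0.9, "dialogue_2": 0.5, "dialogue_3": 0.4}

def noisy_judge(a: str, b: str) -> float:
    """Hypothetical evaluator: prefers the higher-quality dialogue, with noise."""
    diff = true_quality[a] - true_quality[b]
    return 1.0 if random.random() < 0.5 + diff else 0.0

ratings = {d: 1000.0 for d in true_quality}  # a common Elo starting rating
random.seed(0)
for _ in range(500):
    a, b = random.sample(list(ratings), 2)
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], noisy_judge(a, b))

for d, r in sorted(ratings.items(), key=lambda x: -x[1]):
    print(f"{d}: {r:.0f}")
```

Because each update depends only on which of the two items is preferred, a judge with a consistently strict or lenient absolute scale produces the same comparison outcomes, which is the intuition behind the assumption that pairwise Elo evaluation sidesteps differences in evaluators' standards.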