Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
There are many methods for building dialogue systems, but evaluating dialogues remains challenging. Metrics such as dialogue quality are difficult to quantify and are therefore often assessed by human judgment. Recently, methods that use LLMs to evaluate dialogue data have been proposed. LLM evaluations resemble human evaluations to some extent, but the agreement is not sufficient. The Elo rating system, which evaluates data through pairwise comparisons, is assumed not to require accounting for differences in standards among evaluators, and is therefore expected to improve evaluation accuracy. In some cases, however, such as when the distribution of evaluation values is biased, the Elo rating system may not improve accuracy. In this study, we examine whether the Elo rating system improves the accuracy of evaluation under various distributions of evaluation values.
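For reference, below is a minimal sketch of how Elo ratings can rank items from pairwise comparisons, as the abstract describes. It is an illustration only: the dialogue names, the hidden quality values, and the noisy_judge function are hypothetical stand-ins for an evaluator (human or LLM), not the study's actual data or method.

```python
import random

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score (win probability) of A against B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one pairwise comparison.

    score_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * (e_a - score_a)

# Hypothetical "true" quality values, standing in for the evaluation
# values whose distribution the study varies.
true_quality = {"dialogue_1": 0.9, "dialogue_2": 0.5, "dialogue_3": 0.4}

def noisy_judge(a: str, b: str) -> float:
    """Hypothetical evaluator: prefers the higher-quality dialogue, with noise."""
    diff = true_quality[a] - true_quality[b]
    return 1.0 if random.random() < 0.5 + diff else 0.0

ratings = {d: 1000.0 for d in true_quality}  # a common Elo starting rating
random.seed(0)
for _ in range(500):
    a, b = random.sample(list(ratings), 2)
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], noisy_judge(a, b))

for d, r in sorted(ratings.items(), key=lambda x: -x[1]):
    print(f"{d}: {r:.0f}")
```

Because each update depends only on which of the two items is preferred, a judge with a consistently strict or lenient absolute scale produces the same comparison outcomes, which is the intuition behind the assumption that pairwise Elo evaluation sidesteps differences in evaluators' standards.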