Host: The Japanese Society for Artificial Intelligence
Name: 34th Annual Conference, 2020
Number: 34
Location: Online
Date: June 9, 2020 - June 12, 2020
In open-domain dialogues, the content and style of responses can vary widely. However, this diversity is difficult to account for when evaluating responses generated by dialogue systems, since typically only one response can be extracted from a real conversation as the reference response. To address this problem, ΔBLEU extends the set of reference responses with responses drawn from massive dialogue data, each manually annotated with its appropriateness as a response. Because this human annotation is costly, ΔBLEU cannot be used for large-scale evaluation of open-domain dialogue systems, which should be evaluated in various contexts. We propose a fully automatic evaluation method, ΔBLEU-auto, which annotates the appropriateness of the extended responses used in ΔBLEU with a classifier trained on automatically collected training data. Experimental results confirm that ΔBLEU-auto is comparable to ΔBLEU in terms of correlation with human judgment, and that integrating ΔBLEU-auto into RUBER, a state-of-the-art evaluation method, further improves RUBER.
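To illustrate the core idea behind ΔBLEU-style scoring, the following is a minimal toy sketch, not the exact formula from the original ΔBLEU work: each n-gram of a system response that matches a reference is weighted by the appropriateness rating (here assumed to lie in [-1, 1]) of the best-rated reference containing it, so that matches against inappropriate references are penalized. The function name and the simplified weighting scheme are our own illustrative assumptions.

```python
from collections import Counter

def delta_bleu_precision(hypothesis, rated_references, n=2):
    """Toy weighted n-gram precision in the spirit of ΔBLEU.

    rated_references: list of (reference_string, rating) pairs, where
    rating is a human (or, as in ΔBLEU-auto, classifier-predicted)
    appropriateness score in [-1, 1]. Illustrative sketch only; the
    published ΔBLEU formula differs in its clipping and normalization.
    """
    hyp = hypothesis.split()
    total, matched = 0.0, 0.0
    for k in range(1, n + 1):
        # count the hypothesis n-grams of order k
        hyp_ngrams = Counter(tuple(hyp[i:i + k])
                             for i in range(len(hyp) - k + 1))
        for ng, count in hyp_ngrams.items():
            # weight of the best-rated reference containing this n-gram
            weights = []
            for ref, w in rated_references:
                toks = ref.split()
                ref_ngrams = {tuple(toks[i:i + k])
                              for i in range(len(toks) - k + 1)}
                if ng in ref_ngrams:
                    weights.append(w)
            total += count
            if weights:
                matched += max(weights) * count
    return matched / total if total else 0.0
```

For example, with references `[("good answer", 1.0), ("bad answer", -0.5)]`, the response "good answer" scores higher than "bad answer", since matches against the negatively rated reference contribute negative weight.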