Developing an automatic evaluation framework for open-domain dialogue response generation systems that can validate the effects of daily system improvements at low cost is necessary. However, existing metrics commonly used for automatic response generation evaluation, such as bilingual evaluation understudy (BLEU), correlate poorly with human evaluation. This poor correlation arises from the nature of dialogue, namely that many acceptable responses exist for a given input context. To address this issue, we focus on evaluating response generation systems via response selection. In this task, given a context, a system selects the appropriate response from a set of response candidates. Because systems are restricted to choosing among specific candidates, evaluation via response selection can mitigate the effect of the above-mentioned nature of dialogue. Generally, false response candidates are randomly sampled from other, unrelated dialogues, which causes two issues: (a) false candidates that are unrelated to the context and (b) acceptable utterances mislabeled as false. Response selection test sets built in this way are therefore unreliable. Thus, this paper proposes a method for constructing response selection test sets with well-chosen false candidates. Experiments demonstrate that evaluating systems via response selection with well-chosen false candidates correlates more strongly with human evaluation than commonly used automatic evaluation metrics such as BLEU.
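To make the evaluation setup concrete, the sketch below illustrates response selection as described above: for each context, the system under evaluation scores the ground-truth response together with false candidates, and the test-set score is the fraction of instances where the ground truth is ranked first. This is a minimal illustration, not the paper's released code; the names `score_response`, `selection_accuracy`, and the toy data are hypothetical placeholders.

```python
# Minimal sketch of evaluation via response selection (illustrative only).
from typing import Callable, List, Tuple

# A test instance: a dialogue context, the ground-truth response, and
# false response candidates (ideally well-chosen rather than randomly sampled).
TestInstance = Tuple[str, str, List[str]]

def selection_accuracy(
    score_response: Callable[[str, str], float],
    test_set: List[TestInstance],
) -> float:
    """Fraction of instances where the ground-truth response is ranked
    above every false candidate (i.e., Recall@1)."""
    correct = 0
    for context, gold, false_candidates in test_set:
        candidates = [gold] + false_candidates
        # The system under evaluation scores each candidate given the context.
        best = max(candidates, key=lambda r: score_response(context, r))
        correct += int(best == gold)
    return correct / len(test_set)

if __name__ == "__main__":
    toy_set: List[TestInstance] = [
        ("How was your weekend?",
         "It was great, I went hiking.",
         ["The train leaves at noon.", "Purple is my favorite color."]),
    ]
    # Placeholder scorer; in practice, plug in the generation system's
    # score for a (context, response) pair, e.g., its log-likelihood.
    dummy_scorer = lambda ctx, resp: -abs(len(ctx) - len(resp))
    print(f"Recall@1: {selection_accuracy(dummy_scorer, toy_set):.2f}")
```

Under this setup, system ranking quality can then be compared against human evaluation (e.g., via rank correlation), which is the comparison the abstract refers to when contrasting response selection with metrics such as BLEU.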