Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
Evaluating Dialogue Response Generation Systems via Response Selection with Well-chosen False Candidates
Shiki Sato, Reina Akama, Hiroki Ouchi, Jun Suzuki, Kentaro Inui

2022 Volume 29 Issue 1 Pages 53-83

Abstract

Developing an automatic evaluation framework for open-domain dialogue response generation systems that can validate the effects of daily system improvements at low cost is necessary. However, existing metrics commonly used for automatic response generation evaluation, such as bilingual evaluation understudy (BLEU), correlate poorly with human evaluation. This poor correlation arises from the nature of dialogue, namely, that there are several acceptable responses to a given input context. To address this issue, we focus on evaluating response generation systems via response selection. In this task, systems select an appropriate response for a given context from a set of response candidates. Because systems choose only from the given candidates, evaluation via response selection can mitigate the effect of the above-mentioned nature of dialogue. Typically, false response candidates are randomly sampled from other, unrelated dialogues, which causes two issues: (a) false candidates that are unrelated to the context and (b) acceptable utterances labeled as false. These issues make commonly used response selection test sets unreliable. Thus, this paper proposes a method for constructing response selection test sets with well-chosen false candidates. Experiments demonstrate that evaluation via response selection with well-chosen false candidates correlates more strongly with human evaluation than commonly used automatic evaluation metrics such as BLEU.
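To illustrate the evaluation protocol described in the abstract, the minimal Python sketch below (not the authors' code) scores every candidate response for a given context and measures how often the ground-truth response is ranked first. The function score_response and the toy length-based scorer are hypothetical placeholders for whatever ranking score a real system would provide (e.g., its model likelihood over the candidate).

    # Minimal sketch: evaluating a response generation system via
    # response selection accuracy. Helper names here are hypothetical.
    from typing import Callable, List, Tuple

    # Each test instance: (context, candidate responses, index of the true response)
    TestInstance = Tuple[str, List[str], int]

    def selection_accuracy(
        test_set: List[TestInstance],
        score_response: Callable[[str, str], float],
    ) -> float:
        """Fraction of instances where the system ranks the true response highest."""
        correct = 0
        for context, candidates, true_idx in test_set:
            scores = [score_response(context, cand) for cand in candidates]
            if scores.index(max(scores)) == true_idx:
                correct += 1
        return correct / len(test_set)

    # Toy usage: a scorer that simply prefers longer candidates stands in
    # for a real system's ranking score.
    toy_test_set = [
        ("How was your weekend?",
         ["It was great, thanks!", "Blue.", "At noon."],
         0),
    ]
    print(selection_accuracy(toy_test_set, lambda ctx, cand: float(len(cand))))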

© 2022 The Association for Natural Language Processing