Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
Uncertainty-aware Automatic Evaluation Method for Open-domain Dialogue Systems
Yuma Tsuta, Naoki Yoshinaga, Masashi Toyoda

2023 Volume 30 Issue 2 Pages 531-556

Abstract

Because open-domain dialogues admit diverse valid responses, common reference-based metrics for text generation, such as BLEU, do not correlate well with human judgments unless an extensive set of high-quality reference responses is prepared for each input utterance. In this study, we propose a fully automatic, uncertainty-aware evaluation method for open-domain dialogue systems, υBLEU. Our method first collects diverse reference responses from massive dialogue data, rates their quality with a neural network trained on automatically collected training data, and then computes a weighted BLEU over the automatically retrieved and rated reference responses. We also apply this method with an embedding-based metric, BERTScore, instead of the word-overlap-based BLEU, to absorb surface variations among the reference responses. Experimental results on the meta-evaluation of our method for dialogue systems, based on massive Twitter data, confirmed that it substantially improves the correlation of BLEU (and BERTScore) with human judgments. We also confirmed that our method remains effective when combined with a reference-free metric.
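The final scoring step the abstract describes, a BLEU average weighted by automatically assigned quality ratings, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reference retrieval and neural rating steps are assumed to have already produced (reference, weight) pairs, the function names are hypothetical, and a simple add-one-smoothed sentence BLEU stands in for the paper's exact BLEU variant.

```python
# Hypothetical sketch of quality-weighted BLEU over rated references.
# Upstream steps (reference retrieval, neural quality rating) are assumed.
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=2):
    """Smoothed sentence-level BLEU of hyp against a single reference."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        # add-one smoothing avoids log(0) on short utterances
        log_prec += math.log((overlap + 1) / (max(sum(h.values()), 1) + 1))
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

def weighted_bleu(hyp, rated_refs):
    """Average of per-reference BLEU, weighted by each reference's rating."""
    total = sum(w for _, w in rated_refs)
    return sum(w * bleu(hyp, ref) for ref, w in rated_refs) / total
```

A usage example with two rated references, one judged much more appropriate than the other; a high-quality reference that overlaps the hypothesis dominates the score, while a low-rated reference contributes little:

```python
refs = [("see you tomorrow then".split(), 0.9),
        ("no idea what you mean".split(), 0.1)]
score = weighted_bleu("see you tomorrow".split(), refs)  # a value in (0, 1]
```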

© 2023 The Association for Natural Language Processing