Host : The Japanese Society for Artificial Intelligence
Name : 71st SIG-SLUD
Number : 71
Location : [in Japanese]
Date : September 15, 2014
Pages : 01-
The evaluation of conversational systems that chat with people remains an open problem. Some studies have evaluated such systems by hand on ordinal scales such as the Likert scale. One limitation of this approach is that previously assigned evaluation values cannot be reused, since the ordinal scales are not consistent across evaluations. This makes it difficult to compare a proposed system with previous ones, because the previous systems must be re-implemented and evaluated at the same time. We propose an automatic evaluation method for conversational systems that scores the sentences generated by a system on the basis of similarities computed against many reference sentences and their annotated evaluation values. Our proposed method's correlation coefficient with human judgments reached 0.514, while that among the human annotators was 0.783. Although a gap remains between the estimated and the human-annotated values, the proposed method outperforms a baseline that uses BLEU scores as the evaluation values. We also show that a correlation coefficient of 0.499 can be obtained by evaluating just 7% of all the data.
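
To illustrate the general idea, the following Python sketch estimates a score for a generated sentence as the similarity-weighted average of human-annotated scores over a set of reference sentences. It is a minimal sketch under stated assumptions, not the paper's exact formulation: the cosine-over-bag-of-words similarity, the weighting scheme, and all function names (similarity, estimate_score) are illustrative assumptions.

    # Hypothetical sketch: estimate an evaluation value for a generated
    # sentence from annotated reference sentences via similarity weighting.
    # The similarity measure and weighting below are assumptions for
    # illustration, not the method proposed in the paper.

    from collections import Counter
    from math import sqrt

    def similarity(a: str, b: str) -> float:
        """Cosine similarity between bag-of-words vectors of two sentences."""
        va, vb = Counter(a.split()), Counter(b.split())
        dot = sum(va[w] * vb[w] for w in va)
        na = sqrt(sum(c * c for c in va.values()))
        nb = sqrt(sum(c * c for c in vb.values()))
        return dot / (na * nb) if na and nb else 0.0

    def estimate_score(generated: str, references: list) -> float:
        """Similarity-weighted average of the annotated evaluation values.

        references: list of (reference_sentence, annotated_score) pairs.
        """
        weighted = [(similarity(generated, ref), score) for ref, score in references]
        total = sum(w for w, _ in weighted)
        if total == 0.0:
            return 0.0  # no overlap with any reference sentence
        return sum(w * s for w, s in weighted) / total

    # Usage: each reference sentence is paired with its human-annotated value.
    refs = [("that sounds like fun", 4.0), ("i do not understand", 2.0)]
    print(estimate_score("sounds fun to me", refs))

In this toy example the generated sentence overlaps only with the first reference, so the estimate is pulled toward that reference's annotated value; with many annotated references, such an estimator can score new system outputs without a fresh round of human evaluation.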