Journal of Information Processing
Online ISSN : 1882-6652
ISSN-L : 1882-6652
 
Active Utterance Collection for Efficient NLU Model Training in Dialog Systems
Rui Yang, Kei Wakabayashi

2025 Volume 33 Pages 880-889

Abstract

The development of natural language understanding (NLU) models for dialog systems necessitates collecting a large volume of user utterances as training data, which requires significant human effort. To improve the efficiency of data collection, we develop a novel active utterance collection framework that leverages dialog scenes, i.e., the states of the system's dialog manager, to actively control the data collection process. The key idea of the proposed method is to identify the dialog scenes where the current NLU model performs worst and to collect more data instances in those scenes, efficiently improving the model's performance. To estimate the NLU model's performance on each dialog scene, we propose two strategies for generating validation data, including a method that uses large language models (LLMs). Empirical evaluations on the Schema-Guided Dialog dataset indicate that the proposed method can improve the efficiency of data collection when a substantial labeled validation dataset is available. However, its efficacy diminishes under practical constraints that limit the availability of validation data. These findings underscore the potential of the proposed approach and open new avenues for future research on practical methods for enhancing the efficiency of data collection in dialog system development.
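The core allocation idea described in the abstract, directing collection effort toward the scenes where the NLU model performs worst, can be sketched as follows. This is a minimal illustration only: the function name, inverse-accuracy weighting, and scene names are hypothetical and are not taken from the paper.

```python
def select_scenes_for_collection(scene_accuracy, budget):
    """Allocate an utterance-collection budget across dialog scenes.

    scene_accuracy: dict mapping scene name -> estimated NLU accuracy
                    on (possibly LLM-generated) validation data.
    budget: total number of utterances to collect in this round.
    Returns a dict mapping scene -> number of utterances to collect,
    weighting each scene by its estimated error rate.
    """
    # Weight each scene by its estimated error (1 - accuracy), so that
    # lower-performing scenes receive more collection effort.
    errors = {s: 1.0 - a for s, a in scene_accuracy.items()}
    total = sum(errors.values())
    if total == 0:
        # Every scene is estimated perfect: split the budget evenly.
        n = len(scene_accuracy)
        return {s: budget // n for s in scene_accuracy}
    return {s: round(budget * e / total) for s, e in errors.items()}


# Hypothetical per-scene accuracy estimates; "cancel_hotel" performs
# worst and therefore receives most of the collection budget.
allocation = select_scenes_for_collection(
    {"book_flight": 0.95, "cancel_hotel": 0.70, "greet": 0.99},
    budget=100,
)
```

Here the worst-performing scene dominates the allocation, matching the paper's intuition that targeting weak scenes yields the largest performance gain per collected utterance.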

© 2025 by the Information Processing Society of Japan