2024 Volume 39 Issue 3 Pages IDS6-C_1-8
A spoken dialogue system must listen to a human user continuously in order to converse smoothly. We propose a method that performs response generation and response timing estimation simultaneously. The method learns response timing from added pseudo-samples representing points at which a response would be inappropriate, which makes it possible to train on a text-only conversation dataset without audio information. In addition, the method can control the substantialness of responses through a user-specified parameter. This control is integrated via the Dynamic-Prompt-Tune method, which uses prompt token embeddings generated dynamically from the parameter. Automatic and manual evaluations showed that, compared with the baseline model, the proposed method generates responses with more natural timing that are more in line with the response substantialness parameter.
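As an illustration of how timing pseudo-samples might be derived from text alone, the following minimal sketch treats every incomplete prefix of a user utterance as a point where the system should not yet respond. The abstract does not specify the construction, so the prefix-based scheme, the `WAIT_TOKEN` label, and the function name `build_timing_samples` are all assumptions made for this example, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical target label meaning "do not respond yet" (an assumption).
WAIT_TOKEN = "<wait>"


def build_timing_samples(
    dialogue: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Turn (user_utterance, system_response) pairs into training samples.

    Assumed scheme: each proper word-level prefix of the user utterance
    becomes a pseudo-sample whose target is WAIT_TOKEN, and the complete
    utterance keeps the original response as its target.
    """
    samples: List[Tuple[str, str]] = []
    for user_utterance, response in dialogue:
        words = user_utterance.split()
        # Proper prefixes: the user is assumed to be still speaking.
        for end in range(1, len(words)):
            samples.append((" ".join(words[:end]), WAIT_TOKEN))
        # Full utterance: the original response is the target.
        samples.append((user_utterance, response))
    return samples


if __name__ == "__main__":
    for source, target in build_timing_samples([("how are you", "fine thanks")]):
        print(repr(source), "->", repr(target))
```

A model trained on such samples would emit the wait label for partial input and a real response once the utterance is complete, so generation and timing estimation share a single output. Whether the proposed method builds its pseudo-samples this way is not stated in the abstract.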