IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Special Section on Human Communication VI
Multimodal Voice Activity Projection for Turn-Taking and Effects on Speaker Adaptation
Kazuyo ONISHI, Hiroki TANAKA, Satoshi NAKAMURA
Journal, Free access

2025 Volume E108.D Issue 6 Pages 445-453

Abstract

Predicting utterances in two-party conversations is a crucial technology for realizing natural turn-taking between humans and virtual agents. Recently, Voice Activity Projection (VAP) models, which handle various turn-taking events in a unified manner, have gained attention. This study investigates whether incorporating non-verbal features improves the performance of VAP models. Our results indicate that integrating non-verbal features significantly improves VAP performance, particularly for turn-shift prediction, overlap prediction, and backchannel prediction. We also evaluated VAP models that use only single-speaker features, with a view to implementing them in virtual agents. The findings demonstrate that turn-taking from the user to the spoken dialogue system can be predicted adequately under this condition. The study also outlines the potential for further performance gains from integrating a variety of language and non-verbal features.

© 2025 The Institute of Electronics, Information and Communication Engineers