IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Special Section on Human Communication VI
Multimodal Voice Activity Projection for Turn-Taking and Effects on Speaker Adaptation
Kazuyo ONISHI, Hiroki TANAKA, Satoshi NAKAMURA

2025 Volume E108.D Issue 6 Pages 445-453

Abstract

Predicting utterances in two-party conversations is a crucial technology for realizing natural turn-taking between humans and virtual agents. Recently, Voice Activity Projection (VAP) models, which handle various turn-taking events in a unified manner, have gained attention. This study investigates incorporating non-verbal features to enhance the performance of VAP models. Our results indicate that integrating non-verbal features leads to significantly better performance, particularly in turn-shift prediction, overlap prediction, and backchannel prediction. We also evaluated VAP models that use only single-speaker features, with a view to implementing them in virtual agents. The findings demonstrate that turn-taking from the user to the spoken dialogue system can be predicted adequately. The study also outlines the potential for further performance gains from integrating a variety of language and non-verbal features.

© 2025 The Institute of Electronics, Information and Communication Engineers