Multimodal Voice Activity Projection for Turn-Taking and Effects on Speaker Adaptation

Kazuyo ONISHI; Hiroki TANAKA; Satoshi NAKAMURA

doi:10.1587/transinf.2024HCP0002

Abstract

Predicting utterances in two-party conversations is a crucial technology for realizing natural turn-taking between humans and virtual agents. Recently, Voice Activity Projection (VAP) models, which handle various turn-taking events in a unified way, have gained attention. This study investigates the incorporation of non-verbal features to enhance the performance of VAP models. Our results indicate that integrating non-verbal features leads to significantly better performance in the VAP models, particularly in turn-shift prediction, overlap prediction, and backchannel prediction. Moreover, we explored the performance of VAP models using only single-speaker features, targeting their implementation in virtual agents. The findings demonstrate that turn-taking from the user to the spoken dialogue system can be predicted adequately. The study also outlines the potential for further performance enhancement by integrating a variety of linguistic and non-verbal features.
