2025, Volume E108.D, Issue 6, Pages 445-453
Predicting utterances in two-party conversations is a crucial technology for realizing natural turn-taking between humans and virtual agents. Recently, Voice Activity Projection (VAP) models, which offer a unified approach to predicting various turn-taking events, have gained attention. This study investigates incorporating non-verbal features to enhance the performance of VAP models. Our results indicate that integrating non-verbal features yields significantly better performance, particularly in turn-shift prediction, overlap prediction, and backchannel prediction. Moreover, we examined the performance of VAP models that use features from only a single speaker, targeting their implementation in virtual agents. The findings demonstrate that turn-taking from the user to the spoken dialogue system can be adequately predicted under this setting. The study also outlines the potential for further performance gains through the integration of a variety of linguistic and non-verbal features.