A Study on the Efficient Use of Audio and Video Information for Estimating Relationships Between Meeting Utterances

大杉 康仁; 小瀬木 悠佳; 立石 修平; 狩野 悌久; 中辻 真

doi:10.11517/pjsai.JSAI2023.0_4R2OS22a03

Abstract

Multimodal information such as audio and video can help in comprehending the relationships between utterances in meetings. To combine long audio and video sequences with short text sequences, approaches based on periodic averaging or sampling of the audio and video sequences have been proposed. Such approaches, however, tend to include less meaningful audio and video features within each sampling window. We introduce a method that resamples audio and video embeddings based on attention between the embeddings and a small number of latent features. In particular, this small, fixed-size set of latent features can effectively capture information from audio and video sequences of varying length. Experiments on the multimodal meeting corpus AMI showed that our multimodal method performed comparably to a text-only method in comprehending supportive relationships between utterances.
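
The contrast between window-based pooling and attention-based resampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `window_average` and `latent_resample`, the embedding dimension, the number of latents, and the use of random (rather than learned) latent vectors without query/key/value projections are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_average(frames, num_windows):
    """Baseline: split frames into contiguous windows and average each one."""
    n, dim = len(frames), len(frames[0])
    pooled = []
    for w in range(num_windows):
        start = w * n // num_windows
        end = max((w + 1) * n // num_windows, start + 1)
        chunk = frames[start:end]
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


def latent_resample(frames, latents):
    """Each latent attends over all frames (scaled dot-product attention).

    The output always has len(latents) rows, whatever the number of frames.
    Hypothetical simplification: no learned projections are applied.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    resampled = []
    for query in latents:
        scores = [scale * sum(q * x for q, x in zip(query, f)) for f in frames]
        weights = softmax(scores)
        resampled.append(
            [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
        )
    return resampled


# Illustrative sizes only; a trained model would learn the latent vectors.
random.seed(0)
dim, num_latents = 8, 4
latents = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_latents)]
short_clip = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(30)]
long_clip = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(500)]

short_out = latent_resample(short_clip, latents)
long_out = latent_resample(long_clip, latents)
```

In this sketch, a 30-frame clip and a 500-frame clip both reduce to four vectors. Unlike `window_average`, which weights every frame in a window equally, the attention weights let each latent emphasize the frames most similar to it, wherever they fall in the sequence.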
