Host : The Japanese Society for Artificial Intelligence
Name : The 37th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 37
Location : [in Japanese]
Date : June 06, 2023 - June 09, 2023
Multimodal information such as audio and video can help in comprehending relationships between utterances in meetings. To combine long audio and video sequences with short text sequences, an approach based on periodic averaging or sampling of the audio and video sequences has been proposed. This approach, however, tends to include less meaningful audio and video features within each sampling window. We introduce a method that resamples audio and video embeddings based on attention between the embeddings and a small number of latent features. In particular, this small, fixed-length set of latent features can effectively capture information from audio and video sequences of varying length. Experiments on the multimodal meeting corpus AMI showed that our multimodal method was comparable to a text-only method in comprehending supportive relationships between utterances.
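The contrast between window averaging and attention-based resampling can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the embedding size, the number of latent features, and the randomly initialized projection matrices (which would be learned in practice) are all assumptions made for illustration. It shows only the general idea of single-head cross-attention from a fixed set of latent vectors to a sequence of any length.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_average(seq, num_windows):
    # Baseline: split the sequence into roughly equal windows and average
    # each one. Every frame in a window contributes equally, informative
    # or not. Assumes len(seq) >= num_windows.
    chunks = np.array_split(seq, num_windows, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

def latent_resample(seq, latents, w_q, w_k, w_v):
    # Cross-attention: each latent vector queries the whole sequence and
    # returns a weighted sum of its projected frames.
    q = latents @ w_q                          # (M, d)
    k = seq @ w_k                              # (T, d)
    v = seq @ w_v                              # (T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (M, T)
    weights = softmax(scores, axis=-1)         # each row sums to 1 over T
    return weights @ v, weights                # (M, d), (M, T)

# Hypothetical sizes: 16-dim embeddings, 4 latent features.
rng = np.random.default_rng(0)
d, m = 16, 4
latents = rng.normal(size=(m, d))              # learned in practice
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

short_audio = rng.normal(size=(50, d))         # stand-in audio embeddings
long_video = rng.normal(size=(730, d))         # stand-in video embeddings

out_a, att_a = latent_resample(short_audio, latents, w_q, w_k, w_v)
out_v, att_v = latent_resample(long_video, latents, w_q, w_k, w_v)
avg_a = window_average(short_audio, m)
```

Both inputs, although they differ in length, are reduced to the same number of output vectors (one per latent feature), which can then be combined with the text sequence. Unlike the averaging baseline, the attention weights allow each latent feature to emphasize some frames over others.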