End-to-End Multimodal Dialog Response Generation Based on a Video Understanding Mechanism Using Space-Time Attention

山﨑 善啓; 折橋 翔太; 増村 亮; 内田 美尋; 高島 瑛彦

doi:10.11517/pjsai.JSAI2022.0_1O5GS705

36th (2022)

Session ID : 1O5-GS-7-05

Conference information

Host: The Japanese Society for Artificial Intelligence

Name : The 36th Annual Conference of the Japanese Society for Artificial Intelligence

Number : 36

Date : June 14, 2022 - June 17, 2022

Multimodal conversational response generation based on video understanding module with space-time attention

*Yoshihiro YAMAZAKI, Shota ORIHASHI, Ryo MASUMURA, Mihiro UCHIDA, Akihiko TAKASHIMA

Keywords: Multimodal dialog, Video understanding, End-to-End response generation

Abstract

Many attempts have been made to build multimodal dialog systems that can answer questions about given audio-visual information; a representative task for such systems is Audio Visual Scene-Aware Dialog (AVSD). To understand visual information, most conventional AVSD models adopt a Convolutional Neural Network (CNN)-based video feature extractor. Although CNNs tend to extract spatially and temporally local information, global information is also crucial for video understanding, since AVSD requires long-term temporal visual dependencies and holistic visual information. In this study, we apply a video understanding module with space-time attention that can capture spatially and temporally global representations more efficiently than a CNN-based module. We observed that our AVSD model achieved high scores in both objective and subjective evaluations of answer generation.
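
The abstract does not specify how the space-time attention module is built, so the following is only a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes a video has already been split into T frames of N patch tokens each (dimension D), and applies a single-head joint self-attention over all T*N tokens, so that every token can draw on every frame and every spatial position. The function name `space_time_attention`, the random projection matrices, and the toy sizes are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def space_time_attention(video_tokens, w_q, w_k, w_v):
    """Single-head joint space-time self-attention (illustrative only).

    video_tokens: array of shape (T, N, D), T frames with N patch tokens each.
    Returns the attended tokens, shape (T, N, D_out), and the attention
    matrix, shape (T*N, T*N).
    """
    t, n, d = video_tokens.shape
    # Flatten time and space so attention spans both axes at once.
    x = video_tokens.reshape(t * n, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)
    out = attn @ v
    return out.reshape(t, n, -1), attn

# Toy input (assumed sizes): 4 frames, 9 patches per frame, 16-dim tokens.
rng = np.random.default_rng(0)
T, N, D = 4, 9, 16
tokens = rng.standard_normal((T, N, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

out, attn = space_time_attention(tokens, w_q, w_k, w_v)
```

Each row of `attn` is a distribution over all T*N tokens, which is the sense in which such a module is spatially and temporally global, in contrast to a convolution whose kernel covers only a local neighborhood. Practical video transformers often factorize this into separate temporal and spatial attention steps to reduce cost; whether the paper does so is not stated in the abstract.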