Host: The Japanese Society for Artificial Intelligence
Name : The 36th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 36
Location : [in Japanese]
Date : June 14, 2022 - June 17, 2022
Many attempts have been made to build multimodal dialog systems that can answer questions about given audio-visual information, and a representative task for such systems is Audio Visual Scene-Aware Dialog (AVSD). To understand visual information, most conventional AVSD models adopt a Convolutional Neural Network (CNN)-based video feature extractor. Although CNNs tend to extract spatially and temporally local information, global information is also crucial for video understanding, since AVSD requires modeling long-term temporal visual dependencies and information from the whole visual input. In this study, we apply a video understanding module with space-time attention, which can capture spatially and temporally global representations more efficiently than a CNN-based module. We observed that our AVSD model achieved high scores in both objective and subjective evaluations of answer generation.
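
To illustrate what "spatially and temporally global" means here, the following is a minimal sketch of joint space-time self-attention over video patch tokens. It is not the module used in this study: the abstract does not specify the attention variant, so the single head, the identity query/key/value projections, the function name `joint_space_time_attention`, and the toy input are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_space_time_attention(video):
    """Single-head self-attention over all patches of all frames.

    video[t][n] is the feature vector of patch n in frame t.
    Assumption: queries, keys and values are the tokens themselves
    (identity projections); a real model would learn these projections.
    """
    # Flatten (frames x patches) into one token sequence so that every
    # token can attend to every spatial position in every frame.
    tokens = [patch for frame in video for patch in frame]
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    outputs, weights = [], []
    for q in tokens:
        # Scaled dot-product score against every token in the video.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        # Weighted sum of all tokens: a globally mixed representation.
        out = [sum(w[j] * tokens[j][d] for j in range(len(tokens)))
               for d in range(dim)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Toy input (hypothetical): 3 frames, 2 patches per frame, 2-dim features.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.5, -0.5]],
    [[-1.0, 0.0], [0.0, 2.0]],
]
outputs, weights = joint_space_time_attention(video)
```

In this sketch each token's attention row covers all 6 tokens (3 frames × 2 patches), so a single layer already mixes information across the whole clip, whereas a convolution with a small kernel only mixes neighboring positions and frames per layer. Joint attention scales quadratically with the number of tokens; practical video models often reduce this cost, for example by factorizing attention into separate temporal and spatial steps.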