Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
34th Annual Conference (2020)
Session ID : 1Q3-GS-11-04

Cross-modal BERT: Acquisition of Multimodal Representation and Cross-modal Prediction based on Self-Attention
*Yuta KYURAGI, Kazuki MIYAZAWA, Tatsuya AOKI, Takato HORII, Takayuki NAGAI

Abstract

Humans can abstract rich representations from multimodal information and use them in daily tasks. For instance, object concepts are represented by a combination of vision, sound, touch, language, and other modalities. During communication, speakers express information observed by their own sensory organs as linguistic information, while listeners infer the speakers' sensations from that linguistic information through their own knowledge. Communication agents therefore have to acquire bidirectionally predictable knowledge from multimodal information. We propose a bidirectionally predictive model between images and language based on BERT, which employs a hierarchical self-attention structure. The proposed cross-modal BERT was evaluated on a cross-modal prediction task and a multimodal categorization task. Experimental results showed that the cross-modal BERT acquired rich multimodal representations and performed cross-modal prediction in both directions. The proposed model also achieved higher performance in the category estimation task when using multimodal information than when using a single modality.
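The abstract describes a hierarchical self-attention structure that fuses image and language inputs and predicts each modality from the other. Below is a minimal PyTorch sketch of one way such a model could be organized; the two-level encoder layout, module names, dimensions, and mean-pooled prediction heads are all illustrative assumptions, not the authors' published implementation.

# Minimal sketch of hierarchical self-attention for image-language fusion,
# loosely following the abstract. All module names, hyperparameters, and the
# fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn

class CrossModalBERTSketch(nn.Module):
    def __init__(self, vocab_size=8000, img_feat_dim=2048, d_model=256,
                 n_heads=4, n_uni_layers=2, n_cross_layers=2):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        # Lower level: modality-specific self-attention encoders.
        self.text_enc = nn.TransformerEncoder(make_layer(), n_uni_layers)
        self.img_enc = nn.TransformerEncoder(make_layer(), n_uni_layers)
        # Upper level: joint self-attention over the concatenated sequence,
        # giving the "hierarchical" structure described in the abstract.
        self.cross_enc = nn.TransformerEncoder(make_layer(), n_cross_layers)
        # Heads for bidirectional cross-modal prediction.
        self.text_head = nn.Linear(d_model, vocab_size)   # image -> words
        self.img_head = nn.Linear(d_model, img_feat_dim)  # words -> image feats

    def forward(self, token_ids, img_feats):
        # token_ids: (B, T_text); img_feats: (B, T_img, img_feat_dim)
        t = self.text_enc(self.token_emb(token_ids))
        v = self.img_enc(self.img_proj(img_feats))
        joint = self.cross_enc(torch.cat([t, v], dim=1))
        t_out, v_out = joint[:, :t.size(1)], joint[:, t.size(1):]
        # Predict words from the fused image part and image features from
        # the fused text part (crude mean pooling over positions).
        return self.text_head(v_out.mean(dim=1)), self.img_head(t_out.mean(dim=1))

In this reading, the lower-level encoders specialize per modality while the upper-level encoder attends across the concatenated sequence, which is one plausible interpretation of a hierarchical self-attention design for bidirectional image-language prediction.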

© 2020 The Japanese Society for Artificial Intelligence