Host : The Japanese Society for Artificial Intelligence
Name : 34th Annual Conference, 2020
Number : 34
Location : Online
Date : June 09, 2020 - June 12, 2020
Humans can abstract rich representations from multi-modal information and use them in daily tasks. For instance, object concepts are represented by combining vision, sound, touch, language, and other modalities. In human communication, speakers express what they observe with their own sensory organs as linguistic information, while listeners infer the speakers' sensations from that linguistic information through their own knowledge. Communication agents therefore have to acquire knowledge that is bidirectionally predictable from multi-modal information. We propose a bidirectionally predictable model between images and language based on BERT, which employs a hierarchical self-attention structure. The proposed cross-modal BERT was evaluated on a cross-modal prediction task and a multi-modal categorization task. Experimental results showed that the cross-modal BERT acquired rich multi-modal representations and performed cross-modal prediction in both directions. In the category estimation task, the proposed model also achieved higher performance when using multi-modal information than when using a single modality.
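The abstract does not spell out the architecture, so the following is only a minimal sketch of one plausible reading of a "hierarchical self-attention structure": a modality-specific self-attention stack per modality, followed by a shared stack over the concatenated image and language sequences. All class names, dimensions, and layer counts are hypothetical and are not taken from the paper; PyTorch is used purely for illustration, and positional/segment embeddings and the masked-prediction heads needed for bidirectional prediction are omitted for brevity.

    import torch
    import torch.nn as nn

    class CrossModalBERTSketch(nn.Module):
        """Hypothetical hierarchical cross-modal encoder (illustration only)."""

        def __init__(self, d_model=256, n_heads=4, n_layers=2,
                     vocab_size=8000, img_feat_dim=2048):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, d_model)       # language tokens -> d_model
            self.img_proj = nn.Linear(img_feat_dim, d_model)          # image region features -> d_model
            make_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.text_enc = nn.TransformerEncoder(make_layer(), n_layers)   # lower level: language only
            self.image_enc = nn.TransformerEncoder(make_layer(), n_layers)  # lower level: vision only
            self.joint_enc = nn.TransformerEncoder(make_layer(), n_layers)  # upper level: both modalities

        def forward(self, token_ids, img_feats):
            # token_ids: (batch, text_len); img_feats: (batch, n_regions, img_feat_dim)
            t = self.text_enc(self.token_emb(token_ids))
            v = self.image_enc(self.img_proj(img_feats))
            # Shared self-attention over the concatenated sequences lets
            # attention mix the two modalities into a joint representation.
            return self.joint_enc(torch.cat([t, v], dim=1))

    # Dummy forward pass with made-up shapes:
    model = CrossModalBERTSketch()
    tokens = torch.randint(0, 8000, (2, 12))
    regions = torch.randn(2, 36, 2048)
    out = model(tokens, regions)   # (2, 48, 256) joint multi-modal representation

Under this assumed layout, cross-modal prediction in either direction would mask one modality's inputs and reconstruct them from the joint representation, and category estimation would pool that representation; the actual training objectives are those described in the paper, not this sketch.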