Host : The Japanese Society for Artificial Intelligence
Name : 34th Annual Conference, 2020
Number : 34
Location : Online
Date : June 09, 2020 - June 12, 2020
Humans can abstract rich representations from multi-modal information and use them in daily tasks. For instance, object concepts are represented by combining vision, sound, touch, language, and other modalities. In human communication, speakers express what they observe with their own sensory organs as linguistic information, while listeners infer the speakers' sensations from that linguistic information through their own knowledge. Communication agents therefore have to acquire knowledge that is bidirectionally predictable from multi-modal information. We propose a bidirectionally predictable model between images and language based on BERT, which employs a hierarchical self-attention structure. The proposed cross-modal BERT was evaluated on a cross-modal prediction task and a multi-modal categorization task. Experimental results showed that the cross-modal BERT acquired rich multi-modal representations and performed cross-modal prediction in both directions. In the category estimation task, the proposed model also achieved higher performance when using multi-modal information than when using a single modality.
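The abstract does not spell out the architecture, so the following is only a minimal sketch of one plausible reading of a "hierarchical self-attention structure": a modality-specific self-attention stack per modality, followed by a shared stack over the concatenated image and language sequences. All class names, dimensions, and layer counts are hypothetical and are not taken from the paper; PyTorch is used purely for illustration, and positional/segment embeddings and the masked-prediction heads needed for bidirectional prediction are omitted for brevity.

    import torch
    import torch.nn as nn

    class CrossModalBERTSketch(nn.Module):
        """Hypothetical hierarchical cross-modal encoder (illustration only)."""

        def __init__(self, d_model=256, n_heads=4, n_layers=2,
                     vocab_size=8000, img_feat_dim=2048):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, d_model)       # language tokens -> d_model
            self.img_proj = nn.Linear(img_feat_dim, d_model)          # image region features -> d_model
            make_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.text_enc = nn.TransformerEncoder(make_layer(), n_layers)   # lower level: language only
            self.image_enc = nn.TransformerEncoder(make_layer(), n_layers)  # lower level: vision only
            self.joint_enc = nn.TransformerEncoder(make_layer(), n_layers)  # upper level: both modalities

        def forward(self, token_ids, img_feats):
            # token_ids: (batch, text_len); img_feats: (batch, n_regions, img_feat_dim)
            t = self.text_enc(self.token_emb(token_ids))
            v = self.image_enc(self.img_proj(img_feats))
            # Shared self-attention over the concatenated sequences lets
            # attention mix the two modalities into a joint representation.
            return self.joint_enc(torch.cat([t, v], dim=1))

    # Dummy forward pass with made-up shapes:
    model = CrossModalBERTSketch()
    tokens = torch.randint(0, 8000, (2, 12))
    regions = torch.randn(2, 36, 2048)
    out = model(tokens, regions)   # (2, 48, 256) joint multi-modal representation

Under this assumed layout, cross-modal prediction in either direction would mask one modality's inputs and reconstruct them from the joint representation, and category estimation would pool that representation; the actual training objectives are those described in the paper, not this sketch.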