マルチモーダル深層学習を用いた画像とテキストの意味理解に基づく整合性判定

鈴木 莉子; 小西 幹人; 池田 順哉; 林 大地; 深井 颯; 菅原 優; 町井 湧介; 山浦 佑介

doi:10.11517/pjsai.JSAI2020.0_3Q5GS901

34th (2020)

Session ID : 3Q5-GS-9-01

Conference information

Host: The Japanese Society for Artificial Intelligence

Name : 34th Annual Conference, 2020

Number : 34

Location : Online

Date : June 09, 2020 - June 12, 2020

Semantic Consistency Assessment of Visual and Text Content using Multimodal Deep Neural Networks

Riko SUZUKI, *Mikito KONISHI, Junya IKEDA, Daichi HAYASHI, So FUKAI, Yu SUGAWARA, Yusuke MACHII, Yusuke YAMAURA

Keywords: Multimodal, Deep Learning, Natural Language Processing, Image Recognition, Cross Attention

Abstract

Assessing the semantic consistency of an image and the text inside a document is an important task, because readers refer to the image to deepen their understanding of the text content. In this study, we develop a multimodal deep neural network for assessing the semantic consistency of an image and text. We propose a novel approach that combines binary classification with an angular margin loss to acquire discriminative features. We also clarify contradictions between the image and the text by visualizing cross-attention between objects in the image and words in the text. To show the effectiveness of our approach, we evaluate the accuracy of several models on the Flickr30k dataset, which contains images and their captions. The results show that our proposed model outperforms the existing joint embedding model, with an improvement of 0.9 in F-measure.
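
The abstract does not give the exact loss formulation, so the following is only a minimal sketch of one way binary classification and an angular margin could be combined. It assumes an ArcFace-style additive margin applied to the angle between an image embedding and a text embedding, followed by a sigmoid binary cross-entropy. The function names, the `margin` and `scale` values, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_logit(cos_sim, is_match, margin=0.2, scale=16.0):
    """Scaled cosine logit; matched pairs are penalized by an additive angle.

    Assumption: the margin is added only for consistent (matched) pairs,
    so they must be closer in angle to reach the same logit.
    """
    cos_sim = max(-1.0, min(1.0, cos_sim))  # guard acos against rounding
    theta = math.acos(cos_sim)
    if is_match:
        theta = min(theta + margin, math.pi)  # keep cos monotonic
    return scale * math.cos(theta)


def binary_cross_entropy(logit, is_match):
    """Numerically stable sigmoid cross-entropy for a single logit."""
    z = logit if is_match else -logit
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def consistency_loss(image_vec, text_vec, is_match, margin=0.2, scale=16.0):
    """Loss for one image-text pair labeled consistent or inconsistent."""
    cos_sim = cosine_similarity(image_vec, text_vec)
    logit = margin_logit(cos_sim, is_match, margin, scale)
    return binary_cross_entropy(logit, is_match)


# Toy embeddings (hypothetical): the two vectors are 45 degrees apart.
image_vec = [1.0, 0.0]
text_vec = [1.0, 1.0]
loss_with_margin = consistency_loss(image_vec, text_vec, True, margin=0.2)
loss_without_margin = consistency_loss(image_vec, text_vec, True, margin=0.0)
```

In this sketch the margin raises the loss for a consistent pair at a given angle, which pushes matched image and text embeddings closer together than plain binary classification would require. This is the usual motivation for angular margins as a way to obtain more discriminative features.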
