We propose a new task called image-to-text matching (ITeM) to facilitate multimodal document understanding. ITeM requires a system to learn a plausible assignment of images to texts in a multimodal document. To study this task, we systematically construct a dataset comprising 66,947 documents with 320,200 images from Wikipedia. We evaluate two existing state-of-the-art multimodal systems on our task to assess its validity and difficulty. Experimental results show that the systems greatly outperform simple baselines, while their performance still falls far short of human performance. Further, the proposed task does not contribute significantly to existing multimodal tasks; however, detailed analysis suggests that the task becomes more difficult as the number of images in a document increases, and that the proposed task can offer capabilities for image-to-text understanding not attainable through existing tasks, such as consideration of multiple images or image abstraction.
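To make the assignment formulation concrete, the following is a minimal illustrative sketch, not the authors' system or the paper's exact ITeM setup: assuming precomputed image and text embeddings (the embedding models, the one-to-one matching constraint, and the cosine-similarity scoring are all assumptions here), a simple baseline could score every image-text pair within a document and select the highest-scoring assignment with the Hungarian algorithm.

```python
# Illustrative sketch only: the paper's actual ITeM formulation and models
# may differ. We assume precomputed embeddings and treat matching as a
# bipartite assignment of a document's images to its text passages.
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_images_to_texts(image_embs: np.ndarray, text_embs: np.ndarray):
    """Assign each image to a text passage, maximizing total cosine similarity.

    image_embs: (n_images, d) array; text_embs: (n_texts, d) array.
    Returns a list of (image_index, text_index) pairs.
    """
    # Normalize rows so dot products equal cosine similarities.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = img @ txt.T  # (n_images, n_texts) similarity matrix

    # linear_sum_assignment minimizes cost, so negate the similarities.
    rows, cols = linear_sum_assignment(-sim)
    return list(zip(rows.tolist(), cols.tolist()))


# Toy usage: a document with 2 images, 3 candidate passages, 4-dim embeddings.
rng = np.random.default_rng(0)
pairs = match_images_to_texts(rng.normal(size=(2, 4)), rng.normal(size=(3, 4)))
print(pairs)  # e.g. [(0, 2), (1, 0)]
```

A per-document assignment like this (rather than independent per-pair classification) is one natural reading of the task, since it forces the matcher to weigh all of a document's images against one another, which is consistent with the abstract's point that the task grows harder as the number of images per document increases.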