Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
ITeM: Image-to-Text Matching for Multimodal Documents
Masayasu Muraoka, Naoaki Okazaki, Ryosuke Kohita, Etsuko Ishii

2022 Volume 29 Issue 4 Pages 1198-1232

Abstract

We propose a new task called image-to-text matching (ITeM) to facilitate multimodal document understanding. ITeM requires a system to learn a plausible assignment of images to texts in a multimodal document. To study this task, we systematically construct a dataset comprising 66,947 documents with 320,200 images from Wikipedia. We evaluate two existing state-of-the-art multimodal systems on our task to assess its validity and difficulty. Experimental results show that the systems greatly outperform simple baselines, while their performance remains far below that of humans. Further, the proposed task does not contribute significantly to existing multimodal tasks; however, detailed analysis suggests that the task becomes more difficult as the number of images in a document increases, and that it can offer capabilities for image-to-text understanding not achievable through existing tasks, such as considering multiple images jointly or handling image abstraction.
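As a minimal sketch of the task setting (not the authors' code or data), ITeM can be framed as assigning each image to one text segment of a document. Assuming a hypothetical image-text compatibility score matrix, a simple decoding picks the highest-scoring segment per image, and accuracy against gold assignments serves as an evaluation measure:

```python
# Hypothetical sketch of ITeM-style evaluation: scores, segment indices,
# and gold labels below are illustrative, not from the paper's dataset.

def assign_images(scores):
    """scores[i][j]: compatibility of image i with text segment j.
    Returns, for each image, the index of its best-matching segment."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

def assignment_accuracy(pred, gold):
    """Fraction of images assigned to their gold text segment."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

# Toy document: 3 images, 2 text segments.
scores = [
    [0.9, 0.1],  # image 0 strongly matches segment 0
    [0.2, 0.7],  # image 1 matches segment 1
    [0.6, 0.5],  # image 2 weakly prefers segment 0
]
gold = [0, 1, 1]  # hypothetical gold assignment

pred = assign_images(scores)
print(pred)                             # [0, 1, 0]
print(assignment_accuracy(pred, gold))  # 0.666...
```

A real system would replace the score matrix with learned image-text similarities; the abstract notes that this becomes harder as the number of images per document grows.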

© 2022 The Association for Natural Language Processing