Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
37th (2023)
Session ID : 2G5-OS-21e-03

Generating Captions with Multi-level Multimodal Encoder on Image Captioning with Reading Comprehension Tasks
*Wei YANG, Arisa UEDA, Komei SUGIURA
Abstract

Visual scene understanding, such as image captioning, is one of the essential topics in the artificial intelligence (AI) field. Image captioning with reading comprehension tasks, an extension of traditional image captioning, is more challenging because the generated caption must relate to the text information in the image, and reading and comprehending text in the context of an image remains an open problem. In this work, we propose multiple image-related attention blocks with multimodal Optical Character Recognition (OCR) information to model the relationship among the global image, the multi-level recognized text, and the detected objects in the image. Our model is validated on the standard TextCaps dataset, and the results show that it outperforms the baseline methods on all evaluation metrics.
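
The abstract only outlines the architecture. The following is a minimal, hypothetical PyTorch sketch of one cross-attention block that fuses global-image, OCR-token, and detected-object features, assuming a shared 512-dimensional feature space; all module names, shapes, and the fusion order are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultimodalAttentionBlock(nn.Module):
    """Illustrative cross-attention block: OCR-token queries attend over
    global-image and detected-object features (not the authors' exact model)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, queries, context):
        # queries: (B, N_q, d_model) multi-level OCR-token features
        # context: (B, N_c, d_model) concatenated global-image and object features
        attended, _ = self.cross_attn(queries, context, context)
        x = self.norm1(queries + attended)
        return self.norm2(x + self.ffn(x))


# Toy usage with random features (shapes are assumptions, not from the paper).
ocr_feats = torch.randn(2, 30, 512)   # multi-level recognized-text (OCR) tokens
obj_feats = torch.randn(2, 50, 512)   # detected-object features
img_feat = torch.randn(2, 1, 512)     # pooled global-image feature
block = MultimodalAttentionBlock()
fused = block(ocr_feats, torch.cat([img_feat, obj_feats], dim=1))  # (2, 30, 512)
```

In this sketch the OCR tokens serve as queries because the generated caption must copy or paraphrase scene text; stacking several such blocks, one per modality pairing, is one plausible reading of the "multiple image-related attention blocks" described above.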

© 2023 The Japanese Society for Artificial Intelligence