Host: The Japanese Society for Artificial Intelligence
Name : The 37th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 37
Location : [in Japanese]
Date : June 06, 2023 - June 09, 2023
Visual scene understanding, including image captioning, is an essential topic in artificial intelligence (AI). Image captioning with reading comprehension extends traditional image captioning and is more challenging: the generated caption must relate to the text that appears in the image, so the model must learn to read and comprehend text in its visual context. In this work, we propose multiple image-related attention blocks that use multimodal Optical Character Recognition (OCR) information to model the relationships among the global image, the multi-level recognized text, and the objects detected in the image. We validate our model on the standard TextCaps dataset, and the results show that it outperforms the baseline methods on all evaluation metrics.
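The abstract does not give the internals of the proposed attention blocks. As a generic illustration only, the sketch below shows plain scaled dot-product cross-attention, in which OCR token features act as queries over a set of image-side features (one global image vector followed by detected-object vectors). The function names, feature dimensions, and toy values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    Returns the attended vectors and the attention weights per query.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features (assumed for illustration, not from the paper).
ocr_tokens = [[1.0, 0.0], [0.0, 1.0]]                # two recognized text tokens
image_feats = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]   # global image, then two objects

attended, attn = cross_attention(ocr_tokens, image_feats, image_feats)
```

In this toy setup each OCR token places most of its attention weight on the object feature it is most aligned with, which is the kind of text-to-image relationship an attention block can capture. A real model would use learned projections and higher-dimensional features.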