Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
37th (2023)
Session ID : 2G5-OS-21e-03

Generating Captions with Multi-level Multimodal Encoder on Image Captioning with Reading Comprehension Tasks
*Wei YANG, Arisa UEDA, Komei SUGIURA
Abstract

Visual scene understanding, such as image captioning, is one of the essential topics in the artificial intelligence (AI) field. Image captioning with reading comprehension tasks, an extension of traditional image captioning, is more challenging because the generated caption must relate to the text information in the image, and reading and comprehending text in the context of an image remains an open problem. In this work, we propose multiple image-related attention blocks with multimodal Optical Character Recognition (OCR) information to model the relationship among the global image, the multi-level recognized text, and the detected objects in the image. Our model is validated on the standard TextCaps dataset, and the results show that it outperforms the baseline methods on all evaluation metrics.
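
The abstract only outlines the architecture. The following is a minimal, hypothetical PyTorch sketch of one cross-attention block that fuses global-image, OCR-token, and detected-object features, assuming a shared 512-dimensional feature space; all module names, shapes, and the fusion order are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultimodalAttentionBlock(nn.Module):
    """Illustrative cross-attention block: OCR-token queries attend over
    global-image and detected-object features (not the authors' exact model)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, queries, context):
        # queries: (B, N_q, d_model) multi-level OCR-token features
        # context: (B, N_c, d_model) concatenated global-image and object features
        attended, _ = self.cross_attn(queries, context, context)
        x = self.norm1(queries + attended)
        return self.norm2(x + self.ffn(x))


# Toy usage with random features (shapes are assumptions, not from the paper).
ocr_feats = torch.randn(2, 30, 512)   # multi-level recognized-text (OCR) tokens
obj_feats = torch.randn(2, 50, 512)   # detected-object features
img_feat = torch.randn(2, 1, 512)     # pooled global-image feature
block = MultimodalAttentionBlock()
fused = block(ocr_feats, torch.cat([img_feat, obj_feats], dim=1))  # (2, 30, 512)
```

In this sketch the OCR tokens serve as queries because the generated caption must copy or paraphrase scene text; stacking several such blocks, one per modality pairing, is one plausible reading of the "multiple image-related attention blocks" described above.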

© 2023 The Japanese Society for Artificial Intelligence