Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
36th (2022)
Session ID : 1H4-OS-17a-03

Multimodal Identification of Cartoons with Vision Transformer and BERT
*Naoto AOKI, Naoki MORI, Makoto OKADA
Abstract

Against the background of advances in deep learning, research on the understanding and generation of creative works by computers has been pursued actively. However, understanding and generating creative works are intellectual tasks that remain difficult for computers. In this study, we focus on comics. Comics are typical multimodal creations and have recently attracted attention as multimodal data. Because comics consist of pictures and text, comic engineering involves both image processing and natural language processing. Although many studies in this field use image processing or natural language models, few use images and natural language together in a multimodal way. In this study, we address the problem of identifying the work (title) using distributed representations of both images and text. We use Manga109 as the comic dataset, Vision Transformer (ViT) for the distributed representation of images, and BERT (Bidirectional Encoder Representations from Transformers) for the distributed representation of natural language.
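The abstract does not describe how the two distributed representations are combined. The sketch below shows one plausible setup using Hugging Face Transformers: the pretrained checkpoints, the concatenation of the ViT and BERT [CLS] vectors, and the linear classifier over Manga109's 109 titles are all illustrative assumptions, not the authors' confirmed configuration.

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel, BertTokenizer, BertModel

# Checkpoints are illustrative; the paper does not state which pretrained weights were used.
# For Manga109's Japanese dialogue, a Japanese BERT checkpoint would be the natural choice.
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

NUM_TITLES = 109  # Manga109 contains 109 manga titles
# Assumed fusion: concatenate the two 768-dim [CLS] vectors and classify the title.
classifier = torch.nn.Linear(768 + 768, NUM_TITLES)

def identify_title(page_image: Image.Image, dialogue: str) -> torch.Tensor:
    """Return logits over candidate titles for one page image and its dialogue text."""
    pixels = vit_processor(images=page_image, return_tensors="pt").pixel_values
    tokens = tokenizer(dialogue, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        img_vec = vit(pixel_values=pixels).last_hidden_state[:, 0]  # ViT [CLS] embedding
        txt_vec = bert(**tokens).last_hidden_state[:, 0]            # BERT [CLS] embedding
    fused = torch.cat([img_vec, txt_vec], dim=-1)                   # shape (1, 1536)
    return classifier(fused)
```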

© 2022 The Japanese Society for Artificial Intelligence