Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
36th (2022)
Session ID : 1H4-OS-17a-03

Multimodal Identification of Cartoons with Vision Transformer and BERT
*Naoto AOKI, Naoki MORI, Makoto OKADA
Abstract

Against the background of advances in deep learning, research on the understanding and generation of creative works by computers has been pursued actively. However, understanding and generating creative works are intellectual tasks that remain difficult for computers. In this study, we focus on comics. Comics are typical multimodal creations and have recently attracted attention as multimodal data. Because comics consist of pictures and text, comic engineering involves both image processing and natural language processing. Although many studies in this field use image processing or natural language models, few use images and natural language together in a multimodal way. In this study, we address the problem of identifying the work (title) using distributed representations of both images and text. We use Manga109 as the comic dataset, Vision Transformer (ViT) for the distributed representation of images, and BERT (Bidirectional Encoder Representations from Transformers) for the distributed representation of natural language.
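The abstract does not describe how the two distributed representations are combined. The sketch below shows one plausible setup using Hugging Face Transformers: the pretrained checkpoints, the concatenation of the ViT and BERT [CLS] vectors, and the linear classifier over Manga109's 109 titles are all illustrative assumptions, not the authors' confirmed configuration.

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel, BertTokenizer, BertModel

# Checkpoints are illustrative; the paper does not state which pretrained weights were used.
# For Manga109's Japanese dialogue, a Japanese BERT checkpoint would be the natural choice.
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

NUM_TITLES = 109  # Manga109 contains 109 manga titles
# Assumed fusion: concatenate the two 768-dim [CLS] vectors and classify the title.
classifier = torch.nn.Linear(768 + 768, NUM_TITLES)

def identify_title(page_image: Image.Image, dialogue: str) -> torch.Tensor:
    """Return logits over candidate titles for one page image and its dialogue text."""
    pixels = vit_processor(images=page_image, return_tensors="pt").pixel_values
    tokens = tokenizer(dialogue, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        img_vec = vit(pixel_values=pixels).last_hidden_state[:, 0]  # ViT [CLS] embedding
        txt_vec = bert(**tokens).last_hidden_state[:, 0]            # BERT [CLS] embedding
    fused = torch.cat([img_vec, txt_vec], dim=-1)                   # shape (1, 1536)
    return classifier(fused)
```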

© 2022 The Japanese Society for Artificial Intelligence