Searching for Optimal Captions in Training Text-to-Image Models

Jumpei Nakao; Masaru Isonuma; Junichiro Mori; Ichiro Sakata

doi:10.11517/pjsai.JSAI2023.0_1O5GS701

37th (2023)

Session ID : 1O5-GS-7-01

Conference information

Host: The Japanese Society for Artificial Intelligence

Name : The 37th Annual Conference of the Japanese Society for Artificial Intelligence

Number : 37

Location : [in Japanese]

Date : June 06, 2023 - June 09, 2023

Searching optimal caption in learning Text-to-Image model

*Jumpei NAKAO, Masaru ISONUMA, Junichiro MORI, Ichiro SAKATA

Keywords: Deep learning, Multimodal, Bilevel optimization

CONFERENCE PROCEEDINGS FREE ACCESS

Abstract

Training Text-to-Image models requires datasets consisting of a huge number of image-caption pairs. Because the captions in such datasets are manually annotated, they are not necessarily optimal for training Text-to-Image models. In this study, we propose a learning framework that trains a Text-to-Image model while optimizing the captions used for training. Specifically, we introduce a caption model that outputs pseudo captions from images, and we alternately update the parameters of this caption model and of the Text-to-Image model through bilevel optimization. As a preliminary experiment, we evaluate the effectiveness of bilevel optimization for training Text-to-Image models.
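
The alternating update described in the abstract can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: the scalar `w` stands in for the Text-to-Image model parameters, the scalar `phi` stands in for the parameters of the model that produces pseudo captions, and the quadratic losses, the validation objective, the `target` value, and the one-step unrolled hypergradient are all assumptions chosen only to make the bilevel structure visible.

```python
# Toy bilevel optimization with alternating updates (illustrative assumptions only).
# Inner problem: w minimizes a training loss that depends on the pseudo-caption
#                parameter phi; here L_train(w, phi) = (w - phi)^2.
# Outer problem: phi is tuned so that the trained w does well on a held-out
#                objective; here L_val(w) = (w - target)^2.

def train_loss_grad_w(w, phi):
    """Gradient of the assumed inner loss (w - phi)^2 with respect to w."""
    return 2.0 * (w - phi)


def val_loss(w, target):
    """Assumed outer (validation) loss, which depends on phi only through w."""
    return (w - target) ** 2


def bilevel_fit(target=3.0, lr_w=0.1, lr_phi=1.0, steps=200):
    """Alternate one inner step on w with one outer step on phi."""
    w, phi = 0.0, 0.0
    for _ in range(steps):
        # Inner update: one gradient step on the training loss.
        w_next = w - lr_w * train_loss_grad_w(w, phi)

        # Outer update: differentiate the validation loss through that one
        # inner step (one-step unrolling).
        # w_next = w - lr_w * 2 * (w - phi), so d(w_next)/d(phi) = 2 * lr_w.
        dval_dw = 2.0 * (w_next - target)
        dw_dphi = 2.0 * lr_w
        phi = phi - lr_phi * dval_dw * dw_dphi

        w = w_next
    return w, phi


if __name__ == "__main__":
    w, phi = bilevel_fit()
    print(f"w = {w:.6f}, phi = {phi:.6f}, val_loss = {val_loss(w, 3.0):.3e}")
```

In this toy setting both parameters settle at `target`: the outer step moves `phi` so that the inner step pulls `w` toward a lower validation loss. With real networks the derivative through the inner step would typically come from automatic differentiation rather than a hand-derived expression.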
