IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532


UTStyleCap4K: Generating Image Captions with Sentimental Styles
Chi ZHANG, Li TAO, Toshihiko YAMASAKI
Advance online publication

Article ID: 2024EDP7036

Abstract

Stylized image captioning is the task of generating image captions written in a particular style, such as with positive or negative sentiment. Deep learning models have recently achieved high performance on this task, but their captions still lack accuracy and diversity, and they are hampered by the small size and low descriptiveness of existing datasets. In this paper, we introduce a new dataset, UTStyleCap4K, which contains 4,644 images with three positive and three negative captions per image (27,864 captions in total), collected via a crowdsourcing service. Experimental results show that our captions are accurate in both meaning and sentiment, diverse in how they express each style, and less similar to the base MSCOCO dataset than existing stylized image captioning datasets are. We train multiple models on our dataset to establish baselines. We also propose a new Bidirectional Encoder Representations from Transformers (BERT)-based model, StyleCapBERT, which controls the length and style of generated captions simultaneously by incorporating length and style information into the embeddings of caption words. Experimental results show that our model can generate captions in all three sentimental styles, positive, factual, and negative, within a single framework, and that it achieves the best performance on our dataset.
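To illustrate the embedding mechanism described above, the following is a minimal sketch of how length and style information could be added to BERT-style token embeddings. This is an assumption-laden illustration in PyTorch, not the authors' implementation: the class name, dimensions, and default values are hypothetical, and only the general idea (summing style and target-length embeddings into each token embedding) follows the abstract.

```python
import torch
import torch.nn as nn

class StyleLengthEmbedding(nn.Module):
    """Token embeddings augmented with style and target-length embeddings.

    Hypothetical sketch: vocab_size, num_styles, and max_len are
    illustrative defaults, not values from the paper.
    """

    def __init__(self, vocab_size=30522, hidden=768, num_styles=3,
                 max_len=64, max_pos=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.position = nn.Embedding(max_pos, hidden)
        self.style = nn.Embedding(num_styles, hidden)    # positive / factual / negative
        self.length = nn.Embedding(max_len + 1, hidden)  # desired caption length
        self.norm = nn.LayerNorm(hidden)
        self.drop = nn.Dropout(0.1)

    def forward(self, token_ids, style_id, target_len):
        # token_ids: (batch, seq); style_id, target_len: (batch,)
        seq = token_ids.size(1)
        pos = torch.arange(seq, device=token_ids.device).unsqueeze(0)
        x = (self.token(token_ids)
             + self.position(pos)
             + self.style(style_id).unsqueeze(1)      # broadcast over sequence
             + self.length(target_len).unsqueeze(1))  # broadcast over sequence
        return self.drop(self.norm(x))

# Usage: embed a batch of two captions with negative style (id 2), target length 12.
emb = StyleLengthEmbedding()
ids = torch.randint(0, 30522, (2, 16))
out = emb(ids, torch.tensor([2, 2]), torch.tensor([12, 12]))
print(out.shape)  # torch.Size([2, 16, 768])
```

Because the style and length signals enter through additive embeddings, the rest of a standard BERT encoder can remain unchanged; the same network conditions on different styles and lengths simply by switching the corresponding embedding indices.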

© 2024 The Institute of Electronics, Information and Communication Engineers