Stylized image captioning is the task of generating image captions with a particular descriptive style, such as positive or negative sentiment. Deep learning models have recently reached high performance on this task, but their captions still lack accuracy and diversity, and they often suffer from the small size and low descriptiveness of existing datasets. In this paper, we introduce a new dataset, UTStyleCap4K, which contains 4,644 images with three positive and three negative captions per image (27,864 captions in total), collected via a crowdsourcing service. Experimental results show that our dataset is accurate in meaning and sentiment, diverse in how the styles are expressed, and less similar to its base dataset, MSCOCO, than existing stylized image captioning datasets are. We train multiple models on our dataset to establish baselines. We also propose a new model based on Bidirectional Encoder Representations from Transformers (BERT), StyleCapBERT, which controls the length and style of generated captions simultaneously by incorporating length and style information into the embeddings of caption words. Experimental results show that our model can generate captions in three sentiment styles, positive, factual, and negative, within a single model, and achieves the best performance on our dataset.
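The abstract mentions conditioning the caption generator by adding length and style information to the word embeddings. The following is a minimal, illustrative sketch of that general idea in PyTorch; all hyperparameters (vocabulary size, hidden size, number of styles, length range) and the module name are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class StyleLengthEmbedding(nn.Module):
    """Sketch: add learned style and target-length embeddings to token
    embeddings so a BERT-style caption model can condition on both.
    All sizes below are illustrative assumptions, not the paper's values."""

    def __init__(self, vocab_size=30522, hidden=768, num_styles=3, max_len=50):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.style_emb = nn.Embedding(num_styles, hidden)     # e.g. positive / factual / negative
        self.length_emb = nn.Embedding(max_len + 1, hidden)   # desired caption length bucket

    def forward(self, token_ids, style_id, target_length):
        # token_ids: (batch, seq_len); style_id, target_length: (batch,)
        tok = self.token_emb(token_ids)
        sty = self.style_emb(style_id).unsqueeze(1)            # broadcast over positions
        length = self.length_emb(target_length).unsqueeze(1)
        return tok + sty + length                              # input to the transformer encoder


if __name__ == "__main__":
    emb = StyleLengthEmbedding()
    ids = torch.randint(0, 30522, (2, 12))
    out = emb(ids, torch.tensor([0, 2]), torch.tensor([10, 15]))
    print(out.shape)  # torch.Size([2, 12, 768])
```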