IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Special Section on Knowledge-Based Software Engineering
Contextualized Language Generation on Visual-to-Language Storytelling
Rizal Setya PERDANAYoshiteru ISHIDA
Author information

2022 Volume E105.D Issue 5 Pages 873-886


This study presents a formulation for generating context-aware natural language by machine from visual representation. Given an image sequence input, the visual storytelling task (VST) aims to generate a coherent, object-focused, and contextualized sentence story. Previous works in this domain faced a problem in modeling an architecture that works in temporal multi-modal data, which led to a low-quality output, such as low lexical diversity, monotonous sentences, and inaccurate context. This study introduces a further improvement, that is, an end-to-end architecture, called cross-modal contextualize attention, optimized to extract visual-temporal features and generate a plausible story. Visual object and non-visual concept features are encoded from the convolutional feature map, and object detection features are joined with language features. Three scenarios are defined in decoding language generation by incorporating weights from a pre-trained language generation model. Extensive experiments are conducted to confirm that the proposed model outperforms other models in terms of automatic metrics and manual human evaluation.

Related papers from these authors
© 2022 The Institute of Electronics, Information and Communication Engineers
Previous article Next article