Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
34th (2020)
Session ID : 2D1-GS-9-05

News Image Caption Generation
*Zhishen YANG, Naoaki OKAZAKI

Abstract

Vision and language, a vibrant field of multimodal machine learning research, aims to create models that comprehend information across the visual and linguistic modalities. In this work, we apply a multimodal Transformer model with a joint text-vision representation to one vision-and-language task: news image caption generation. The model draws on context from the news article while taking into account the scene in the associated image to generate a caption. Experimental results show that the multimodal Transformer significantly improves the quality of the generated news image captions.

© 2020 The Japanese Society for Artificial Intelligence