Host: The Japanese Society for Artificial Intelligence
Name: 34th Annual Conference, 2020
Number: 34
Location: Online
Date: June 09, 2020 - June 12, 2020
Vision and language is a vibrant multimodal machine learning research field that aims to create models capable of comprehending information across the vision and language modalities. In this work, we apply a multimodal Transformer model with a joint text-vision representation to one of the vision and language tasks: news image caption generation. The multimodal Transformer leverages context from the article together with the scene depicted in the associated image to generate a caption. Experimental results demonstrate that the multimodal Transformer significantly improves the quality of generated news image captions.
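As a minimal sketch of the idea behind a joint text-vision representation (not the authors' actual implementation; the dimensions, projection matrices, and function names below are illustrative assumptions), article token embeddings and image region features can each be projected into a shared hidden space, tagged with a modality-type embedding, and concatenated into a single sequence that a Transformer encoder can attend over:

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 64                 # assumed shared hidden size
TEXT_DIM, IMG_DIM = 48, 80  # assumed raw feature sizes per modality

# Hypothetical learned projections mapping each modality into the shared space.
W_text = rng.normal(0.0, 0.02, (TEXT_DIM, HIDDEN))
W_img = rng.normal(0.0, 0.02, (IMG_DIM, HIDDEN))

# Hypothetical modality-type embeddings (random placeholders here).
type_text = rng.normal(0.0, 0.02, (HIDDEN,))
type_img = rng.normal(0.0, 0.02, (HIDDEN,))

def joint_representation(text_feats: np.ndarray, img_feats: np.ndarray) -> np.ndarray:
    """Build one joint sequence from article tokens and image regions."""
    text_h = text_feats @ W_text + type_text  # (n_tokens, HIDDEN)
    img_h = img_feats @ W_img + type_img      # (n_regions, HIDDEN)
    # Concatenate along the sequence axis: the Transformer sees both modalities.
    return np.concatenate([text_h, img_h], axis=0)

# Example: 20 article tokens and 5 detected image regions.
tokens = rng.normal(size=(20, TEXT_DIM))
regions = rng.normal(size=(5, IMG_DIM))
seq = joint_representation(tokens, regions)
print(seq.shape)  # (25, 64)
```

In a full captioning model, this joint sequence would feed a Transformer encoder, and a decoder would attend to it while generating the caption token by token.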