Organizer: The Japanese Society for Artificial Intelligence
Conference: The 34th Annual Conference of JSAI (2020)
Edition: 34
Venue: Online
Dates: 2020/06/09 - 2020/06/12
Vision and language is a vibrant multimodal machine learning research field that aims to create models capable of comprehending information across the vision and language modalities. In this work, we applied a multimodal Transformer model with a joint text-vision representation to one of the vision-and-language tasks: news image caption generation. The multimodal Transformer leverages context from the article while taking the scene in the associated image into account to generate captions. The experimental results demonstrated that the multimodal Transformer significantly improved the quality of the generated news image captions.
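A minimal sketch of the general idea, assuming the joint text-vision representation is formed by embedding article tokens and projected image-region features into a shared space and concatenating them before standard Transformer encoding. All names, dimensions, and the use of PyTorch here are illustrative assumptions, not the authors' implementation; positional encodings and pretrained encoders are omitted for brevity.

```python
import torch
import torch.nn as nn


class MultimodalCaptioner(nn.Module):
    """Illustrative multimodal Transformer for news image caption generation."""

    def __init__(self, vocab_size, d_model=512, image_feat_dim=2048,
                 nhead=8, num_layers=4):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Project pre-extracted image-region features (e.g., from a CNN or
        # object detector) into the same space as the text embeddings.
        self.image_proj = nn.Linear(image_feat_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, article_ids, image_feats, caption_ids):
        # Joint text-vision representation: article tokens and image regions
        # are embedded and concatenated along the sequence dimension, so the
        # encoder attends over both modalities at once.
        text = self.token_embed(article_ids)      # (B, L_text, d_model)
        vision = self.image_proj(image_feats)     # (B, L_img, d_model)
        joint = torch.cat([text, vision], dim=1)  # (B, L_text + L_img, d_model)

        # Caption tokens are decoded autoregressively against the joint memory.
        tgt = self.token_embed(caption_ids)       # (B, L_cap, d_model)
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            caption_ids.size(1))
        decoded = self.transformer(joint, tgt, tgt_mask=tgt_mask)
        return self.output_proj(decoded)          # (B, L_cap, vocab_size) logits


# Toy usage with random inputs (hypothetical shapes).
model = MultimodalCaptioner(vocab_size=10000)
article = torch.randint(0, 10000, (2, 50))   # article token ids
regions = torch.randn(2, 36, 2048)           # image-region features
caption = torch.randint(0, 10000, (2, 15))   # shifted caption token ids
logits = model(article, regions, caption)    # (2, 15, 10000)
```

At inference time, captions would be generated token by token (e.g., greedy or beam search) by feeding the previously generated tokens back in as `caption_ids`.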