Anticipation Captioning with Commonsense Knowledge

Duc Minh VO; Quoc-An LUONG; Akihiro SUGIMOTO; Hideki NAKAYAMA

doi:10.11370/isj.62.588

Abstract

In this review, we introduce a novel image captioning task, called Anticipation Captioning, which generates a caption for an unseen image given a sparse, temporally ordered set of images. The task emulates the human capacity to reason about the future from a sparse collection of visual cues acquired over time. To address this challenge, we introduce a model, A-CAP, that predicts the caption by incorporating commonsense knowledge into a pre-trained vision-language model. Our method outperforms image captioning methods and provides a solid baseline for the anticipation captioning task, as shown in both qualitative and quantitative evaluations on a customized visual storytelling dataset. We also discuss the potential applications, challenges, and future directions of this task.
