Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
Improving Conversation Task with Visual Scene Dataset
Abdurrisyad Fikri, Hiep V. Le, Takashi Miyazaki, Manabu Okumura, Nobuyuki Shimizu

2022 Volume 29 Issue 1 Pages 166-186

Abstract

To build good conversation agents, an accurate representation of the conversation context is required. Since using images as conversation contexts has proven effective, we argue that a conversation scene that includes the speakers can provide even more information about the context. We constructed a visual conversation scene dataset (VCSD) that provides scenic images corresponding to conversations. Each example combines (1) a conversation scene image (third-person view), (2) the corresponding first utterance and its response, and (3) the corresponding speaker, respondent, and topic object. In our experiments on the response-selection task, we first examined BERT (text only) as a baseline. Although BERT performed well on general conversations, where a response follows from the previous utterance, it failed in cases where visual information was necessary to understand the context. Our error analysis found that conversations requiring visual context fall into three types: visual question answering, image-referring responses, and scene understanding. To exploit the conversation scene images and their focused parts, that is, the speaker, respondent, and topic object, we proposed a model that receives texts and multiple image features as inputs. Our model captures this information and achieves 91% accuracy.
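The model described in the abstract fuses text with image features of the speaker, respondent, and topic object to score candidate responses. The following is a minimal illustrative sketch, not the authors' implementation: it assumes precomputed embedding vectors (e.g., a BERT utterance embedding and three image-region features), fuses them by concatenation, and selects the candidate response with the highest dot-product score. All function names and dimensions here are hypothetical.

```python
import numpy as np

def fuse_context(text_vec, speaker_vec, respondent_vec, object_vec):
    """Concatenate the utterance embedding with the three region
    features (speaker, respondent, topic object) into one context
    vector. A simple stand-in for the paper's multimodal fusion."""
    return np.concatenate([text_vec, speaker_vec, respondent_vec, object_vec])

def select_response(context_vec, candidate_vecs):
    """Score each candidate response embedding against the fused
    context by dot product and return the index of the best one."""
    scores = candidate_vecs @ context_vec
    return int(np.argmax(scores))

# Toy example with random embeddings of dimension d per modality.
rng = np.random.default_rng(0)
d = 8
ctx = fuse_context(rng.normal(size=d), rng.normal(size=d),
                   rng.normal(size=d), rng.normal(size=d))
candidates = rng.normal(size=(5, 4 * d))  # 5 candidate responses
best = select_response(ctx, candidates)
print(best)
```

In practice the fusion and scoring would be learned (e.g., by fine-tuning a transformer on text plus region features), but the sketch shows the input structure the dataset supports: one text context plus multiple focused image features per conversation.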

© 2022 The Association for Natural Language Processing