Building good conversation agents requires an accurate understanding of the conversation context. Because images have already proven effective as conversation contexts, we argue that a conversation scene that includes the speakers can provide even richer contextual information. We constructed a Visual Conversation Scene Dataset (VCSD) that provides scene images corresponding to conversations. Each example combines (1) a conversation scene image (third-person view), (2) the corresponding first utterance and its response, and (3) the corresponding speaker, respondent, and topic object. In our experiments on a response-selection task, we first examined BERT (text only) as a baseline. Although BERT performed well on general conversations, where a response follows directly from the previous utterance, it failed in cases where visual information was necessary to understand the context. Our error analysis found that conversations requiring visual context fall into three types: visual question answering, image-referring responses, and scene understanding. To make full use of the conversation scene images and their salient regions, that is, the speaker, respondent, and topic object, we proposed a model that receives text and multiple image features as input. Our model captures this information and achieves 91% accuracy.
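The abstract does not specify the fusion architecture beyond "text and multiple image features as input." As a rough illustration only, the sketch below assumes pre-extracted CNN features for four regions (scene, speaker, respondent, topic object) fused with a BERT text encoding by simple concatenation; all class names, dimensions, and design choices here are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextMultiImageResponseSelector(nn.Module):
    """Hypothetical response-selection model: fuses a BERT encoding of
    (utterance, candidate response) with several image-region features."""

    def __init__(self, image_feat_dim=2048, num_regions=4, hidden=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Project each region feature into BERT's hidden space.
        self.img_proj = nn.Linear(image_feat_dim, hidden)
        # One logit per candidate: is this the correct response?
        self.classifier = nn.Linear(hidden * (1 + num_regions), 1)

    def forward(self, input_ids, attention_mask, region_feats):
        # region_feats: (batch, num_regions, image_feat_dim), e.g. CNN
        # features for [scene, speaker, respondent, topic object].
        text_vec = self.bert(input_ids=input_ids,
                             attention_mask=attention_mask).pooler_output
        img_vecs = torch.tanh(self.img_proj(region_feats))   # (B, R, H)
        fused = torch.cat([text_vec, img_vecs.flatten(1)], dim=-1)
        return self.classifier(fused).squeeze(-1)

# Usage sketch with stand-in region features:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TextMultiImageResponseSelector()
enc = tokenizer("Where did you put it?", "It's on the table.",
                return_tensors="pt")
feats = torch.randn(1, 4, 2048)  # placeholder for real region features
score = model(enc["input_ids"], enc["attention_mask"], feats)
```

Concatenation is only one plausible fusion strategy; attention over the region features would be an equally reasonable reading of the abstract.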