J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution

植田 暢大; 波部 英子; 松井 陽子; 湯口 彰重; 河野 誠也; 川西 康友; 黒橋 禎夫; 吉野 幸一郎

doi:10.5715/jnlp.31.1107

Abstract

Understanding the situation in the physical world is crucial for robots assisting humans in the real world. In particular, when a robot is required to collaborate with humans through verbal interactions such as dialogues, the verbal information that appears in user interactions must be grounded in the visual information observed in egocentric views. To this end, we proposed a multimodal reference resolution task and constructed a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3). Our dataset contains egocentric videos and dialogue audio from real-world conversations between two people acting as a master and an assistant robot at home. It is annotated with crossmodal tags between phrases in the utterances and object bounding boxes in the video frames. These tags include direct reference relations as well as indirect reference relations, such as predicate-argument structures and bridging references. We also constructed an experimental model that combines an existing textual reference resolution model with a phrase grounding model. Our experiments with this model showed that, in the proposed task, crossmodal reference resolution is significantly more challenging than textual reference resolution.

Licensed under CC BY 4.0
https://creativecommons.org/licenses/by/4.0/
