Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution
Nobuhiro Ueda, Hideko Habe, Yoko Matsui, Akishige Yuguchi, Seiya Kawano, Yasutomo Kawanishi, Sadao Kurohashi, Koichiro Yoshino
JOURNAL FREE ACCESS

2024 Volume 31 Issue 3 Pages 1107-1139

Abstract

Understanding the situation in the physical world is crucial for robots that assist humans in real-world environments. In particular, when a robot must collaborate with humans through verbal interaction such as dialogue, the linguistic expressions in user utterances must be grounded in the visual information observed in egocentric views. To this end, we proposed a multimodal reference resolution task and constructed a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3). The dataset contains egocentric videos and dialogue audio from real-world conversations between two people acting as a master and an assistant robot at home. It is annotated with crossmodal tags that link phrases in the utterances to object bounding boxes in the video frames. These tags cover direct reference relations as well as indirect reference relations, such as predicate-argument structures and bridging references. We also constructed an experimental model that combines an existing textual reference resolution model with a phrase grounding model. Our experiments with this model showed that, in the proposed task, crossmodal reference resolution is significantly more challenging than textual reference resolution.

© 2024 The Association for Natural Language Processing