2024 Volume 31 Issue 3 Pages 1107-1139
Understanding the situation in the physical world is crucial for robots assisting humans in the real world. In particular, when a robot is required to collaborate with humans through verbal interactions such as dialogue, the verbal information appearing in user utterances must be grounded in the visual information observed from egocentric views. To this end, we proposed a multimodal reference resolution task and constructed a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3). Our dataset contains egocentric videos and dialogue audio from real-world conversations between two participants acting as a human master and an assistant robot in a home setting. It is annotated with crossmodal tags linking phrases in the utterances to object bounding boxes in the video frames. These tags cover not only direct reference relations but also indirect ones, such as predicate-argument structures and bridging references. We also constructed an experimental model that combines an existing textual reference resolution model with a phrase grounding model. Our experiments with this model showed that, in the proposed task, crossmodal reference resolution is significantly more challenging than textual reference resolution.
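As a rough illustration of the kind of annotation described above, the following is a minimal sketch of how a single crossmodal tag could be represented, linking a phrase span in an utterance to an object bounding box in a video frame together with its relation type. This is not the dataset's actual schema; all class and field names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical relation types: a direct mention, an indirect relation via a
# predicate-argument structure, or an indirect bridging reference.
RelationType = Literal["direct", "predicate_argument", "bridging"]


@dataclass
class BoundingBox:
    """Object region in a single video frame (pixel coordinates)."""
    frame_index: int
    x: float
    y: float
    width: float
    height: float
    object_label: str  # e.g., "cup"


@dataclass
class CrossmodalTag:
    """Links a phrase span in an utterance to an object bounding box."""
    utterance_id: str
    phrase_start: int  # character offset of the phrase in the utterance
    phrase_end: int
    relation: RelationType
    box: BoundingBox


# Example: the phrase "それ" ("that") in an utterance refers directly to a cup
# visible in frame 120 of the egocentric video.
tag = CrossmodalTag(
    utterance_id="dialogue01-utt05",
    phrase_start=0,
    phrase_end=2,
    relation="direct",
    box=BoundingBox(frame_index=120, x=312.0, y=145.0,
                    width=64.0, height=80.0, object_label="cup"),
)
print(tag)
```

In such a representation, textual reference resolution would operate over the phrase spans and relation types alone, while phrase grounding would predict the bounding boxes; the experimental model combines the two stages.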