Host: The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
In home environments, where the locations of objects change frequently, it is important for robots to quickly and accurately grasp the latest positions of those objects. This study therefore addresses the OSMI-3D task, in which target objects are identified from instructions given by users. We propose a method for efficiently manipulating objects in a home environment through referring expression segmentation on 3D point cloud data, using both a visual foundation model and a multimodal LLM. The main novelty of this study is the introduction of a Scene Narrative Module, which combines the multimodal LLM with existing image feature extractors to extract structural features from images with language as an intermediary. In experiments, our method outperformed baseline methods in terms of mean IoU and precision@0.5-0.9, confirming its effectiveness on the OSMI-3D task.
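The abstract describes the Scene Narrative Module as fusing language produced by a multimodal LLM with features from an existing image feature extractor. A minimal sketch of such a pipeline is given below; every function name and the fusion scheme (simple concatenation of normalized vectors) are illustrative assumptions, not the paper's actual implementation, and the narrator and encoders are stand-in placeholders rather than real models.

```python
import numpy as np

# Hypothetical sketch of a language-mediated fusion pipeline,
# loosely inspired by the described Scene Narrative Module.
# All names and the fusion scheme are assumptions for illustration.

def narrate_scene(image: np.ndarray) -> str:
    """Stand-in for a multimodal LLM that produces a scene narrative."""
    return "a mug on the kitchen table next to a laptop"

def embed_text(narrative: str, dim: int = 64) -> np.ndarray:
    """Toy bag-of-words embedding (placeholder for a real text encoder)."""
    vec = np.zeros(dim)
    for token in narrative.split():
        vec[hash(token) % dim] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-8)

def extract_image_features(image: np.ndarray, dim: int = 64) -> np.ndarray:
    """Placeholder for an existing image feature extractor."""
    flat = image.reshape(-1)[:dim]
    out = np.zeros(dim)
    out[: flat.size] = flat
    return out / max(np.linalg.norm(out), 1e-8)

def scene_narrative_module(image: np.ndarray) -> np.ndarray:
    """Fuse narrative-derived and visual features by concatenation."""
    text_feat = embed_text(narrate_scene(image))
    img_feat = extract_image_features(image)
    return np.concatenate([text_feat, img_feat])

if __name__ == "__main__":
    image = np.random.default_rng(0).random((8, 8, 3))
    fused = scene_narrative_module(image)
    print(fused.shape)  # (128,)
```

The fused vector would then condition a downstream segmentation head that predicts the target object's mask over the 3D point cloud; that head is omitted here.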