Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 2O6-OS-16a-02

Referring Expression Segmentation using Optimal Transport Polygon Matching with Multimodal Foundation Models
*Takayuki NISHIMURA, Katsuyuki KUYO, Motonari KAMBARA, Komei SUGIURA
CONFERENCE PROCEEDINGS FREE ACCESS

Abstract

In home environments, where objects are frequently moved, robots need to identify the current positions of those objects quickly and accurately. This study therefore addresses the OSMI-3D task, in which target objects are identified based on instructions given by users. We propose a method for efficient object manipulation in home environments that performs referring expression segmentation on 3D point cloud data, using both a visual foundation model and a multimodal LLM. The main novelty of this study is the introduction of a Scene Narrative Module, which combines the multimodal LLM with existing image feature extractors to extract structural features from images, mediated by language. In experiments, our method outperformed baseline methods in terms of mean IoU and precision at 0.5 to 0.9, confirming its effectiveness on the OSMI-3D task.
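
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how mean IoU and precision at a threshold are commonly computed in referring expression segmentation. It is not taken from the paper: the function names (`iou`, `mean_iou`, `precision_at`), the flattened binary-mask representation, the handling of empty masks, and the use of `>=` for the threshold comparison are all assumptions made for illustration.

```python
from typing import Sequence


def iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Intersection over union of two flattened binary masks (hypothetical helper)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def mean_iou(ious: Sequence[float]) -> float:
    """Average of per-sample IoU values."""
    return sum(ious) / len(ious)


def precision_at(ious: Sequence[float], threshold: float) -> float:
    """Fraction of samples whose IoU reaches the threshold.

    Assumption: the comparison is inclusive (>=); some evaluations use a
    strict inequality instead.
    """
    return sum(1 for v in ious if v >= threshold) / len(ious)


if __name__ == "__main__":
    sample_ious = [0.5, 0.9, 0.3, 1.0]  # made-up values, not results from the paper
    print("mean IoU:", mean_iou(sample_ious))
    for k in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"P@{k}:", precision_at(sample_ious, k))
```

Under this reading, "precision at 0.5-0.9" would correspond to calling `precision_at` once for each threshold in that range and reporting each value separately.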

© 2024 The Japanese Society for Artificial Intelligence