Host: The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
In home environments, where the locations of objects change frequently, it is important for robots to quickly and accurately grasp the latest positions of those objects. This study therefore addresses the OSMI-3D task, in which target objects are identified from instructions given by users. We propose a method for efficiently manipulating objects in a home environment through referring expression segmentation on 3D point cloud data, using both a visual foundation model and a multimodal LLM. The main novelty of this study is the introduction of a Scene Narrative Module, which combines the multimodal LLM with existing image feature extractors to extract structural features from images with language as an intermediary. In experiments, our method outperformed baseline methods in terms of mean IoU and precision@0.5-0.9, confirming its effectiveness on the OSMI-3D task.
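The abstract describes the Scene Narrative Module as fusing language produced by a multimodal LLM with features from an existing image feature extractor. A minimal sketch of such a pipeline is given below; every function name and the fusion scheme (simple concatenation of normalized vectors) are illustrative assumptions, not the paper's actual implementation, and the narrator and encoders are stand-in placeholders rather than real models.

```python
import numpy as np

# Hypothetical sketch of a language-mediated fusion pipeline,
# loosely inspired by the described Scene Narrative Module.
# All names and the fusion scheme are assumptions for illustration.

def narrate_scene(image: np.ndarray) -> str:
    """Stand-in for a multimodal LLM that produces a scene narrative."""
    return "a mug on the kitchen table next to a laptop"

def embed_text(narrative: str, dim: int = 64) -> np.ndarray:
    """Toy bag-of-words embedding (placeholder for a real text encoder)."""
    vec = np.zeros(dim)
    for token in narrative.split():
        vec[hash(token) % dim] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-8)

def extract_image_features(image: np.ndarray, dim: int = 64) -> np.ndarray:
    """Placeholder for an existing image feature extractor."""
    flat = image.reshape(-1)[:dim]
    out = np.zeros(dim)
    out[: flat.size] = flat
    return out / max(np.linalg.norm(out), 1e-8)

def scene_narrative_module(image: np.ndarray) -> np.ndarray:
    """Fuse narrative-derived and visual features by concatenation."""
    text_feat = embed_text(narrate_scene(image))
    img_feat = extract_image_features(image)
    return np.concatenate([text_feat, img_feat])

if __name__ == "__main__":
    image = np.random.default_rng(0).random((8, 8, 3))
    fused = scene_narrative_module(image)
    print(fused.shape)  # (128,)
```

The fused vector would then condition a downstream segmentation head that predicts the target object's mask over the 3D point cloud; that head is omitted here.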