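Task Success Prediction on Large-Scale Object Manipulation Datasets Based on Multimodal LLMs and Vision-Language Foundation Models

Daichi Saito; Motonari Kambara; Katsuyuki Kuyo; Komei Sugiura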

doi:10.11517/pjsai.JSAI2024.0_3O1OS16b02

38th (2024)

Session ID : 3O1-OS-16b-02

Conference information

Host: The Japanese Society for Artificial Intelligence

Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence

Number : 38

Date : May 28, 2024 - May 31, 2024

Task Success Prediction on Large-Scale Object Manipulation Datasets Based on Multimodal LLMs and Vision-Language Foundation Models

*Daichi SAITO, Motonari KAMBARA, Katsuyuki KUYO, Komei SUGIURA

Keywords: Manipulator, Object Manipulation, Vision-and-Language, Multimodal LLM, Success Prediction

Abstract

High-performance task success prediction is crucial for improving model performance in object manipulation tasks. However, the performance of existing methods remains insufficient. Moreover, existing prediction mechanisms are designed for specific tasks only, which makes it difficult to accommodate a diverse range of tasks. This study therefore aims to develop a task success prediction mechanism that can handle multiple object manipulation tasks. A key novelty of the proposed method is the introduction of the λ-Representation, which preserves all types of visual features: visual characteristics such as colors and shapes, features aligned with natural language, and features structured through natural language. For the experiments, we built new datasets for task success prediction in object manipulation tasks based on the RT-1 dataset and VLMbench. The results show that the proposed method outperforms all baseline methods in accuracy.
