Do Generative Models Capture Spatial Concepts? Proposal of a Spatial Understanding Task Using Design Patent Data

竹中 誠; 谷中 瞳

doi:10.11517/pjsai.JSAI2024.0_4Xin2111

Abstract

Drawing on various kinds of prior knowledge, humans can imagine how an object looks from different viewpoints. In this paper, we propose a task for evaluating whether recent large multimodal models have this capability, and we use it to analyze current models. Specifically, the model is given an image of an object in an isometric view together with an image of the same object from a different viewpoint, and is asked about the viewpoints of the two images. We constructed the evaluation dataset from sketch images in a design patent database and the captions that describe their viewpoints. In the experiment, we use this dataset to analyze the spatial reasoning ability of GPT-4V. Based on the experimental results, we discuss the potential and the challenges of GPT-4V's spatial reasoning ability.
