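Unlocking the Potential of Vision Large Language Models: Enhancing Spatial Cognition with Multi-Layered Cognitive Maps and Spatial Information Prompts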

馮 奇

doi:10.11517/pjsai.JSAI2025.0_3N6GS704

Abstract

This study investigates the spatial reasoning capabilities of vision-language large models (VLLMs) and proposes an approach to unlock their potential. We explored whether multi-layered cognitive maps and prompts that incorporate spatial information can enhance the spatial reasoning of VLLMs. The method constructs cognitive maps at several resolutions and also generates maps of flexible size. In addition, question-answer pairs about spatial scale and navigation were designed and presented to the models. For evaluation, we compared LLaVA-OneVision and Gemini-1.5-Flash on the VSI-Bench dataset. The results indicate that flexibly sized cognitive maps improved the performance of LLaVA-OneVision. In contrast, closed-source models showed degraded performance when the additional information was inaccurate. We conclude that VLLMs can grasp local spatial relationships but still struggle to understand global spatial structure. The proposed approach is particularly effective at enhancing spatial cognition in open-source models, and further gains appear promising through dataset development and the introduction of specialized tokens.
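The abstract does not specify how the cognitive maps are encoded, so the following is only a minimal sketch of one possible reading: objects with normalized positions are binned into grids of several resolutions (the "layers"), a grid size is derived from the room dimensions (the "flexible size"), and the grid is serialized into a text prompt. All names here (`build_cognitive_map`, `multi_layer_maps`, `flexible_size`, `to_prompt`), the normalized coordinates, and the default cell size are assumptions for illustration, not the author's implementation.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # assumed: (x, y) normalized to [0, 1]


def build_cognitive_map(objects: Dict[str, Point], rows: int, cols: int) -> List[List[str]]:
    """Bin each object label into a rows x cols grid by its normalized position."""
    grid = [["." for _ in range(cols)] for _ in range(rows)]
    for name, (x, y) in objects.items():
        # Clamp so that a coordinate of exactly 1.0 lands in the last cell.
        c = min(int(x * cols), cols - 1)
        r = min(int(y * rows), rows - 1)
        # Objects sharing a cell are joined with "+".
        grid[r][c] = name if grid[r][c] == "." else grid[r][c] + "+" + name
    return grid


def multi_layer_maps(objects: Dict[str, Point],
                     resolutions: Tuple[int, ...] = (3, 5, 10)) -> Dict[int, List[List[str]]]:
    """Build one square map per resolution (hypothetical 'multi-layered' map)."""
    return {n: build_cognitive_map(objects, n, n) for n in resolutions}


def flexible_size(room_w_m: float, room_h_m: float, cell_m: float = 1.0) -> Tuple[int, int]:
    """Choose (rows, cols) from room dimensions in meters (hypothetical 'flexible size')."""
    rows = max(1, math.ceil(room_h_m / cell_m))
    cols = max(1, math.ceil(room_w_m / cell_m))
    return rows, cols


def to_prompt(grid: List[List[str]], question: str) -> str:
    """Serialize a grid and a question into a plain-text prompt."""
    header = f"Cognitive map ({len(grid)}x{len(grid[0])} grid, top-left is the origin):"
    lines = [" | ".join(row) for row in grid]
    return "\n".join([header, *lines, f"Question: {question}"])
```

For example, `to_prompt(build_cognitive_map(objects, *flexible_size(4.2, 2.0)), "Which object is closest to the sofa?")` would produce a 2x5 map followed by the question, which could then be passed to a model alongside the video frames.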
