Host: The Japanese Society for Artificial Intelligence
Name: The 39th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 39
Location: [in Japanese]
Date: May 27, 2025 - May 30, 2025
This study investigates the spatial reasoning capabilities of vision-language large models (VLLMs) and proposes a novel approach to unlock their potential. We explored two ways to enhance spatial reasoning: multi-layered cognitive maps and prompts that incorporate spatial information. The method constructs cognitive maps at varying resolutions and generates maps of flexible sizes. We also designed question-answer pairs on spatial scale and navigation and presented them to the models. For evaluation, we compared LLaVA-OneVision and Gemini-1.5-Flash on the VSI-Bench dataset. The results indicate that flexible-size cognitive maps improved the performance of LLaVA-OneVision. In contrast, closed-source models showed degraded performance when the additional information was inaccurate. We conclude that VLLMs can grasp local spatial relationships but still struggle to understand global spatial structure. The proposed approach is particularly effective at enhancing spatial cognition in open-source models, and further gains appear attainable through dataset development and the introduction of specialized tokens.
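The abstract does not specify how the cognitive maps are encoded, so the following is only a minimal sketch of one possible reading: object positions are binned onto a 2D grid whose dimensions follow the room size (a "flexible-size" map), with the cell size controlling the resolution. The function names `build_cognitive_map` and `map_to_prompt`, the metric coordinates, and the text layout of the prompt are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Dict, Tuple

Position = Tuple[float, float]  # (x, y) in meters; hypothetical input format


def build_cognitive_map(
    objects: Dict[str, Position],
    room_size: Tuple[float, float],
    cell_size: float,
) -> Tuple[int, int, Dict[str, Tuple[int, int]]]:
    """Bin object centers onto a grid sized to the room.

    The grid shape is derived from the room extent (assumed meaning of
    "flexible size"); cell_size sets the resolution of the map.
    """
    width_m, depth_m = room_size
    cols = max(1, math.ceil(width_m / cell_size))
    rows = max(1, math.ceil(depth_m / cell_size))
    cells: Dict[str, Tuple[int, int]] = {}
    for name, (x, y) in objects.items():
        # Clamp so objects on or beyond the room boundary stay inside the grid.
        col = min(max(int(x / cell_size), 0), cols - 1)
        row = min(max(int(y / cell_size), 0), rows - 1)
        cells[name] = (row, col)
    return rows, cols, cells


def map_to_prompt(rows: int, cols: int, cells: Dict[str, Tuple[int, int]]) -> str:
    """Render the map as plain text that could be prepended to a question."""
    lines = [f"Cognitive map: {rows} x {cols} grid, positions as (row, col)."]
    for name in sorted(cells):
        row, col = cells[name]
        lines.append(f"- {name}: ({row}, {col})")
    return "\n".join(lines)


# Hypothetical scene: the same room mapped at two resolutions.
scene = {"sofa": (0.5, 1.2), "table": (3.9, 2.0)}
coarse = build_cognitive_map(scene, room_size=(4.0, 3.0), cell_size=1.0)
fine = build_cognitive_map(scene, room_size=(4.0, 3.0), cell_size=0.5)
prompt = map_to_prompt(*coarse)
```

Stacking the coarse and fine maps in one prompt would be one way to approximate a multi-layered map, though the layering actually used in the study may differ.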