Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : Hamamatsu, Shizuoka, Japan
Date : May 28, 2024 - May 31, 2024
CLIP has been applied to a wide range of tasks as a groundbreaking model for joint understanding of vision and language. However, prior studies have pointed out that CLIP's encoders do not capture spatial relationships between visual objects with sufficient accuracy, so using CLIP as-is is insufficient for grounding linguistic descriptions of relative position. This study proposes a model for relative position understanding built on ReCLIP, a method that applies CLIP to the comprehension of referring expressions requiring spatial reasoning. In evaluation experiments on the RefGTA dataset, the proposed model improves on ReCLIP by 1-2% for the spatial relation "in front of", and by 12-13% on data that requires inference based on depth and orientation.
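As a rough illustration of the approach the abstract describes, the Python sketch below scores candidate boxes against a referring expression with CLIP in the ReCLIP style (crop each proposal, score it against the text) and mixes in a depth-flavored term for "in front of". This is not the authors' released code: the checkpoint name, the mixing weight alpha, and the closeness heuristic (using a box's bottom edge as a crude proxy for nearness in street-level scenes like RefGTA) are all illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL = "openai/clip-vit-base-patch32"  # assumed checkpoint, not from the paper
model = CLIPModel.from_pretrained(MODEL)
processor = CLIPProcessor.from_pretrained(MODEL)

def clip_scores(image, boxes, expression):
    """CLIP similarity between the expression and each candidate box crop."""
    crops = [image.crop(b) for b in boxes]  # b = (left, top, right, bottom)
    inputs = processor(text=[expression], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_crops, 1): one similarity per crop
    return out.logits_per_image.squeeze(-1)

def closeness(boxes, image_height):
    """Crude stand-in for monocular depth estimation: in street-level scenes,
    a box whose bottom edge sits lower in the frame is usually closer to the
    camera. A real system would use an actual depth model here."""
    return torch.tensor([b[3] / image_height for b in boxes], dtype=torch.float)

def resolve(image, boxes, expression, alpha=0.7):
    """Pick the box best matching the expression; alpha is an assumed weight."""
    scores = clip_scores(image, boxes, expression).softmax(dim=0)
    if "in front of" in expression.lower():
        # Bias toward nearer objects when the relation demands depth reasoning.
        scores = alpha * scores + (1 - alpha) * closeness(boxes, image.height)
    return boxes[int(scores.argmax())]
```

The point of the heuristic term is the one the abstract makes: CLIP's crop-level similarity alone carries little information about which object is nearer, so relations such as "in front of" need an explicit depth- or orientation-aware signal added on top of the CLIP score.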