Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : Hamamatsu, Shizuoka, Japan
Date : May 28, 2024 - May 31, 2024
CLIP has been applied to a wide range of tasks as a groundbreaking model for joint understanding of vision and language. However, prior studies have pointed out that CLIP's encoders do not capture spatial relationships between visual objects with sufficient accuracy, so using CLIP as-is is insufficient for grounding linguistic descriptions of relative position. This study proposes a model for relative position understanding built on ReCLIP, a method that applies CLIP to the comprehension of referring expressions requiring spatial reasoning. In evaluation experiments on the RefGTA dataset, the proposed model improves on ReCLIP by 1-2% for the spatial relation "in front of", and by 12-13% on data that requires inference based on depth and orientation.
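As a rough illustration of the approach the abstract describes, the Python sketch below scores candidate boxes against a referring expression with CLIP in the ReCLIP style (crop each proposal, score it against the text) and mixes in a depth-flavored term for "in front of". This is not the authors' released code: the checkpoint name, the mixing weight alpha, and the closeness heuristic (using a box's bottom edge as a crude proxy for nearness in street-level scenes like RefGTA) are all illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL = "openai/clip-vit-base-patch32"  # assumed checkpoint, not from the paper
model = CLIPModel.from_pretrained(MODEL)
processor = CLIPProcessor.from_pretrained(MODEL)

def clip_scores(image, boxes, expression):
    """CLIP similarity between the expression and each candidate box crop."""
    crops = [image.crop(b) for b in boxes]  # b = (left, top, right, bottom)
    inputs = processor(text=[expression], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_crops, 1): one similarity per crop
    return out.logits_per_image.squeeze(-1)

def closeness(boxes, image_height):
    """Crude stand-in for monocular depth estimation: in street-level scenes,
    a box whose bottom edge sits lower in the frame is usually closer to the
    camera. A real system would use an actual depth model here."""
    return torch.tensor([b[3] / image_height for b in boxes], dtype=torch.float)

def resolve(image, boxes, expression, alpha=0.7):
    """Pick the box best matching the expression; alpha is an assumed weight."""
    scores = clip_scores(image, boxes, expression).softmax(dim=0)
    if "in front of" in expression.lower():
        # Bias toward nearer objects when the relation demands depth reasoning.
        scores = alpha * scores + (1 - alpha) * closeness(boxes, image.height)
    return boxes[int(scores.argmax())]
```

The point of the heuristic term is the one the abstract makes: CLIP's crop-level similarity alone carries little information about which object is nearer, so relations such as "in front of" need an explicit depth- or orientation-aware signal added on top of the CLIP score.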