Proceedings of the Annual Conference of JSAI, 38th (2024)
Online ISSN: 2758-7347
Session ID: 4Xin2-24

Foundation Model that Enables Understanding of Relative Positions in the Human Coordinate System Based on ReCLIP
*Kenta IKEGAYA, Ryo TAGUCHI
Keywords: multimodal AI

Abstract

CLIP has been used in a wide range of tasks as an influential model for joint understanding of vision and language. However, previous studies have pointed out that CLIP's encoders do not capture spatial relationships between visual objects with sufficient accuracy. Consequently, simply applying CLIP is insufficient for understanding relative positions expressed in language. This study proposes a model for relative position understanding based on ReCLIP, a method that applies CLIP to the comprehension of referring expressions that require spatial reasoning. In evaluation experiments on the RefGTA dataset, the proposed model improves on ReCLIP by 1-2% for the spatial relation "in front of", and by 12-13% on data requiring depth- and orientation-based inference.
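To make the pipeline concrete, the sketch below scores candidate object boxes against a referring expression with off-the-shelf CLIP, in the spirit of ReCLIP's isolated-proposal scoring. It is a minimal illustration only: the crop-based scoring uses the open-source openai/CLIP package, and the camera-relative depth heuristic for "in front of" is a hypothetical stand-in for the depth- and orientation-based inference described above, not the authors' implementation.

```python
# Illustrative sketch of ReCLIP-style proposal scoring with CLIP.
# Assumes the open-source openai/CLIP package and a per-box depth estimate;
# the "in front of" heuristic is an assumption, not the paper's method.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_box_scores(image: Image.Image, boxes, text: str) -> torch.Tensor:
    """Score each candidate box by CLIP similarity between its crop and the text."""
    crops = torch.stack([preprocess(image.crop(b)) for b in boxes]).to(device)
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        img = model.encode_image(crops)
        txt = model.encode_text(tokens)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(-1)  # cosine similarity per box

def resolve_in_front_of(image, boxes, target_text, anchor_text, depth):
    """Pick the target box: CLIP appearance score gated by a depth heuristic.

    `depth[i]` is an assumed per-box depth (smaller = closer to the camera);
    here "in front of" is crudely approximated as "closer than the anchor",
    which ignores the anchor's own orientation (the human coordinate system).
    """
    target_s = clip_box_scores(image, boxes, target_text)
    anchor = int(clip_box_scores(image, boxes, anchor_text).argmax())
    closer = torch.tensor([d < depth[anchor] for d in depth],
                          device=target_s.device)
    masked = target_s.masked_fill(~closer, float("-inf"))
    return int(masked.argmax())
```

A camera-relative heuristic like this fails whenever the referenced person faces away from the camera, which is precisely where inference over the person's own depth and orientation becomes necessary.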

© 2024 The Japanese Society for Artificial Intelligence