IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Cross-Modal Deep Interaction and Semantic Aligning for Image-Text Retrieval
Ruidong CHEN, Baohua QIANG, Xianyi YANG, Shihao ZHANG, Yuan XIE
Advance online publication

Article ID: 2024EDP7279

Abstract

Image-text retrieval (ITR) aims to retrieve instances of one modality given a query from the other. The main challenge is mapping images and texts into a common representation space. Although existing methods achieve excellent performance on ITR tasks, they suffer from weak cross-modal information interaction and insufficient capture of deeper associative relationships. To address these problems, we propose CDISA, a Cross-modal Deep Interaction and Semantic Aligning method that combines a vision-language pre-training model with semantic feature extraction capabilities. Specifically, we first design a cross-modal deep interaction module that enhances the interaction between image and text features by performing deep interaction matching computations. Second, to align image and text features, we propose bidirectional cosine matching, which improves the separability of the two modalities within the feature space. We conduct extensive experimental evaluations against recent state-of-the-art ITR methods on three datasets: Wikipedia, Pascal-Sentence, and NUS-WIDE.
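The abstract does not spell out the exact formulation of either component, but the two ideas can be illustrated with a minimal sketch under common assumptions: cross-modal interaction as mutual cross-attention between image and text token features, and bidirectional cosine matching as a symmetric cosine-similarity objective over in-batch pairs. The class name, function name, pooling, and temperature below are all illustrative choices, not the authors' implementation.

```python
# Hypothetical sketch of cross-modal deep interaction and bidirectional
# cosine matching, loosely following the abstract's description.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalInteraction(nn.Module):
    """One possible 'deep interaction' design: each modality attends to
    the other. The paper's actual module may differ."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # Image tokens query the text sequence, and vice versa.
        img_enh, _ = self.img2txt(img_tokens, txt_tokens, txt_tokens)
        txt_enh, _ = self.txt2img(txt_tokens, img_tokens, img_tokens)
        # Pool each enhanced token sequence to one embedding per sample.
        return img_enh.mean(dim=1), txt_enh.mean(dim=1)

def bidirectional_cosine_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric cosine matching: score image-to-text and text-to-image
    retrieval with cross-entropy over in-batch pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / temperature             # cosine similarity matrix
    targets = torch.arange(sim.size(0))           # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(sim, targets)      # retrieve text from image
    loss_t2i = F.cross_entropy(sim.t(), targets)  # retrieve image from text
    return (loss_i2t + loss_t2i) / 2

# Usage: 4 image-text pairs, 16 tokens each, 512-d features.
interact = CrossModalInteraction()
img_emb, txt_emb = interact(torch.randn(4, 16, 512), torch.randn(4, 16, 512))
print(bidirectional_cosine_loss(img_emb, txt_emb))
```

Averaging the two directional losses makes the objective symmetric, so neither retrieval direction dominates training; this matches the "bidirectional" framing in the abstract, though the paper's exact loss may differ.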

© 2025 The Institute of Electronics, Information and Communication Engineers