Proceedings of the Annual Conference of JSAI, 39th (2025)
Online ISSN: 2758-7347
Session ID: 1Win4-52

Scene Text Aware Multimodal Retrieval for Everyday Objects Based on Crosslingual Visual Prompts
*Kento Tokura, Ryosuke Korekata, Takumi Komatsu, Yuto Imai, Komei Sugiura
Abstract

This study explores a task in which a robot retrieves images containing target objects, specified by a user's language query, from a large set of images captured in diverse indoor and outdoor environments. Both images with and without scene text are considered. For example, given the query "Pass me the red container of Sun-Maid raisins on the kitchen counter," the model should rank images containing a container labeled "Sun-Maid raisins" on a kitchen counter higher. However, linking visual semantics with scene text is challenging. Moreover, multimodal retrieval requires large-scale, high-speed inference, making it impractical to rely solely on a multimodal large language model (MLLM). To address this, we introduce a Scene Text Visual Encoder, which integrates an Aligned Representation with a narrative representation obtained from an MLLM via Crosslingual Visual Prompting. Incorporating OCR results into the prompt further reduces hallucination. Experiments show that the proposed method outperforms multimodal foundation models on multiple benchmarks under standard ranking-based evaluation metrics.
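As a rough illustration only (not the paper's actual architecture), the retrieval step the abstract describes — ranking a gallery of images by similarity between a query embedding and a fused visual/scene-text embedding — can be sketched with NumPy. All function names, the fusion weight `alpha`, and the assumption of precomputed, fixed-dimension embeddings are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize vectors so that a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fuse(visual_emb, scene_text_emb, alpha=0.5):
    # Hypothetical fusion: weighted sum of the visual embedding and a
    # scene-text embedding (e.g. derived from OCR results), re-normalized.
    return l2_normalize(alpha * visual_emb + (1 - alpha) * scene_text_emb)

def rank_images(query_emb, image_embs):
    # image_embs: (N, D) matrix of fused, L2-normalized image embeddings.
    # Returns image indices sorted from most to least similar to the query.
    scores = image_embs @ l2_normalize(query_emb)
    return np.argsort(-scores)

# Toy example with 2-D embeddings: image 0 matches the query best.
query = np.array([1.0, 0.0])
gallery = l2_normalize(np.array([[0.9, 0.1],
                                 [0.1, 0.9],
                                 [0.5, 0.5]]))
ranking = rank_images(query, gallery)
```

In practice, precomputing and caching the fused image embeddings is what makes large-scale retrieval fast: only the query embedding and one matrix-vector product are needed at search time, avoiding a per-image MLLM call.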

© 2025 The Japanese Society for Artificial Intelligence