Proceedings of the Annual Conference of JSAI, 39th (2025)
Online ISSN: 2758-7347
Session ID: 1Win4-52

Scene Text Aware Multimodal Retrieval for Everyday Objects Based on Crosslingual Visual Prompts
*Kento Tokura, Ryosuke Korekata, Takumi Komatsu, Yuto Imai, Komei Sugiura
Abstract

This study explores a task in which a robot retrieves images containing target objects, specified by a user's language query, from a large set of images captured in diverse indoor and outdoor environments. Both images with and without scene text are considered. For example, given the query "Pass me the red container of Sun-Maid raisins on the kitchen counter," the model should rank images containing a container labeled "Sun-Maid raisins" on a kitchen counter higher. However, linking visual semantics with scene text is challenging. Moreover, multimodal retrieval requires large-scale, high-speed inference, making it impractical to rely solely on a multimodal large language model (MLLM). To address this, we introduce a Scene Text Visual Encoder, which integrates an Aligned Representation with a narrative representation obtained from an MLLM via Crosslingual Visual Prompting. Incorporating OCR results into the prompt further reduces hallucination. Experiments show that the proposed method outperforms multimodal foundation models on multiple benchmarks under standard ranking-based evaluation metrics.
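As a rough illustration only (not the paper's actual architecture), the retrieval step the abstract describes — ranking a gallery of images by similarity between a query embedding and a fused visual/scene-text embedding — can be sketched with NumPy. All function names, the fusion weight `alpha`, and the assumption of precomputed, fixed-dimension embeddings are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize vectors so that a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def fuse(visual_emb, scene_text_emb, alpha=0.5):
    # Hypothetical fusion: weighted sum of the visual embedding and a
    # scene-text embedding (e.g. derived from OCR results), re-normalized.
    return l2_normalize(alpha * visual_emb + (1 - alpha) * scene_text_emb)

def rank_images(query_emb, image_embs):
    # image_embs: (N, D) matrix of fused, L2-normalized image embeddings.
    # Returns image indices sorted from most to least similar to the query.
    scores = image_embs @ l2_normalize(query_emb)
    return np.argsort(-scores)

# Toy example with 2-D embeddings: image 0 matches the query best.
query = np.array([1.0, 0.0])
gallery = l2_normalize(np.array([[0.9, 0.1],
                                 [0.1, 0.9],
                                 [0.5, 0.5]]))
ranking = rank_images(query, gallery)
```

In practice, precomputing and caching the fused image embeddings is what makes large-scale retrieval fast: only the query embedding and one matrix-vector product are needed at search time, avoiding a per-image MLLM call.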

© 2025 The Japanese Society for Artificial Intelligence