Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
39th (2025)
Session ID : 4Q1-GS-10-01

Improving Retrieval Accuracy of Multimodal RAG Systems by Generating Search Insights from Image-Containing Documents
*Taku FUKUI, Satoshi MUNAKATA
Keywords: AI, RAG, multimodal
CONFERENCE PROCEEDINGS FREE ACCESS

Abstract

When Retrieval-Augmented Generation (RAG) is applied to internal documents to improve business processes, the AI should ideally generate insights into the intent and purpose of tasks, then retrieve relevant documents and answer from them. However, conventional RAG relies on the similarity between query and document embeddings, which makes it difficult to retrieve information from image-containing documents in which such insights are not explicitly stated. Existing Multi-Representation-Indexing methods, which convert image captions into embeddings, also lack this insight-generation capability. This study proposes a novel method that generates insight sentences from image-containing documents to improve retrieval. Each document is decomposed page by page; for each page, an image caption is generated, followed by insight sentences and anticipated question-answer pairs. The generated texts are then converted into embeddings. Experiments on open datasets show that incorporating the generated insights improves retrieval accuracy over conventional approaches.
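
The per-page indexing flow described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the page contents in `PAGES` are hand-written stand-ins for the output of a captioning model and a language model, `embed` is a toy bag-of-words vector standing in for a real embedding model, and the names `Representation`, `build_index`, and `retrieve` are hypothetical. The only idea taken from the abstract is that each page is indexed through several generated texts (caption, insight sentences, question-answer pairs), all of which point back to the same page.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Representation:
    page_id: str  # page this generated text was derived from
    kind: str     # "caption", "insight", or "qa"
    text: str


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(pages: dict[str, dict[str, list[str]]]) -> list[tuple[Counter, Representation]]:
    """Embed every generated text and keep a pointer back to its source page."""
    index = []
    for page_id, generated in pages.items():
        for kind, texts in generated.items():
            for text in texts:
                index.append((embed(text), Representation(page_id, kind, text)))
    return index


def retrieve(query: str, index: list[tuple[Counter, Representation]], top_k: int = 1) -> list[str]:
    """Rank pages by the best-matching representation of each page."""
    q = embed(query)
    best: dict[str, float] = {}
    for vec, rep in index:
        score = cosine(q, vec)
        if score > best.get(rep.page_id, -1.0):
            best[rep.page_id] = score
    return sorted(best, key=lambda p: best[p], reverse=True)[:top_k]


# Hand-written stand-ins for model-generated text (hypothetical example data).
PAGES = {
    "manual-p1": {
        "caption": ["Flowchart showing an invoice passing through three approval steps."],
        "insight": ["The approval chain exists to prevent duplicate payments to suppliers."],
        "qa": ["Q: Why do invoices need several approvals? "
               "A: To catch duplicate payments before money is sent."],
    },
    "manual-p2": {
        "caption": ["Bar chart of monthly help desk tickets by category."],
        "insight": ["Password resets dominate the workload, so self-service reset would reduce tickets."],
        "qa": ["Q: How can the help desk workload be reduced? "
               "A: Offer self-service password resets."],
    },
}

INDEX = build_index(PAGES)
print(retrieve("Why do invoices need several approvals?", INDEX))
```

In this toy data the caption of the first page says nothing about why approvals exist, so a caption-only index would have no terms in common with a "why" question; the generated insight and question-answer texts are what give such a query something to match. A real system would replace the stubs with actual models and a vector store.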

© 2025 The Japanese Society for Artificial Intelligence