Improving the Retrieval Accuracy of Multimodal RAG Systems by Generating Retrieval-Oriented Insights from Image-Containing Documents

福井 琢; 宗像 聡

doi:10.11517/pjsai.JSAI2025.0_4Q1GS1001

Abstract

When Retrieval-Augmented Generation (RAG) is applied to internal documents to improve business processes, the AI should ideally generate insights into the intent and purpose of tasks, then retrieve the relevant documents and answer from them. However, conventional RAG relies on the similarity between query and document embeddings, which makes it difficult to retrieve information from image-containing documents where such insights are not explicitly stated. Existing Multi-Representation Indexing methods, which convert image captions into embeddings, also lack this insight-generation capability. This study proposes a method that generates insight sentences from image-containing documents to enhance retrieval. Each document is decomposed page by page; for each page, an image caption is generated, followed by insight sentences and anticipated question-answer pairs. These generated texts are then converted into embeddings. Experiments on open datasets show that incorporating the generated insights improves retrieval accuracy compared with conventional approaches.
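The indexing flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `IndexEntry`, `build_index`, `retrieve`, and `fake_generate` are hypothetical, the caption, insight, and question-answer generation step is replaced by a stub that returns pre-written text, and a toy bag-of-words vector stands in for a real embedding model.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class IndexEntry:
    page_id: int
    kind: str        # "caption", "insight", or "qa"
    text: str
    vector: Counter  # toy embedding of `text`


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(pages, generate):
    """Index every page under several representations.

    `generate(page)` is a hypothetical callable returning a dict with
    "caption" (str), "insights" (list of str), and "qa_pairs"
    (list of (question, answer) tuples).
    """
    index = []
    for page_id, page in enumerate(pages):
        rep = generate(page)
        texts = [("caption", rep["caption"])]
        texts += [("insight", s) for s in rep["insights"]]
        texts += [("qa", f"{q} {a}") for q, a in rep["qa_pairs"]]
        for kind, text in texts:
            index.append(IndexEntry(page_id, kind, text, embed(text)))
    return index


def retrieve(index, query, top_k=1):
    """Rank pages by the best-matching representation of each page."""
    query_vec = embed(query)
    best = {}
    for entry in index:
        score = cosine(query_vec, entry.vector)
        best[entry.page_id] = max(best.get(entry.page_id, 0.0), score)
    ranked = sorted(best, key=lambda pid: best[pid], reverse=True)
    return ranked[:top_k]


def fake_generate(page):
    """Stub generator: the toy pages below already hold their generated text."""
    return page


pages = [
    {
        "caption": "Flowchart of the expense approval steps",
        "insights": ["The purpose is to prevent unauthorized spending"],
        "qa_pairs": [("Why is manager approval required?",
                      "To prevent unauthorized spending")],
    },
    {
        "caption": "Bar chart of monthly sales by region",
        "insights": ["The goal is to identify underperforming regions"],
        "qa_pairs": [("Which region needs support?",
                      "The region with the lowest sales")],
    },
]

index = build_index(pages, fake_generate)
top_pages = retrieve(index, "why do we need approval to prevent unauthorized spending")
```

In this toy setup the query shares no words with the first page's caption, so the match comes from the generated insight and question-answer text. That is the kind of gap the abstract says caption-only indexing leaves open.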
