Proceedings of the 38th Annual Conference of JSAI (2024)
Online ISSN: 2758-7347
Session ID: 1B3-GS-2-05

A method for improving the accuracy of a multi-domain adaptive vision-language model using prompt learning
*Zhenyu Gao, Ayako Yamagiwa, Masayuki Goto
Abstract

Methods for analyzing image data associated with linguistic information have attracted recent attention, but they face challenges when the amount of available data varies across image domains. In response, LADS was proposed: a model that can be trained without image data from domains with few samples by exploiting the shared embedding space between images and text in vision-language models. However, LADS typically relies on simple hand-written domain descriptions, and more suitable descriptions can improve model performance. To address this, we apply CoOp, a prompt-learning method that optimizes the domain description text for CLIP and thereby improves the accuracy of the vision-language model. We expect the learned prompts to represent the diverse domains within LADS more effectively than hand-written text. Finally, we validate the proposed method on real data, demonstrating its ability to mitigate imbalanced data quantities across image domains.
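As an illustration of the prompt-learning step described above, the following is a minimal sketch of CoOp-style context optimization for CLIP domain descriptions. It assumes PyTorch and OpenAI's CLIP package (github.com/openai/CLIP); identifiers such as DomainPromptLearner, n_ctx, and domain_names are illustrative assumptions, not names taken from the paper.

import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

class DomainPromptLearner(nn.Module):
    """CoOp-style learnable context vectors that replace a hand-written
    domain description such as "a photo of a {domain}" (sketch only)."""

    def __init__(self, clip_model, domain_names, n_ctx=4):
        super().__init__()
        dtype = clip_model.dtype
        ctx_dim = clip_model.ln_final.weight.shape[0]
        # Learnable context vectors, shared across all domains.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim, dtype=dtype) * 0.02)
        # Tokenize placeholder prompts; the "X" tokens are overwritten by self.ctx.
        prompts = [" ".join(["X"] * n_ctx) + " " + name + "." for name in domain_names]
        self.tokenized = torch.cat([clip.tokenize(p) for p in prompts])
        with torch.no_grad():
            emb = clip_model.token_embedding(self.tokenized).type(dtype)
        # Fixed parts of the sequence: start token and the domain-name suffix.
        self.register_buffer("prefix", emb[:, :1, :])           # SOS token
        self.register_buffer("suffix", emb[:, 1 + n_ctx:, :])   # name, ".", EOS, pad

    def forward(self):
        ctx = self.ctx.unsqueeze(0).expand(self.prefix.size(0), -1, -1)
        return torch.cat([self.prefix, ctx, self.suffix], dim=1)

def encode_prompts(clip_model, prompt_emb, tokenized):
    """Run CLIP's text transformer on pre-built prompt embeddings."""
    x = prompt_emb + clip_model.positional_embedding.type(clip_model.dtype)
    x = clip_model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)  # NLD <-> LND
    x = clip_model.ln_final(x).type(clip_model.dtype)
    # Take the feature at the EOS position, as CLIP's own text encoder does.
    eos = tokenized.argmax(dim=-1)
    return x[torch.arange(x.size(0)), eos] @ clip_model.text_projection

# Usage with hypothetical domain names: in a LADS-style pipeline, CLIP itself
# stays frozen and only learner.ctx is optimized.
model, _ = clip.load("ViT-B/32", device="cpu")
learner = DomainPromptLearner(model, ["sketch", "painting"])
text_features = encode_prompts(model, learner(), learner.tokenized)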

© 2024 The Japanese Society for Artificial Intelligence