Paper ID: 2025EDP7083
Prompt learning automates the manual crafting of prompts used to adapt vision-and-language models to downstream tasks, particularly in few-shot scenarios. This paper addresses two key challenges in prompt learning: limited performance in one-shot settings and inefficient dataset construction from unlabeled data. To tackle these challenges, we visualize and compare CLIP's feature spaces after prompt learning under one-shot and 16-shot conditions, identifying the characteristics a feature space must have to yield better prompts. We propose two novel loss functions, Inclusive Loss and Exclusive Loss, that improve accuracy in one-shot scenarios by encouraging the feature space to resemble those trained with sufficient data. Additionally, we investigate the distribution of image features within CLIP's feature space and introduce a sampling method called Cluster-Centroid Sampling (CCS). CCS constructs a more category-balanced dataset by selecting the samples closest to cluster centroids. To validate our approaches, we conduct extensive experiments. First, we demonstrate the effectiveness of the proposed loss functions across multiple datasets, showing accuracy improvements under one-shot conditions. Second, we evaluate CCS on an unlabeled data pool, confirming that it surpasses existing sampling methods in downstream task accuracy owing to the construction of a more balanced dataset.
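The following is a minimal sketch of the centroid-based selection idea described above, not the paper's exact procedure. It assumes k-means clustering over precomputed CLIP image features, one selected sample per cluster, and that the `features` array and the `cluster_centroid_sampling` function name are illustrative placeholders.

```python
# Illustrative sketch of cluster-centroid sampling over CLIP image features.
# Assumptions (not specified in the abstract): k-means clustering, the number
# of clusters equals the number of samples to select, and features are
# precomputed (e.g., L2-normalized outputs of a CLIP image encoder).

import numpy as np
from sklearn.cluster import KMeans


def cluster_centroid_sampling(features: np.ndarray, n_samples: int) -> np.ndarray:
    """Return indices of the samples nearest to each cluster centroid.

    features  : (N, D) array of image features.
    n_samples : number of samples to select, one per cluster.
    """
    kmeans = KMeans(n_clusters=n_samples, n_init=10, random_state=0).fit(features)
    selected = []
    for c in range(n_samples):
        # Members of cluster c and their distances to the cluster centroid.
        members = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - kmeans.cluster_centers_[c], axis=1)
        # Keep the member closest to the centroid as the representative sample.
        selected.append(members[np.argmin(dists)])
    return np.asarray(selected)
```

Selecting one representative per cluster is one plausible way to realize the category balance the abstract attributes to CCS, since clusters of CLIP image features tend to align with semantic categories.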