Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
In image captioning, automatic evaluation metrics that align closely with human judgment are crucial for effective model development. A key challenge is handling hallucinations, in which a model generates words unrelated to the image. Existing metrics often fail to handle hallucinations, primarily because they have limited ability to compare a candidate caption against a diverse set of reference captions. To overcome this, we propose DENEB, a novel image captioning metric that is robust to hallucinations. DENEB incorporates the Sim-Vec Transformer, a mechanism that processes multiple references and extracts similarity vectors effectively. To train DENEB, we extended the Polaris dataset to create Polaris 2.0, an expanded dataset for training supervised automatic evaluation metrics. The dataset comprises 32,978 images and 32,978 human judgments from 805 annotators. Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and Polaris 2.0, demonstrating its effectiveness and robustness to hallucinations.
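As a rough illustration of what extracting similarity vectors against multiple references could look like, the sketch below compares one candidate embedding with each reference embedding. The abstract does not specify how the Sim-Vec Transformer builds its similarity vectors, so the choice of element-wise product plus absolute difference, the function names `similarity_vector` and `multi_reference_features`, and the toy three-dimensional embeddings are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def similarity_vector(cand: Sequence[float], ref: Sequence[float]) -> List[float]:
    """Build one similarity vector for a (candidate, reference) pair.

    Assumption: the vector is the element-wise product concatenated with
    the element-wise absolute difference. This is a common choice for
    learned metrics, not a detail confirmed by the abstract.
    """
    if len(cand) != len(ref):
        raise ValueError("embeddings must have the same dimension")
    product = [c * r for c, r in zip(cand, ref)]
    abs_diff = [abs(c - r) for c, r in zip(cand, ref)]
    return product + abs_diff


def multi_reference_features(
    cand: Sequence[float], refs: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Return one similarity vector per reference caption.

    A downstream model (hypothetically, a transformer encoder) could then
    attend over this list to produce a single score.
    """
    if not refs:
        raise ValueError("at least one reference is required")
    return [similarity_vector(cand, ref) for ref in refs]


# Toy embeddings (hypothetical values, not real caption features).
candidate = [1.0, 0.0, 2.0]
references = [
    [1.0, 0.0, 2.0],  # identical to the candidate
    [0.0, 1.0, 2.0],  # differs in the first two dimensions
]

features = multi_reference_features(candidate, references)
for row in features:
    print(row)
```

Keeping one vector per reference, rather than averaging the references first, preserves the disagreement between references, which is the kind of information a multi-reference metric would need in order to notice that a candidate matches none of them.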