Host: The Japanese Society for Artificial Intelligence
Name: The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 38
Location: [in Japanese]
Date: May 28, 2024 - May 31, 2024
Multimodal models, which combine and reason over information from multiple modalities such as images and text, have been proposed in recent years and achieve high performance on a variety of multimodal reasoning tasks. In this study, we focus on one such task, Visual-Textual Entailment (VTE), in which a model predicts the entailment relationship between an image and a sentence. VTE is well suited to measuring a model's multimodal reasoning ability because solving it requires understanding the information in the image and the meaning of the sentence, and combining the two for inference. However, the extent to which multimodal models capture linguistic phenomena such as quantity and negation, and how well they reason in languages other than English, have yet to be thoroughly evaluated. We therefore propose two multilingual VTE benchmarks focusing on these linguistic phenomena and use them to evaluate two multimodal models. The results show that the models struggle with reasoning in Japanese compared with the other languages, and that their understanding of quantity and negation in sentences leaves room for improvement.
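To make the VTE setup concrete, the sketch below shows one crude zero-shot baseline for the task: scoring an image-sentence pair with CLIP similarity and mapping the score to a three-way entailment / neutral / contradiction label by thresholds. This is a minimal illustration, not the models or method evaluated in the paper; the model checkpoint, threshold values, and file path are illustrative assumptions.

```python
# Minimal sketch of a zero-shot VTE-style baseline (illustrative only,
# not the paper's method): use CLIP image-text similarity and map the
# score to a 3-way label via thresholds. Thresholds are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def vte_predict(image: Image.Image, sentence: str,
                t_entail: float = 25.0, t_contra: float = 15.0) -> str:
    """Return 'entailment', 'neutral', or 'contradiction' for one pair."""
    inputs = processor(text=[sentence], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image has shape [1, 1] for a single pair
        score = model(**inputs).logits_per_image.item()
    if score >= t_entail:
        return "entailment"
    if score <= t_contra:
        return "contradiction"
    return "neutral"

# Example usage (hypothetical image path):
# label = vte_predict(Image.open("example.jpg"), "Two dogs are running.")
```

A similarity-threshold baseline of this kind cannot, by construction, distinguish phenomena such as negation or quantity that invert or qualify the entailment relation, which is precisely the gap the proposed benchmarks are designed to probe.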