Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 4C3-GS-11-04

Multilingual Visual-Textual Entailment Benchmark with Diverse Linguistic Phenomena
*Nobuyuki Iokawa, Gijs Wijnholds, Hitomi Yanaka

Abstract

Multimodal models, which combine and reason over information from multiple modalities such as images and text, have recently been proposed and have achieved high performance on various multimodal reasoning tasks. In this study, we focus on one such task, Visual-Textual Entailment (VTE), which predicts the entailment relationship between an image and a sentence. VTE is well suited to measuring a model's multimodal reasoning ability because solving it requires understanding the information in the image and the meaning of the sentence and combining them for inference. However, the extent to which multimodal models capture linguistic phenomena such as quantity and negation, as well as their reasoning ability in languages other than English, has yet to be thoroughly evaluated. This study therefore proposes two multilingual VTE benchmarks focusing on linguistic phenomena, and we evaluate two multimodal models on them. The results show that the models struggle with reasoning in Japanese compared to the other languages, and that there is room for improvement in their understanding of quantity and negation in sentences.

© 2024 The Japanese Society for Artificial Intelligence