Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 4C3-GS-11-04

Multilingual Visual-Textual Entailment Benchmark with Diverse Linguistic Phenomena
*Nobuyuki Iokawa, Gijs Wijnholds, Hitomi Yanaka

Abstract

Multimodal models, which combine and reason over information from multiple modalities such as images and text, have recently been proposed and have achieved high performance on various multimodal reasoning tasks. In this study, we focus on one such task, Visual-Textual Entailment (VTE), which predicts the entailment relationship between an image and a sentence. VTE is well suited to measuring a model's multimodal reasoning ability because solving it requires understanding the information in the image and the meaning of the sentence and combining them for inference. However, the extent to which multimodal models capture linguistic phenomena such as quantity and negation, as well as their reasoning ability in languages other than English, has yet to be thoroughly evaluated. This study therefore proposes two multilingual VTE benchmarks focusing on linguistic phenomena, and we evaluate two multimodal models on them. The results show that the models struggle with reasoning in Japanese compared to the other languages, and that there is room for improvement in their understanding of quantity and negation in sentences.

© 2024 The Japanese Society for Artificial Intelligence