Host: The Japanese Society for Artificial Intelligence
Name: The 37th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 37
Location: [in Japanese]
Date: June 06, 2023 - June 09, 2023
Inference between different modalities has been actively studied in recent years. We focus on Visual-textual Entailment (VTE), one of the most critical tasks for multimodal inference. A variety of deep learning-based approaches have been proposed for the VTE task, but they struggle to handle numerals accurately. In contrast, approaches based on logical inference can deal with numerals successfully. However, because previous logic-based approaches rely on automated theorem provers, their computational cost increases significantly for problems involving many entities. In this paper, we propose a logic-based VTE system with model checking and knowledge injection. We create a dataset for the VTE task containing numerals and negation to evaluate the extent to which VTE systems correctly understand those phenomena. Using this dataset, we show that our system solves the VTE task with numerals and negation more robustly than previous approaches.
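
The idea of checking a hypothesis against a finite model, rather than proving it, can be illustrated with a minimal Python sketch. This is not the system described in the paper: the `Entity` class, the `inject_knowledge` helper, the hypernym table, and the closed-world reading of negation are all assumptions made for illustration. The model stands in for entities detected in an image, and each hypothesis is evaluated directly by counting over that model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A hypothetical entity extracted from an image, with its labels."""
    name: str
    labels: frozenset


def inject_knowledge(entities, hypernyms):
    """Add every label implied by a (hypothetical) hypernym table."""
    enriched = []
    for entity in entities:
        labels = set(entity.labels)
        frontier = list(labels)
        while frontier:
            label = frontier.pop()
            for parent in hypernyms.get(label, ()):
                if parent not in labels:
                    labels.add(parent)
                    frontier.append(parent)
        enriched.append(Entity(entity.name, frozenset(labels)))
    return enriched


def count(model, label):
    """Number of entities in the model that carry the given label."""
    return sum(1 for entity in model if label in entity.labels)


# Hypotheses are functions from a model to a truth value.
def at_least(n, label):
    return lambda model: count(model, label) >= n


def exactly(n, label):
    return lambda model: count(model, label) == n


def negation(formula):
    # Assumption: closed world, so anything not in the model is absent.
    return lambda model: not formula(model)


def check(model, formula):
    """Evaluate the hypothesis on the finite model (no theorem proving)."""
    return "entailment" if formula(model) else "non-entailment"


# Toy image content: two dogs and one cat.
image_entities = [
    Entity("d1", frozenset({"dog"})),
    Entity("d2", frozenset({"dog"})),
    Entity("c1", frozenset({"cat"})),
]
hypernyms = {"dog": ["animal"], "cat": ["animal"]}
model = inject_knowledge(image_entities, hypernyms)

print(check(model, exactly(3, "animal")))           # "There are three animals."
print(check(model, negation(at_least(1, "bird"))))  # "There is no bird."
print(check(model, at_least(3, "dog")))             # "There are at least three dogs."
```

Each check in this sketch is a single pass over the entities, which hints at why direct evaluation on a finite model can scale better with the number of entities than a proof search. The actual representation and knowledge sources used by the system are not specified in the abstract.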