Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
Using Linguistic Formalism to Improve Real World Understanding for V&L Models: Case Study on Image Discrimination for Structurally Ambiguous Language Input
Lee Sangmyeong, Seitaro Shinagawa, Koichiro Yoshino, Satoshi Nakamura
JOURNAL FREE ACCESS

2025 Volume 32 Issue 2 Pages 598-632

Abstract

In Real World Understanding (RWU) for vision and language (V&L) models, accurately aligning language with the corresponding visual scene is critical. Because current models typically take plain text as language input, RWU is exposed to structural ambiguity, where a single sentence has multiple meanings because it admits more than one phrase structure. This paper proposes using linguistic formalism as input, which enriches the language information and addresses structural ambiguity. We focus on the Contrastive Language-Image Pre-training (CLIP) model, a prominent V&L model, and on the image discrimination task of RWU. Our experiments test several approaches to incorporating formalism into CLIP, which differ in the type of formalism and in how it is processed. We aim to determine how effective formalism is in discriminating ambiguous images and to identify which formalism works best. Additionally, we employ a gradient-based method to gain insight into how the model’s architecture interprets formalism.

© 2025 The Association for Natural Language Processing