自然言語処理 (Journal of Natural Language Processing)
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
Using Linguistic Formalism to Improve Real World Understanding for V&L Models: Case Study on Image Discrimination for Structurally Ambiguous Language Input
Lee Sangmyeong, Seitaro Shinagawa, Koichiro Yoshino, Satoshi Nakamura
Journal, free access

2025, Volume 32, Issue 2, pp. 598-632

Abstract

In Real World Understanding (RWU) for vision and language (V&L) models, accurately aligning language with the corresponding visual scene is critical. Because current models typically take plain text as language input, RWU is exposed to structural ambiguity, where a single sentence has multiple meanings because it admits more than one phrase structure. This paper proposes using linguistic formalism as input, which enriches the language information and addresses structural ambiguity. We focus on the Contrastive Language-Image Pre-training (CLIP) model, a prominent V&L model, and on the image discrimination tasks of RWU. Our experiments test several ways of incorporating formalism into CLIP, varying the type of formalism and how it is processed. We aim to determine how effective formalism is in discriminating ambiguous images and to identify which formalism works best. Additionally, we employ a gradient-based method to gain insight into how formalism is interpreted within the model's architecture.

© 2025 The Association for Natural Language Processing