Paper ID: 2024EDP7313
Joint multimodal aspect-based sentiment analysis (JMABSA) aims to extract aspects from multimodal inputs and determine their sentiment polarity. Existing research often struggles to effectively align aspect features across images and text. To address this, we propose an entity knowledge-guided image-text alignment network that integrates alignment across both modalities, enabling the model to more accurately capture aspect and sentiment information expressed jointly in images and text. Specifically, we introduce an entity class embedding to guide the model in learning entity-related features from text. Additionally, we utilize scene and aspect descriptions of images as entity knowledge, helping the model learn entity-relevant features from visual input. Aligning this image-derived entity knowledge with the original text further supports the model in learning consistent aspect and sentiment expressions across modalities. Experimental results on two public benchmark datasets demonstrate that our method achieves state-of-the-art performance.
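As a rough illustration of the alignment idea described above, the following is a minimal PyTorch-style sketch showing how a learnable entity class embedding might guide pooled text features, and how image-side entity knowledge (encoded scene and aspect descriptions) might be contrastively aligned with the guided text representation. All names, dimensions, and the contrastive objective here are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch of entity-guided image-text alignment.
# Class/attribute names (EntityGuidedAlignment, entity_class_embed, ...)
# are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityGuidedAlignment(nn.Module):
    def __init__(self, dim=768, num_entity_classes=4):
        super().__init__()
        # One learnable embedding per entity class, used to steer the
        # text representation toward entity-related features (assumption).
        self.entity_class_embed = nn.Embedding(num_entity_classes, dim)
        self.text_proj = nn.Linear(dim, dim)
        self.image_proj = nn.Linear(dim, dim)

    def forward(self, text_feats, image_knowledge_feats, entity_class_ids):
        # text_feats:            (B, dim) pooled text features
        # image_knowledge_feats: (B, dim) encoded scene/aspect descriptions
        # entity_class_ids:      (B,)     entity class index per example
        guided_text = self.text_proj(
            text_feats + self.entity_class_embed(entity_class_ids)
        )
        image_side = self.image_proj(image_knowledge_feats)

        # Contrastive alignment between image-derived entity knowledge
        # and the entity-guided text representation.
        t = F.normalize(guided_text, dim=-1)
        v = F.normalize(image_side, dim=-1)
        logits = t @ v.t() / 0.07  # temperature-scaled cosine similarity
        targets = torch.arange(t.size(0))
        align_loss = (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets)) / 2
        return align_loss
```

In this sketch the alignment loss would be added to the usual aspect extraction and sentiment classification objectives; the actual fusion and decoding components of the proposed network are not shown.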