Article ID: 2024EDP7173
Multi-modal entity alignment (MMEA) aims to determine whether two multi-modal entities from different knowledge graphs refer to the same real-world object. This alignment is a key technique in knowledge graph fusion, which enriches the coverage and comprehensiveness of the knowledge base. Existing mainstream MMEA models mostly rely on graph convolutional networks and pre-trained visual models to extract the structural and visual features of entities, and then fuse these features to compare entity similarity. However, because the multi-modal information in knowledge graphs is often of low quality, relying solely on conventional visual feature extraction and on visual and structural features alone may leave the resulting multi-modal joint embeddings with insufficient semantic information, which can hinder the accuracy and effectiveness of multi-modal entity alignment. To address these issues, we propose MSEEA, a Multi-modal Entity Alignment method based on Multidimensional Semantic Extraction. First, MSEEA fine-tunes a large language model on preprocessed entity relationship triples, strengthening its ability to analyze the latent semantic information embedded in structural triples and to generate contextually rich entity descriptions. Second, MSEEA combines multiple advanced models and systems to extract multidimensional semantic information from the visual modality, avoiding the feature quality degradation that can occur when only pre-trained visual models are used. Finally, MSEEA integrates the different modal embeddings of each entity into a multi-modal representation and compares their similarities. We conducted experiments on FB15K-DB15K/YAGO15K, and the results show that MSEEA outperforms traditional approaches, achieving state-of-the-art results.
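The final fusion-and-comparison step described above can be illustrated with a minimal sketch. The weighted concatenation, the fixed modality weights, and the function names below are illustrative assumptions for exposition, not MSEEA's actual fusion scheme.

```python
import torch
import torch.nn.functional as F

def fuse_modalities(struct_emb, visual_emb, text_emb, weights=(0.4, 0.3, 0.3)):
    """Fuse per-modality entity embeddings into a joint representation by
    L2-normalizing each modality and concatenating them with fixed weights
    (an assumed scheme; the paper's fusion may differ)."""
    parts = [w * F.normalize(emb, p=2, dim=-1)
             for emb, w in zip((struct_emb, visual_emb, text_emb), weights)]
    return F.normalize(torch.cat(parts, dim=-1), p=2, dim=-1)

def rank_alignments(src_joint, tgt_joint, top_k=10):
    """Rank candidate alignments by cosine similarity between the joint
    embeddings of source-KG and target-KG entities (rows index entities)."""
    sim = src_joint @ tgt_joint.T            # cosine similarity, since rows are unit-norm
    return sim.topk(top_k, dim=-1).indices   # top-k target candidates per source entity

# Toy usage: 5 source and 7 target entities with 64-dim modality embeddings.
if __name__ == "__main__":
    torch.manual_seed(0)
    src = fuse_modalities(torch.randn(5, 64), torch.randn(5, 64), torch.randn(5, 64))
    tgt = fuse_modalities(torch.randn(7, 64), torch.randn(7, 64), torch.randn(7, 64))
    print(rank_alignments(src, tgt, top_k=3))
```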