Article ID: 2024EDP7161
Multimodal Emotion Recognition (MER) is a critical task in sentiment analysis. Current methods focus primarily on multimodal fusion and the representation of emotions, but they fail to effectively capture the collaborative interactions among modalities. In this study, we propose an MER model with intra-modal enhancement and inter-modal interaction (IEII). First, the model extracts emotion information from the text, audio, and video modalities using RoBERTa, openSMILE, and DenseNet architectures, respectively. It introduces a Large Enhanced Kernel Attention (LEKA) module, which uses a simplified attention mechanism with large convolutional kernels to enhance intra-modal emotional information and align the modalities effectively. A multimodal representation space, constructed with Transformer encoders, is then proposed to explore inter-modal interactions. Finally, the model employs a Dual-Branch Multimodal Attention Fusion (DMAF) module based on grouped-query attention and rapid attention mechanisms; the DMAF module integrates the multimodal emotion representations and performs the final emotion recognition. Experimental results show that the model achieves higher overall accuracy and F1-scores on the IEMOCAP and MELD datasets than existing methods, demonstrating that the proposed model effectively enhances intra-modal emotional information and captures inter-modal interactions.
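The abstract does not specify LEKA's internals, so the following is only a minimal sketch of the general idea of "simplified attention with large convolutional kernels", modeled on the common large-kernel attention pattern (depthwise, dilated depthwise, and pointwise convolutions whose output gates the input features). The class name, PyTorch framework, kernel sizes, and gating design are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Illustrative large-kernel attention block (hypothetical LEKA stand-in).

    A large receptive field is approximated by decomposing a big kernel into
    a local depthwise conv, a dilated depthwise conv, and a pointwise conv;
    the result serves as an attention map that gates the input features.
    """

    def __init__(self, channels: int, kernel_size: int = 21, dilation: int = 3):
        super().__init__()
        local_k = 2 * dilation - 1          # small local depthwise kernel
        dilated_k = kernel_size // dilation  # dilated depthwise kernel
        self.dw_local = nn.Conv1d(channels, channels, local_k,
                                  padding=local_k // 2, groups=channels)
        self.dw_dilated = nn.Conv1d(channels, channels, dilated_k,
                                    padding=(dilated_k // 2) * dilation,
                                    dilation=dilation, groups=channels)
        self.pw = nn.Conv1d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, sequence_length) unimodal feature sequence
        attn = self.pw(self.dw_dilated(self.dw_local(x)))
        return x * attn  # simplified attention: element-wise gating, no softmax

if __name__ == "__main__":
    feats = torch.randn(4, 256, 50)            # e.g. 50 time steps of 256-d features
    enhanced = LargeKernelAttention(256)(feats)
    print(enhanced.shape)                      # torch.Size([4, 256, 50])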