Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
39th (2025)
Session ID : 3G5-GS-6-05

Proposal of a Multimodal Emotion Recognition Model Based on the Fusion of Audio and Text
*Yue TAN, Jiazheng ZHOU, Kazuyuki MATSUMOTO, Xin KANG, Minoru YOSHIDA
Abstract

Multimodal emotion recognition is a technology that integrates multiple modalities—such as audio, text, and images—to more comprehensively and accurately identify and analyze human emotions. In the field of AI-driven dialogue systems, it has become an indispensable technology for facilitating smooth interactions. By fusing data from different modalities, such as audio and text, it is possible to account for inter-modal interactions and correlations that are not captured in single-modal emotion analysis, thereby improving both the generalizability and accuracy of emotion recognition.

In this study, we constructed a multimodal emotion analysis model based on the Transformer architecture, which takes audio and text as inputs. By concatenating the outputs of the respective Transformer encoders for audio and text and then applying a self-attention mechanism to the concatenated representation, our model can fuse these modalities while preserving their cross-modal relationships. In this paper, we conduct comparative evaluation experiments against multiple existing methods on CMU-MOSEI, a standard dataset for emotion recognition tasks, to validate the performance of the proposed model and confirm the advantages of multimodal fusion for emotion recognition.
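The following is a minimal PyTorch sketch of the fusion strategy the abstract describes: unimodal Transformer encoders for audio and text, concatenation of their output sequences, and a self-attention layer over the joint sequence before classification. All dimensions, layer counts, and the mean-pooling/classifier head are illustrative assumptions, not the authors' reported configuration.

```python
import torch
import torch.nn as nn

class AudioTextFusionSketch(nn.Module):
    """Illustrative audio-text fusion: encode each modality separately,
    concatenate along the sequence axis, then apply self-attention so
    audio tokens can attend to text tokens and vice versa."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_emotions=6):
        super().__init__()
        # Separate unimodal Transformer encoders (layers are cloned internally).
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Self-attention over the concatenated representation (the fusion step).
        self.fusion_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, T_audio, d_model); text_feats: (batch, T_text, d_model)
        a = self.audio_encoder(audio_feats)
        t = self.text_encoder(text_feats)
        fused_in = torch.cat([a, t], dim=1)               # concatenate modality sequences
        fused, _ = self.fusion_attention(fused_in, fused_in, fused_in)
        pooled = fused.mean(dim=1)                        # simple mean pooling (assumption)
        return self.classifier(pooled)                    # per-emotion logits

if __name__ == "__main__":
    model = AudioTextFusionSketch()
    audio = torch.randn(2, 50, 256)   # e.g. frame-level acoustic features
    text = torch.randn(2, 30, 256)    # e.g. token-level text embeddings
    print(model(audio, text).shape)   # torch.Size([2, 6])
```

The output dimension of 6 here simply mirrors the six emotion categories annotated in CMU-MOSEI; the actual prediction head and pooling used in the paper may differ.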

© 2025 The Japanese Society for Artificial Intelligence