Host: The Japanese Society for Artificial Intelligence
Name: The 39th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 39
Location: [in Japanese]
Date: May 27, 2025 - May 30, 2025
Multimodal emotion recognition is a technology that integrates multiple modalities—such as audio, text, and images—to identify and analyze human emotions more comprehensively and accurately. In the field of AI-driven dialogue systems, it has become an indispensable technology for facilitating smooth interactions. By fusing data from different modalities, such as audio and text, it is possible to account for inter-modal interactions and correlations that single-modal emotion analysis cannot capture, thereby improving both the generalizability and accuracy of emotion recognition.

In this study, we construct a Transformer-based multimodal emotion analysis model that takes audio and text as inputs. By concatenating the outputs of the respective Transformer encoders for audio and text and then applying a self-attention mechanism to the concatenated representation, the model fuses the two modalities while preserving their cross-modal relationships. In this paper, we conduct comparative evaluation experiments against multiple existing methods on CMU-MOSEI, a standard dataset for emotion recognition tasks, to validate the performance of the proposed model and confirm the advantages of multimodal fusion for emotion recognition.
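The following is a minimal sketch of the concatenate-then-self-attend fusion step described above, written in PyTorch. The feature dimensions, the use of `nn.TransformerEncoderLayer` as the fusion layer, the mean pooling, and the six-class output head are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of cross-modal fusion by concatenation + self-attention.
# All hyperparameters and the classification head are hypothetical.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Concatenate audio/text encoder outputs and fuse them with self-attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_classes: int = 6):
        super().__init__()
        # Self-attention over the concatenated token sequence, so audio tokens
        # can attend to text tokens and vice versa.
        self.fusion_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, audio_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, T_audio, d_model), text_feats: (batch, T_text, d_model)
        fused = torch.cat([audio_feats, text_feats], dim=1)  # concatenate along time axis
        fused = self.fusion_layer(fused)                     # cross-modal self-attention
        pooled = fused.mean(dim=1)                           # simple mean pooling
        return self.classifier(pooled)                       # per-emotion logits


if __name__ == "__main__":
    model = CrossModalFusion()
    audio = torch.randn(2, 50, 256)  # dummy audio encoder outputs
    text = torch.randn(2, 30, 256)   # dummy text encoder outputs
    print(model(audio, text).shape)  # torch.Size([2, 6])
```

Because the two token sequences are concatenated before attention, every audio token can attend to every text token (and vice versa) in a single layer; how the actual model pools and classifies the fused sequence is not specified in the abstract.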