Article ID: 2024EDL8034
An increasing number of speech emotion recognition tasks rely on the joint analysis of speech and text features, yet little research has explored how large language models such as GPT-3 can be leveraged to enhance emotion recognition. In this study, we use the GPT-3 model to extract semantic information from transcribed text, producing 1536-dimensional text-modal features. We then fuse these text features with 1188-dimensional acoustic features to obtain multi-modal recognition results. The proposed method achieves a weighted accuracy of 79.62% on the four emotion categories of the IEMOCAP corpus, demonstrating that integrating a large language model considerably improves emotion recognition accuracy.
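As a minimal sketch of the pipeline described above, the fusion step might look like the following. This assumes the 1536-dimensional text features come from OpenAI's text-embedding-ada-002 endpoint (which produces 1536-dimensional vectors; the abstract does not name the exact embedding model), treats the 1188-dimensional acoustic feature vector as supplied by an unspecified acoustic front end, and uses simple concatenation as the fusion strategy, since the exact fusion architecture and classifier are not specified here.

```python
# Minimal sketch: GPT-3-based text embedding + acoustic feature fusion.
# Assumptions (not specified in the article): text features come from
# OpenAI's text-embedding-ada-002 model, the 1188-dim acoustic vector is
# computed elsewhere, and fusion is plain concatenation.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def text_features(transcript: str) -> np.ndarray:
    """Return a 1536-dimensional semantic embedding of the transcript."""
    resp = client.embeddings.create(
        model="text-embedding-ada-002",  # assumption: yields 1536 dims
        input=transcript,
    )
    return np.asarray(resp.data[0].embedding, dtype=np.float32)


def fuse(text_vec: np.ndarray, acoustic_vec: np.ndarray) -> np.ndarray:
    """Early fusion by concatenation: 1536 + 1188 = 2724 dims."""
    assert text_vec.shape == (1536,) and acoustic_vec.shape == (1188,)
    return np.concatenate([text_vec, acoustic_vec])


# Placeholder acoustic vector; in practice this would come from an
# acoustic feature extractor applied to the corresponding utterance.
acoustic_vec = np.zeros(1188, dtype=np.float32)
fused = fuse(text_features("I can't believe this happened!"), acoustic_vec)
print(fused.shape)  # (2724,) -- input to the downstream emotion classifier
```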