IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532


Multimodal Speech Emotion Recognition Based on Large Language Model
Congcong FANG, Yun JIN, Guanlin CHEN, Yunfan ZHANG, Shidang LI, Yong MA, Yue XIE
Advance online publication

Article ID: 2024EDL8034

Abstract

Currently, an increasing number of speech emotion recognition systems analyze both speech and text features. However, little research has explored how large language models such as GPT-3 can be leveraged to enhance emotion recognition. In this study, we use the GPT-3 model to extract semantic information from transcribed texts, producing 1536-dimensional text-modality features. We then fuse these text features with 1188-dimensional acoustic features to obtain multi-modal recognition results. The proposed method achieves a weighted accuracy of 79.62% on the four emotion categories of the IEMOCAP corpus, demonstrating that integrating large language models yields a considerable improvement in emotion recognition accuracy.
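The abstract does not specify the implementation, but the pipeline it describes can be illustrated with a minimal sketch. The assumptions here are ours, not the paper's: that the 1536-dimensional text features come from an OpenAI embedding endpoint (text-embedding-ada-002 happens to return 1536-dimensional vectors, matching the stated size), that the 1188-dimensional acoustic features are precomputed elsewhere, and that fusion is plain concatenation feeding a small classifier.

```python
# Minimal sketch of the fusion pipeline described in the abstract.
# Assumptions (not confirmed by the paper): text features come from the
# OpenAI embeddings API (text-embedding-ada-002 -> 1536 dims), acoustic
# features (1188 dims) are precomputed, and fusion is concatenation + MLP.

import numpy as np
import torch
import torch.nn as nn
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def text_features(transcript: str) -> np.ndarray:
    """Embed a transcript into a 1536-dimensional vector."""
    resp = client.embeddings.create(
        model="text-embedding-ada-002", input=transcript
    )
    return np.asarray(resp.data[0].embedding, dtype=np.float32)  # (1536,)

class FusionClassifier(nn.Module):
    """Concatenate text (1536) and acoustic (1188) features,
    then classify into four emotion categories."""
    def __init__(self, text_dim=1536, acoustic_dim=1188, num_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + acoustic_dim, 512),  # 2724 fused dims in
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, text_feat, acoustic_feat):
        fused = torch.cat([text_feat, acoustic_feat], dim=-1)  # (batch, 2724)
        return self.net(fused)

# Usage with a placeholder acoustic vector (a real pipeline would extract
# acoustic features from the IEMOCAP audio):
model = FusionClassifier()
t = torch.from_numpy(text_features("I can't believe this happened!")).unsqueeze(0)
a = torch.randn(1, 1188)  # hypothetical acoustic feature vector
logits = model(t, a)      # (1, 4) scores over the four emotion classes
```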

© 2024 The Institute of Electronics, Information and Communication Engineers