2025 Volume E108.D Issue 7 Pages 841-844
Speech Emotion Recognition (SER) plays a pivotal role in human-computer interaction, yet its performance is often hindered by the nonlinear entanglement of emotional and speaker features. This paper proposes an interpretable multi-level feature disentanglement algorithm for speech emotion recognition, aiming to effectively separate emotional features from speaker-specific characteristics in individual speech. The algorithm first constructs a novel hybrid auto-encoder network that separates static and dynamic emotional features from the representations extracted by the self-supervised network emotion2vec, thereby obtaining multi-level, time-varying emotional feature representations. Additionally, we implement a multi-layer classifier based on Kolmogorov-Arnold Networks (KAN), which is adept at capturing complex nonlinear relationships in the data and further promotes feature disentanglement. Experimental results on the IEMOCAP database show that our proposed algorithm achieves a weighted accuracy (WA) of 73.2%, surpassing the current state-of-the-art.
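The core idea of separating static (time-invariant) and dynamic (time-varying) components from frame-level embeddings can be illustrated with a minimal sketch. Note this mean/residual decomposition is only an illustrative stand-in: the paper's actual hybrid auto-encoder is learned, and its architecture is not specified in the abstract. The frame and embedding dimensions below are hypothetical placeholders.

```python
import numpy as np

def split_static_dynamic(frames: np.ndarray):
    """Toy decomposition of frame-level embeddings of shape (T, D) into a
    static (utterance-level) vector and a dynamic (per-frame) residual.
    A stand-in for the paper's learned hybrid auto-encoder, not its method."""
    static = frames.mean(axis=0)   # one time-invariant vector per utterance
    dynamic = frames - static      # zero-mean-over-time frame deviations
    return static, dynamic

# Hypothetical emotion2vec-like features: 50 frames of 768-dim embeddings
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 768))
static, dynamic = split_static_dynamic(frames)

# Sanity checks: the residual is zero-mean over time, and the two
# components exactly reconstruct the input.
assert np.allclose(dynamic.mean(axis=0), 0.0)
assert np.allclose(static + dynamic, frames)
print(static.shape, dynamic.shape)
```

In the proposed algorithm, a learned auto-encoder would replace this fixed decomposition, and the resulting multi-level features would feed the KAN-based classifier.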