2025 Volume E108.D Issue 7 Pages 841-844
Speech Emotion Recognition (SER) plays a pivotal role in human-computer interaction, yet its performance is often hindered by the nonlinear entanglement of emotional and speaker features. This paper proposes an interpretable multi-level feature disentanglement algorithm for speech emotion recognition, aiming to effectively separate emotional features from speaker-specific characteristics in individual speech. The algorithm first constructs a novel hybrid auto-encoder network that separates static and dynamic emotional features from the representations extracted by the self-supervised network emotion2vec, thereby obtaining multi-level, time-varying emotional feature representations. Additionally, we implement a multi-layer classifier based on Kolmogorov-Arnold Networks (KAN), which is adept at capturing complex nonlinear relationships in the data and further promotes feature disentanglement. Experimental results on the IEMOCAP database show that our proposed algorithm achieves a weighted accuracy (WA) of 73.2%, surpassing the current state-of-the-art.
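The core idea of separating static (time-invariant) and dynamic (time-varying) components from frame-level embeddings can be illustrated with a minimal sketch. Note this mean/residual decomposition is only an illustrative stand-in: the paper's actual hybrid auto-encoder is learned, and its architecture is not specified in the abstract. The frame and embedding dimensions below are hypothetical placeholders.

```python
import numpy as np

def split_static_dynamic(frames: np.ndarray):
    """Toy decomposition of frame-level embeddings of shape (T, D) into a
    static (utterance-level) vector and a dynamic (per-frame) residual.
    A stand-in for the paper's learned hybrid auto-encoder, not its method."""
    static = frames.mean(axis=0)   # one time-invariant vector per utterance
    dynamic = frames - static      # zero-mean-over-time frame deviations
    return static, dynamic

# Hypothetical emotion2vec-like features: 50 frames of 768-dim embeddings
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 768))
static, dynamic = split_static_dynamic(frames)

# Sanity checks: the residual is zero-mean over time, and the two
# components exactly reconstruct the input.
assert np.allclose(dynamic.mean(axis=0), 0.0)
assert np.allclose(static + dynamic, frames)
print(static.shape, dynamic.shape)
```

In the proposed algorithm, a learned auto-encoder would replace this fixed decomposition, and the resulting multi-level features would feed the KAN-based classifier.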