IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences
Online ISSN : 1745-1337
Print ISSN : 0916-8508

A final published version of this article is available; please refer to the final published version. When citing, please cite the final published version as well.

A Multitask Learning Approach Based on Cascaded Attention Network And Self-Adaption Loss for Speech Emotion Recognition
Yang LIU, Yuqi XIA, Haoqin SUN, Xiaolei MENG, Jianxiong BAI, Wenbo GUAN, Zhen ZHAO, Yongwei LI
Journal / Free access / Advance online publication

Article ID: 2022EAP1091

Abstract

Speech emotion recognition (SER) has long been a complex and difficult task due to the complexity of emotion. In this paper, we propose a multitask deep learning approach based on a cascaded attention network and a self-adaption loss for SER. First, non-personalized features are extracted to represent the process of emotion change while reducing the influence of external variables. Second, to highlight salient speech emotion features, a cascaded attention network is proposed, in which spatial-temporal attention effectively locates the regions of speech that express emotion, while self-attention reduces the dependence on external information. Finally, the influence of differences in gender and in human perception of external information is alleviated by a multitask learning strategy, in which a self-adaption loss is introduced to determine the weights of the different tasks dynamically. Experimental results on the IEMOCAP dataset demonstrate that our method achieves absolute improvements of 1.97% and 0.91% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
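To make the two components described in the abstract concrete, the following is a minimal PyTorch-style sketch, not the authors' exact architecture: a cascaded attention block (spatial-temporal attention followed by self-attention) and a dynamically weighted multitask loss. The paper does not specify implementation details here; the module names, the choice of gender classification as the auxiliary task, and the uncertainty-weighting form of the self-adaption loss are all illustrative assumptions.

```python
# Hypothetical sketch of a cascaded attention encoder and a self-adaptive
# multitask loss; details are assumptions, not the paper's specification.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CascadedAttention(nn.Module):
    """Spatial-temporal attention followed by self-attention over frame features."""

    def __init__(self, feat_dim: int, n_heads: int = 4):
        super().__init__()
        # Spatial-temporal attention: score each time-feature position of the input map.
        self.st_score = nn.Conv2d(1, 1, kernel_size=3, padding=1)
        # Self-attention across frames to model intra-utterance dependencies.
        self.self_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feat_dim) non-personalized acoustic features
        st_weights = torch.sigmoid(self.st_score(x.unsqueeze(1))).squeeze(1)
        x = x * st_weights                      # emphasize emotion-salient regions
        attn_out, _ = self.self_attn(x, x, x)   # frame-to-frame self-attention
        return attn_out.mean(dim=1)             # utterance-level embedding


class SelfAdaptionLoss(nn.Module):
    """Dynamic task weighting via learnable log-variances (uncertainty weighting);
    an illustrative stand-in for the paper's self-adaption loss."""

    def __init__(self, n_tasks: int = 2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses: list) -> torch.Tensor:
        total = torch.zeros((), device=losses[0].device)
        for loss, log_var in zip(losses, self.log_vars):
            total = total + torch.exp(-log_var) * loss + log_var
        return total


if __name__ == "__main__":
    feats = torch.randn(8, 120, 64)              # (batch, frames, features)
    encoder = CascadedAttention(feat_dim=64)
    emo_head = nn.Linear(64, 4)                  # 4 emotion classes (common IEMOCAP subset)
    gen_head = nn.Linear(64, 2)                  # assumed auxiliary gender task
    criterion = SelfAdaptionLoss(n_tasks=2)

    emb = encoder(feats)
    emo_loss = F.cross_entropy(emo_head(emb), torch.randint(0, 4, (8,)))
    gen_loss = F.cross_entropy(gen_head(emb), torch.randint(0, 2, (8,)))
    print(criterion([emo_loss, gen_loss]).item())
```

In this sketch the learnable log-variances let the training process itself decide how much weight each task receives, which matches the abstract's description of weights being determined dynamically rather than hand-tuned.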

© 2022 The Institute of Electronics, Information and Communication Engineers