Investigation and Analysis of Training Data Augmentation Methods for Speech Emotion Recognition

目良 和也; 坂根 剛; 黒澤 義明; 竹澤 寿幸

doi:10.11517/pjsai.JSAI2024.0_2L1OS9a03

Abstract

Machine learning-based Speech Emotion Recognition (SER) and Emotional Speech Synthesis have become increasingly popular in recent years. However, it is difficult to prepare enough training data that closely matches the intended use. One way to increase the volume of data is data augmentation, and various augmentation methods have been proposed in the fields of Automatic Speech Recognition (ASR) and Image Recognition (IR). This paper proposes increasing SER training data by applying augmentation methods from the ASR and IR fields. Six data augmentation techniques (Time Stretch, Frequency Masking, Time Masking, Frequency Warping, Low-latency Low-resource Voice Conversion (LLVC), and CopyPaste) are applied to SER training data, and their effectiveness is compared. The experimental results indicate that applying multiple data augmentation methods improved SER performance. In particular, the combination of LLVC and CopyPaste improved SER accuracy by 0.24 points over the baseline.
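To illustrate two of the listed techniques, the following is a minimal sketch of Frequency Masking and Time Masking on a spectrogram, in the style commonly used in ASR. The abstract does not give the authors' settings, so the function names, the mask widths, the zero fill value, and the 80 x 200 toy spectrogram are all assumptions made for illustration, not the paper's implementation.

```python
import random

def frequency_mask(spec, max_width, rng):
    """Return a copy of spec with a random band of frequency bins (rows) zeroed.

    spec is a list of rows, one row per frequency bin; the zero fill value
    and the uniform choice of width and position are illustrative assumptions.
    """
    n_freq = len(spec)
    width = rng.randint(0, min(max_width, n_freq))   # inclusive bounds
    start = rng.randint(0, n_freq - width)
    return [
        [0.0] * len(row) if start <= i < start + width else list(row)
        for i, row in enumerate(spec)
    ]

def time_mask(spec, max_width, rng):
    """Return a copy of spec with a random span of time frames (columns) zeroed."""
    n_time = len(spec[0])
    width = rng.randint(0, min(max_width, n_time))
    start = rng.randint(0, n_time - width)
    return [
        [0.0 if start <= t < start + width else v for t, v in enumerate(row)]
        for row in spec
    ]

# Toy spectrogram: 80 frequency bins x 200 time frames, all ones (hypothetical sizes).
rng = random.Random(0)
spec = [[1.0] * 200 for _ in range(80)]

# Combining several augmentations, as the abstract reports doing, is simple chaining.
augmented = time_mask(frequency_mask(spec, 10, rng), 20, rng)
```

Each call produces a new training example from the same utterance while leaving the original spectrogram untouched, which is how augmentation increases data volume without new recordings.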
