Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
Machine learning-based Speech Emotion Recognition (SER) and Emotional Speech Synthesis have become increasingly popular in recent years. However, it is difficult to prepare enough training data that closely matches the intended use. One way to increase the amount of data is data augmentation, and various augmentation methods have been proposed in the fields of Automatic Speech Recognition (ASR) and Image Recognition (IR). This paper proposes increasing SER training data with augmentation methods drawn from the ASR and IR fields. Six data augmentation techniques (Time Stretch, Frequency Masking, Time Masking, Frequency Warping, Low-latency Low-resource Voice Conversion (LLVC), and CopyPaste) are applied to SER training data, and their effectiveness is compared. The experimental results indicate that applying multiple data augmentation methods improves SER performance. In particular, the combination of LLVC and CopyPaste improved SER accuracy by 0.24 points over the baseline.
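To make two of the listed techniques concrete, the following is a minimal sketch of Frequency Masking and Time Masking on a spectrogram. The abstract does not describe the authors' implementation, so this is an illustration under stated assumptions: the spectrogram is a plain list of frequency rows, masked cells are filled with 0.0, and the function names (`frequency_mask`, `time_mask`, `augment`) and the maximum mask widths are hypothetical choices.

```python
import random


def frequency_mask(spec, max_width, rng, fill=0.0):
    """Return a copy of spec with one random band of frequency rows masked.

    spec is a list of frequency rows, each a list of time-frame values
    (an assumed layout; the paper does not specify one).
    """
    n_freq = len(spec)
    width = rng.randint(0, min(max_width, n_freq))  # band height, may be 0
    start = rng.randint(0, n_freq - width)          # first masked row
    return [
        [fill] * len(row) if start <= f < start + width else list(row)
        for f, row in enumerate(spec)
    ]


def time_mask(spec, max_width, rng, fill=0.0):
    """Return a copy of spec with one random span of time frames masked."""
    n_time = len(spec[0])
    width = rng.randint(0, min(max_width, n_time))  # span length, may be 0
    start = rng.randint(0, n_time - width)          # first masked frame
    return [
        [fill if start <= t < start + width else v for t, v in enumerate(row)]
        for row in spec
    ]


def augment(spec, rng, max_freq_width=2, max_time_width=3):
    """Apply frequency masking, then time masking (hypothetical widths)."""
    masked = frequency_mask(spec, max_freq_width, rng)
    return time_mask(masked, max_time_width, rng)


if __name__ == "__main__":
    rng = random.Random(0)                     # seeded for repeatability
    spec = [[1.0] * 10 for _ in range(8)]      # toy 8 x 10 "spectrogram"
    for row in augment(spec, rng):
        print(" ".join(f"{v:.0f}" for v in row))
```

Both functions return a new spectrogram and leave the input untouched, so the original and the augmented copy can both be added to the training set. In practice an array library would replace the nested lists; plain lists are used here only to keep the sketch dependency-free.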