Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
PAPERS
Multi-setting acoustic feature training for data augmentation of speech recognition
Sei UenoAkinobu Lee
著者情報
ジャーナル オープンアクセス

2024 年 45 巻 4 号 p. 195-203

詳細
抄録

This paper presents simple multi-setting log Mel-scale filter bank (lmfb) training methods to fill the gap between real speech and synthesized speech in automatic speech recognition (ASR) data augmentation. While end-to-end ASR has been facing the lack of a sufficient amount of real speech data, its performance has been significantly improved by a data synthesis technique utilizing a TTS system. However, the generated speech from the TTS model is often monotonous and lacks the natural variations in real speech, negatively impacting ASR performance. We propose using multi-setting lmfb features for a data augmentation scheme to mitigate this problem. Multiple lmfb features are extracted with multiple STFT parameter settings that are obtained from well-known parameters for both ASR and TTS tasks. In addition, we also propose training a single TTS model using multi-setting lmfb features with its setting ID embedded in the text-to-Mel network. Experimental evaluations showed that both proposed multi-setting training methods achieved better ASR performance than the baseline single-setting training augmentation methods.

著者関連情報
© 2024 by The Acoustical Society of Japan

This article is licensed under a Creative Commons [Attribution-NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nd/4.0/
前の記事 次の記事
feedback
Top