Cross-Dialectal Voice Conversion with Neural Networks

Weixun GAO; Qiying CAO; Yao QIAN

doi:10.1587/transinf.2014EDP7116

抄録

In this paper, we use neural networks (NNs) for cross-dialectal (Mandarin-Shanghainese) voice conversion using a bi-dialectal speakers' recordings. This system employs a nonlinear mapping function, which is trained by parallel mandarin features of source and target speakers, to convert source speaker's Shanghainese features to those of target speaker. This study investigates three training aspects: a) Frequency warping, which is supposed to be language independent; b) Pre-training, which drives weights to a better starting point than random initialization or be regarded as unsupervised feature learning; and c) Sequence training, which minimizes sequence-level errors and matches objectives used in training and converting. Experimental results show that the performance of cross-dialectal voice conversion is close to that of intra-dialectal. This benefit is likely from the strong learning capabilities of NNs, e.g., exploiting feature correlations between fundamental frequency (F0) and spectrum. The objective measures: log spectral distortion (LSD) and root mean squared error (RMSE) of F0, both show that pre-training and sequence training outperform the frame-level mean square error (MSE) training. The naturalness of the converted Shanghainese speech and the similarity between converted Shanghainese speech and target Mandarin speech are significantly improved.

著者関連情報

お気に入り & アラート

閲覧履歴

Clarification of Enhanced Hydroxyl Radical Production in Fenton Reaction with ATP/ADP Based on Luminol Chemiluminescence

発行機関からのお知らせ

PPV is available from https://globals.ieice.org/en_transactions/information

責任著者(Corresponding author)

J-STAGEへの登録はこちら（無料）