Nonparallel Dictionary-Based Voice Conversion Using Variational Autoencoder with Modulation-Spectrum-Constrained Training

Tuan Vu Ho; Masato Akagi

doi:10.2299/jsp.22.189

Abstract

In this paper, we present a nonparallel voice conversion (VC) approach that requires neither parallel data nor linguistic labeling for training. Dictionary-based voice conversion is a class of methods that decompose speech into separate factors for manipulation. Non-negative matrix factorization (NMF) is the most common such method: it represents an input spectrum as a weighted linear combination of dictionary (basis) elements. However, the requirement for parallel training data in this method causes two problems: 1) practical usability is limited when parallel data are not available, and 2) the additional error from the alignment process degrades the output speech quality. To alleviate these problems, we present a dictionary-based VC approach that incorporates a variational autoencoder (VAE) to decompose an input speech spectrum into a speaker dictionary and weights without parallel training data. According to evaluation results, the proposed method achieves better speech naturalness while retaining the same speaker similarity as NMF-based VC, even though unaligned data are used.
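
As a minimal illustration of the dictionary-plus-weights decomposition that the abstract attributes to NMF, the sketch below factors a non-negative matrix V (frequency bins by frames) into a dictionary W and weights H using the standard multiplicative updates for the Euclidean objective. This is a generic NMF sketch, not the authors' system or their VAE-based method; the function name `nmf`, the rank, the iteration count, and the synthetic input are all assumptions made for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative V (n_freq x n_frames) as W @ H.

    W holds the dictionary (basis spectra) in its columns and H holds
    the per-frame weights. Uses multiplicative updates for the
    Euclidean (Frobenius) objective; this is an illustrative choice.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    # Random non-negative initialisation (hypothetical choice).
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        # Update the weights with the dictionary held fixed.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # Update the dictionary with the weights held fixed.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic stand-in for a magnitude spectrogram: 64 bins, 100 frames.
data_rng = np.random.default_rng(1)
V = data_rng.random((64, 8)) @ data_rng.random((8, 100))

W, H = nmf(V, rank=8)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In an NMF-based VC setting, each column of V would be one spectral frame, so every frame is reconstructed as a non-negative weighted sum of the columns of W.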
