IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Regular Section
Speech Chain VC: Linking Linguistic and Acoustic Levels via Latent Distinctive Features for RBM-Based Voice Conversion
Takuya KISHIDA, Toru NAKASHIKA

2020 Volume E103.D Issue 11 Pages 2340-2350

Abstract

This paper proposes a voice conversion (VC) method based on a model that links linguistic and acoustic representations via latent phonological distinctive features. Our method, called speech chain VC, is inspired by the concept of the speech chain, in which speech communication consists of a chain of events linking the speaker's brain with the listener's brain. We assume that speaker identity information, which appears at the acoustic level, is embedded at two steps: where phonological information is encoded into articulatory movements (linguistic to physiological) and where articulatory movements generate sound waves (physiological to acoustic). Speech chain VC represents these event links with an adaptive restricted Boltzmann machine (ARBM) in which phoneme labels and acoustic features form two classes of visible units and latent phonological distinctive features associated with articulatory movements form the hidden units. Subjective evaluation experiments showed that the intelligibility of the converted speech improved significantly over the conventional ARBM-based method. The speaker-identity conversion quality of the proposed method was comparable to that of a Gaussian mixture model (GMM)-based method. Analyses of the hidden-layer representations of the speech chain VC model supported the interpretation that some of the hidden units correspond to phonological distinctive features. The final part of this paper proposes approaches to achieving one-shot VC with the speech chain VC model. Subjective evaluation experiments showed that, when the target speaker is of the same gender as the source speaker, the proposed methods can achieve one-shot VC from a single utterance of each of the source and target speakers.
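The following is a minimal NumPy sketch of the structure the abstract describes: two classes of visible units (one-hot phoneme labels and real-valued acoustic features) connected to a shared layer of binary latent units. It is an illustration only, not the authors' ARBM: the speaker-adaptive weight decomposition and the training procedure are omitted, and all dimensions, variable names, and function names below are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

n_phonemes = 40   # size of the one-hot phoneme label vector (hypothetical)
n_acoustic = 24   # dimensionality of the acoustic feature vector (hypothetical)
n_hidden = 30     # number of latent "distinctive feature" units (hypothetical)

# Each visible class has its own weight matrix into the shared hidden layer.
W_ling = 0.01 * rng.standard_normal((n_phonemes, n_hidden))
W_acou = 0.01 * rng.standard_normal((n_acoustic, n_hidden))
b_hid = np.zeros(n_hidden)
b_ling = np.zeros(n_phonemes)
b_acou = np.zeros(n_acoustic)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_given_visible(v_ling, v_acou):
    # p(h = 1 | labels, acoustics): both visible classes drive the same
    # latent feature layer, which is what links the two levels.
    return sigmoid(v_ling @ W_ling + v_acou @ W_acou + b_hid)

def visible_given_hidden(h):
    # Decode both visible classes from the latent features: a softmax
    # over phoneme labels and a linear (Gaussian-mean) acoustic output.
    logits = h @ W_ling.T + b_ling
    p_ling = np.exp(logits - logits.max())
    p_ling /= p_ling.sum()
    mean_acou = h @ W_acou.T + b_acou
    return p_ling, mean_acou

# One encode/decode pass with dummy data. In conversion, decoding would
# use weights adapted to the target speaker rather than the source.
v_ling = np.eye(n_phonemes)[3]            # dummy one-hot phoneme label
v_acou = rng.standard_normal(n_acoustic)  # dummy acoustic feature frame
h = hidden_given_visible(v_ling, v_acou)
p_ling, mean_acou = visible_given_hidden(h)
print(np.round(h, 2))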

© 2020 The Institute of Electronics, Information and Communication Engineers