IEEJ Transactions on Electronics, Information and Systems
Online ISSN : 1348-8155
Print ISSN : 0385-4221
ISSN-L : 0385-4221
<Speech and Image Processing, Recognition>
Speech Synthesis based on Speaker Impression with Hierarchical Discriminator GAN
Yuto Mori, Katsufumi Inoue, Michifumi Yoshioka

2020 Volume 140 Issue 11 Pages 1207-1212

Abstract

Recently, speech synthesis has attracted attention as a key technology for broadcasting original videos featuring characters on YouTube. For methods based on GAN (Generative Adversarial Network) to produce natural speech, two problems remain unsolved: the impression of the synthesized speech, such as warm or cool, and long-term optimization of speech synthesis. Regarding the former, conventional methods have focused on the natural intonation of speech and have not discussed impression sufficiently. To deal with impression, we propose a new GAN-based speech synthesis method that uses an impression vector, a numerical representation of the speaker impression. Regarding the latter, conventional methods optimize the relationships among frames insufficiently, so the synthesized speech is still not natural. To solve this problem, inspired by image synthesis technology such as HDGAN, we propose a new GAN-based network structure whose characteristic feature is a set of hierarchically nested discriminators attached to intermediate layers of the generator. In experiments with 15 speech samples synthesized by the proposed method and 14 impression items, we measured the impression recognition accuracy of 11 listeners as a subjective evaluation. The experimental results show a subjective accuracy of 40.61%.
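
The structure described above, discriminators nested at intermediate layers of a generator that is conditioned on an impression vector, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The layer sizes, the conditioning by concatenation, the one-hot 14-dimensional impression vector, the non-saturating loss, and all names (`Generator`, `Discriminator`, `generator_loss`) are assumptions made only for illustration.

```python
import math
import random


def init(n_in, n_out, rng):
    """Small random weight matrix with n_in rows and n_out columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]


def linear(x, w):
    """Multiply vector x (length n_in) by weight matrix w (n_in x n_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


class Generator:
    """Toy generator that exposes every layer's activation as a side output."""

    def __init__(self, n_ling, n_imp, hidden_sizes, rng):
        sizes = [n_ling + n_imp] + hidden_sizes
        self.weights = [init(a, b, rng) for a, b in zip(sizes, sizes[1:])]

    def forward(self, linguistic, impression):
        # Assumption: the impression vector is concatenated to the input features.
        h = linguistic + impression
        side_outputs = []
        for w in self.weights:
            h = [math.tanh(v) for v in linear(h, w)]
            side_outputs.append(h)
        return side_outputs  # the last entry is the final generator output


class Discriminator:
    """Toy discriminator that scores one level of the generator in (0, 1)."""

    def __init__(self, n_in, n_imp, rng):
        self.w = init(n_in + n_imp, 1, rng)

    def score(self, features, impression):
        z = linear(features + impression, self.w)[0]
        return 1.0 / (1.0 + math.exp(-z))


def generator_loss(side_outputs, discriminators, impression):
    """Sum a non-saturating adversarial loss over all nested levels."""
    return sum(
        -math.log(d.score(h, impression))
        for h, d in zip(side_outputs, discriminators)
    )


rng = random.Random(0)
hidden_sizes = [16, 12, 10]  # hypothetical layer widths
n_imp = 14                   # assumption: one dimension per impression item
impression = [1.0 if i == 3 else 0.0 for i in range(n_imp)]  # assumed one-hot

gen = Generator(n_ling=8, n_imp=n_imp, hidden_sizes=hidden_sizes, rng=rng)
discs = [Discriminator(n, n_imp, rng) for n in hidden_sizes]

outs = gen.forward([0.5] * 8, impression)
loss = generator_loss(outs, discs, impression)
```

Each intermediate activation gets its own discriminator, so the generator receives an adversarial signal at every level rather than only at its final output. In this toy version the levels are plain feature vectors; a real system would operate on sequences of acoustic frames.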

© 2020 by the Institute of Electrical Engineers of Japan