2020 Volume 140 Issue 11 Pages 1207-1212
Speech synthesis has recently attracted attention as a key technology for broadcasting original videos featuring characters on YouTube. For methods based on generative adversarial networks (GANs), two problems remain unsolved in producing natural speech: controlling the impression of the synthesized speech (warm, cool, and so on), and long-term optimization of the synthesized speech. Regarding the former, conventional methods have focused on natural intonation and have not sufficiently addressed impression. To handle impression, we propose a new GAN-based speech synthesis method that uses an impression vector, a numerical representation of the speaker's impression. Regarding the latter, conventional methods insufficiently optimize the relationships among frames, so the synthesized speech still sounds unnatural. To solve this problem, inspired by image synthesis techniques such as HDGAN, we propose a new GAN-based network structure whose characteristic feature is a set of hierarchically nested discriminators attached to intermediate layers of the generator. In an experiment using 15 speech samples synthesized by the proposed method and 14 impression items, 11 listeners judged the impressions as a subjective evaluation of impression recognition accuracy. The proposed method achieved a subjective accuracy of 40.61%.
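To make the reported figure concrete, the following is a minimal sketch of how an impression recognition accuracy of this kind could be computed. It is not the authors' evaluation code. It assumes that each of the 11 listeners judged each of the 15 speech samples once (165 judgments) and chose a single impression out of the 14 items; the abstract states neither. The function name `recognition_accuracy` and the count of 67 correct judgments are hypothetical, chosen only because 67 of 165 is consistent with 40.61%.

```python
def recognition_accuracy(chosen, intended):
    """Fraction of judgments where the chosen impression equals the intended one."""
    if len(chosen) != len(intended) or not chosen:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(c == t for c, t in zip(chosen, intended))
    return correct / len(chosen)


# Figures stated in the abstract.
N_SPEECHES = 15
N_LISTENERS = 11
N_ITEMS = 14

# Assumption: every listener judged every speech sample exactly once.
total_judgments = N_SPEECHES * N_LISTENERS

# Assumption: one item is chosen per judgment, so random guessing
# would be correct with probability 1 / N_ITEMS.
chance_level_pct = round(100 / N_ITEMS, 2)

# Hypothetical count of correct judgments (not reported in the abstract).
hypothetical_correct = 67
accuracy_pct = round(100 * hypothetical_correct / total_judgments, 2)

print(f"judgments: {total_judgments}")
print(f"chance level: {chance_level_pct}%")
print(f"accuracy: {accuracy_pct}%")
```

Under these assumptions, the reported accuracy can be compared with a chance level of roughly 7%. If listeners were allowed to select several items per sample, the chance level would differ.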
The transactions of the Institute of Electrical Engineers of Japan.C
The Journal of the Institute of Electrical Engineers of Japan