Speech Synthesis Using a GAN with Hierarchical Discriminators Considering Speaker Impression

森 優斗; 井上 勝文; 吉岡 理文

doi:10.1541/ieejeiss.140.1207

Abstract

Recently, speech synthesis has attracted attention as a key technology for distributing original character-based videos on YouTube. For methods based on GAN (Generative Adversarial Network) to produce natural speech, two problems remain unsolved: the impression of the synthesized speech (e.g., warm or cool), and long-term optimization of speech synthesis. Regarding the former, conventional methods have focused on natural intonation and have not sufficiently addressed impression. To handle impression, we propose a new GAN-based speech synthesis method that uses an impression vector, a numerical representation of the speaker's impression. Regarding the latter, conventional methods insufficiently optimize the relationships among frames, so the synthesized speech is still not natural. To solve this problem, inspired by image synthesis techniques such as HDGAN, we propose a new GAN-based network structure whose characteristic feature is a set of discriminators hierarchically nested at intermediate layers of the generator. In experiments with 15 speech samples synthesized by the proposed method and 14 impression items, 11 listeners judged the impression of each sample, and we measured impression recognition accuracy as a subjective evaluation. The proposed method achieved a subjective accuracy of 40.61%.
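To put the reported accuracy in context, the following is a minimal arithmetic sketch, not the authors' evaluation code. It assumes that each of the 11 listeners judged each of the 15 speech samples exactly once, and that each judgment was a single choice among the 14 impression items; neither assumption is stated in the abstract. The function name `subjective_accuracy` is hypothetical.

```python
# Sketch under stated assumptions: one judgment per (listener, sample) pair,
# and a forced choice among the impression items. Neither is confirmed by the abstract.

N_SPEECHES = 15   # synthesized speech samples (from the abstract)
N_LISTENERS = 11  # listeners (from the abstract)
N_ITEMS = 14      # impression items (from the abstract)
REPORTED = 40.61  # reported subjective accuracy in percent (from the abstract)


def subjective_accuracy(correct: int, total: int) -> float:
    """Return the percentage of correct judgments out of total judgments."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total, and total must be positive")
    return 100.0 * correct / total


# Total judgments if every listener rated every sample once (assumption).
total_judgments = N_SPEECHES * N_LISTENERS

# Counts of correct judgments that round to the reported percentage.
consistent_counts = [
    c for c in range(total_judgments + 1)
    if abs(subjective_accuracy(c, total_judgments) - REPORTED) < 0.005
]

# Accuracy of uniform random guessing among the items (assumption: forced choice).
chance_level = 100.0 / N_ITEMS

print(total_judgments, consistent_counts, round(chance_level, 2))
```

Under these assumptions, the reported figure is consistent with a whole number of correct judgments and lies well above the uniform-guessing baseline of 1 in 14.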
