Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232


End-to-end conversational speech synthesis with controllable emotions in the dimensions of pleasantness and arousal
Hiroki Mori, Hironao Nishino
JOURNAL OPEN ACCESS Advance online publication

Article ID: e24.13

Abstract

We propose an end-to-end conversational speech synthesis system that allows flexible control of emotional states defined over emotion dimensions. We extend the Tacotron 2 and VITS architectures to accept emotion dimensions as input. The model is first pre-trained on a large-scale spontaneous speech corpus, then fine-tuned on a natural dialogue speech corpus with manually annotated perceived emotion in the form of pleasantness and arousal. Since the pre-training corpus lacks emotion labels, we explore two pre-training strategies and demonstrate that applying an emotion dimension estimator to the corpus before pre-training enhances emotion controllability. Evaluation of speech synthesized with the VITS-based model yields a mean opinion score of 4 or higher for naturalness. Furthermore, the given and perceived emotional states correlate at R = 0.53 for pleasantness and R = 0.89 for arousal. These results underscore the effectiveness of the proposed conversational speech synthesis system with emotion control.
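The abstract describes conditioning a TTS model on a two-dimensional emotion vector (pleasantness, arousal). The paper itself gives no code, but one common way to realize such conditioning, sketched below under assumed details, is to project the emotion vector through a learned linear map and broadcast-add it to every encoder frame; the function name, the projection weights `W`, `b`, and the [-1, 1] emotion scale are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def condition_on_emotion(encoder_out, emotion, W, b):
    """Hypothetical conditioning: project a (pleasantness, arousal)
    pair into the encoder's embedding space and add it to each frame."""
    emo_embed = emotion @ W + b      # (2,) @ (2, d) + (d,) -> (d,)
    return encoder_out + emo_embed   # broadcast over the time axis

rng = np.random.default_rng(0)
T, d = 5, 8                          # frames, embedding dimension
enc = rng.standard_normal((T, d))    # stand-in for encoder outputs
W = rng.standard_normal((2, d)) * 0.1
b = np.zeros(d)

# pleasantness = 0.8, arousal = -0.2 (assumed [-1, 1] scale)
out = condition_on_emotion(enc, np.array([0.8, -0.2]), W, b)
print(out.shape)
```

In a trained system `W` and `b` would be learned jointly with the synthesis model, so the decoder can map directions in the emotion plane to acoustic variation.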

© 2024 by The Acoustical Society of Japan

This article is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International license.
https://creativecommons.org/licenses/by-nd/4.0/