Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232


End-to-end conversational speech synthesis with controllable emotions in the dimensions of pleasantness and arousal
Hiroki Mori, Hironao Nishino
JOURNAL OPEN ACCESS Advance online publication

Article ID: e24.13

Abstract

We propose an end-to-end conversational speech synthesis system that allows flexible control of emotional states defined over emotion dimensions. We extend the Tacotron 2 and VITS architectures to accept emotion dimensions as input. The model is first pre-trained on a large-scale spontaneous speech corpus, then fine-tuned on a natural dialogue speech corpus with manually annotated perceived emotion in the form of pleasantness and arousal. Since the pre-training corpus lacks emotion labels, we explore two pre-training strategies and demonstrate that applying an emotion dimension estimator to the corpus before pre-training enhances emotion controllability. Evaluation of speech synthesized with the VITS-based model yields a mean opinion score of 4 or higher for naturalness. Furthermore, the given and perceived emotional states correlate at R = 0.53 for pleasantness and R = 0.89 for arousal. These results underscore the effectiveness of the proposed conversational speech synthesis system with emotion control.
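The abstract describes conditioning a TTS model on a two-dimensional emotion vector (pleasantness, arousal). The paper itself gives no code, but one common way to realize such conditioning, sketched below under assumed details, is to project the emotion vector through a learned linear map and broadcast-add it to every encoder frame; the function name, the projection weights `W`, `b`, and the [-1, 1] emotion scale are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def condition_on_emotion(encoder_out, emotion, W, b):
    """Hypothetical conditioning: project a (pleasantness, arousal)
    pair into the encoder's embedding space and add it to each frame."""
    emo_embed = emotion @ W + b      # (2,) @ (2, d) + (d,) -> (d,)
    return encoder_out + emo_embed   # broadcast over the time axis

rng = np.random.default_rng(0)
T, d = 5, 8                          # frames, embedding dimension
enc = rng.standard_normal((T, d))    # stand-in for encoder outputs
W = rng.standard_normal((2, d)) * 0.1
b = np.zeros(d)

# pleasantness = 0.8, arousal = -0.2 (assumed [-1, 1] scale)
out = condition_on_emotion(enc, np.array([0.8, -0.2]), W, b)
print(out.shape)
```

In a trained system `W` and `b` would be learned jointly with the synthesis model, so the decoder can map directions in the emotion plane to acoustic variation.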

© 2024 by The Acoustical Society of Japan

This article is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International license.
https://creativecommons.org/licenses/by-nd/4.0/