Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
PAPER
End-to-end conversational speech synthesis with controllable emotions in the dimensions of pleasantness and arousal
Hiroki Mori, Hironao Nishino
JOURNAL OPEN ACCESS

2025 Volume 46 Issue 1 Pages 70-77

Abstract

We propose an end-to-end conversational speech synthesis system that allows for flexible control of emotional states defined over emotion dimensions. We extend the Tacotron 2 and VITS architectures to accept emotion dimensions as input. The model is first pre-trained on a large-scale spontaneous speech corpus, then fine-tuned on a natural dialogue speech corpus with manually annotated perceived emotion in the form of pleasantness and arousal. Since the pre-training corpus lacks emotion labels, we explore two pre-training strategies and demonstrate that applying an emotion dimension estimator before pre-training enhances emotion controllability. Evaluation of the speech synthesized with VITS yields a mean opinion score of 4 or higher for naturalness. Furthermore, there is a correlation of R=0.53 for pleasantness and R=0.89 for arousal between the given and perceived emotional states. These results underscore the effectiveness of the proposed conversational speech synthesis system with emotion control.
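The reported controllability figures are Pearson correlation coefficients between the emotion values given to the synthesizer and those perceived by listeners. As a minimal sketch (with made-up ratings, not the paper's data), that computation might look like:

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient between paired rating sequences
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical given vs. perceived arousal values (illustrative only)
given     = [1.0, 2.0, 3.0, 4.0, 5.0]
perceived = [1.5, 2.0, 3.5, 3.5, 4.5]
print(round(pearson_r(given, perceived), 2))  # → 0.97
```

A high R means perceived emotion tracks the input control value closely, which is why the abstract treats R=0.89 for arousal as strong controllability.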

© 2025 by The Acoustical Society of Japan

This article is licensed under a Creative Commons [Attribution-NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nd/4.0/