Applying generation process model constraint to fundamental frequency contours generated by hidden-Markov-model-based speech synthesis

Tetsuya Matsuda; Keikichi Hirose; Nobuaki Minematsu

doi:10.1250/ast.33.221

Abstract

Speech synthesis based on hidden Markov models (HMMs) processes both segmental and prosodic features of speech together in a frame-by-frame manner. One benefit of this method is that the time alignment of the two kinds of features is maintained automatically. However, when the training data are limited, frame-by-frame representation is not appropriate for prosodic features, which are tightly related to speech units spanning a wide time range, such as words and phrases. This causes an inherent problem in fundamental frequency (F₀) contour generation by HMM-based speech synthesis. A method is developed to modify F₀ contours in the framework of the generation process model (henceforth, F₀ model) by referring to linguistic information of the input text (word boundaries and accent types). It takes F₀ variances obtained through HMM-based speech synthesis into account during the process. Through a listening experiment on synthetic speech, the method is shown to generate, on average, better quality than HMM-based speech synthesis. Since the F₀ model can clearly relate its commands to linguistic (and para-/non-linguistic) information, the method has an additional advantage: changing speech styles and/or adding further information (such as emphasis) can be done easily by manipulating the commands.
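
The abstract does not give the equations of the F₀ model. As a rough illustration only, the sketch below uses the commonly cited form of the generation process (Fujisaki) model, in which ln F₀ is a baseline plus phrase-command and accent-command responses. The constants `ALPHA`, `BETA`, and `GAMMA`, the helper names, and the command values are assumptions chosen for illustration; they are not taken from the paper, and the sketch does not reproduce the authors' modification procedure.

```python
import math

# Illustrative constants (assumed, not from the paper).
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism (zero for t < 0)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism (zero for t < 0)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, fb, phrase_cmds, accent_cmds):
    """ln F0 at time t.

    fb          : baseline frequency in Hz
    phrase_cmds : list of (onset_time, magnitude)
    accent_cmds : list of (onset_time, offset_time, amplitude)
    """
    value = math.log(fb)
    for t0, ap in phrase_cmds:
        value += ap * phrase_response(t - t0)
    for t1, t2, aa in accent_cmds:
        value += aa * (accent_response(t - t1) - accent_response(t - t2))
    return value


def f0_contour(times, fb, phrase_cmds, accent_cmds):
    """F0 in Hz at each time in `times`."""
    return [math.exp(log_f0(t, fb, phrase_cmds, accent_cmds)) for t in times]


# Hypothetical utterance: one phrase command and one accent command.
times = [i * 0.01 for i in range(150)]   # 0.00 s to 1.49 s
phrase = [(0.0, 0.5)]
accent = [(0.3, 0.7, 0.4)]
neutral = f0_contour(times, 120.0, phrase, accent)

# Emphasis sketched as a larger accent command amplitude.
emphasized = f0_contour(times, 120.0, phrase, [(0.3, 0.7, 0.6)])
```

Because the contour is driven by a small set of commands, a change such as emphasis can be expressed by editing a single amplitude rather than many frame-level F₀ values, which is the kind of manipulation the abstract refers to.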
