Determining the base frequency of the F0 contour generation model for the diverse expression of speech

Yoshiko Arimoto; Yasuo Horiuchi; Sumio Ohno

doi:10.1250/ast.e24.05

Abstract

A reliable method of determining the base frequency (F_b) of utterances in various speaking styles is critical for stable command labeling in the Fujisaki model. To achieve stable command labeling for diverse expressions of speech, a linear model was fitted to estimate a consistent F_b for each utterance, using as the independent variable the 10th percentile of F₀ in each utterance drawn from three corpora of different speaking styles (read, acted, and spontaneous). To assess the robustness of the model for unknown utterances, the model was applied to test data, comprising both open and corpus-open data that were not used in model development, and the difference between the estimated F_b and the F_b annotated by trained labelers was calculated. The resulting estimation model fit the manually labeled F_b values well, with a small root mean squared error (RMSE) of 0.096 and a high coefficient of determination (R²) of 0.89 for the closed dataset. The model also exhibited a small RMSE of 0.091 and a high R² of 0.92 for the corpus-open dataset. These results indicate that the proposed model can reliably estimate the F_b of utterances in various speaking styles.
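
The estimation procedure described above can be illustrated with a minimal sketch. It assumes that F₀ and F_b are handled on a natural-log scale and that unvoiced frames are coded as 0 Hz; the abstract states neither. The function names, the synthetic data, and the coefficients 0.9 and 0.1 are hypothetical and are not the authors' values or implementation.

```python
import numpy as np

def tenth_percentile_log_f0(f0_hz):
    """10th percentile of log F0 over voiced frames (assumes 0 Hz marks unvoiced)."""
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0[f0 > 0]
    return np.percentile(np.log(voiced), 10)

def fit_linear(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def rmse(y_true, y_pred):
    """Root mean squared error between labeled and estimated values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic stand-in for (10th-percentile log F0, manually labeled log F_b) pairs.
# These numbers are invented for illustration only.
rng = np.random.default_rng(0)
x_train = rng.uniform(4.0, 6.0, 200)
y_train = 0.9 * x_train + 0.1 + rng.normal(0.0, 0.05, 200)

slope, intercept = fit_linear(x_train, y_train)

# Held-out synthetic utterances play the role of the open test data.
x_test = rng.uniform(4.0, 6.0, 50)
y_test = 0.9 * x_test + 0.1 + rng.normal(0.0, 0.05, 50)
y_hat = slope * x_test + intercept

test_rmse = rmse(y_test, y_hat)
test_r2 = r_squared(y_test, y_hat)
```

In use, `tenth_percentile_log_f0` would be applied to each utterance's F₀ contour to produce the independent variable, and `rmse` and `r_squared` would be computed separately for the closed, open, and corpus-open sets.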

This article is licensed under a Creative Commons [Attribution-NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nd/4.0/
