Making an English Speech Resemble the User’s Voice Using UTAU and Interactive Evolutionary Computation

Taichi MIYAMOTO; Haoran GAN; Makoto FUKUMOTO

doi:10.5057/isase.2022-C000024

Abstract

In general, learning English is difficult for non-native speakers because its vowels and consonants differ from those of their native languages. There are several ways to practice English pronunciation, such as shadowing. However, if the voice features of the model audio differ greatly from those of the learner’s voice, this may impede learning and sound reproduction. To solve this problem, we propose a method that makes the model pronunciation audio resemble the learner’s own voice by using UTAU and Interactive Evolutionary Computation. The experiments showed that the method was capable of finding highly evaluated solutions. A Wilcoxon signed-rank test was used to examine the difference between the evaluations of the initial and final generations, and a significant difference was observed (P<0.01). Regarding the pitch parameters, we found different tendencies between male and female participants, which indicates that the parameters were actually making the voice similar to the participant’s voice. However, some problems remained, such as parameters that did not work well, the UTAU voice quality, and the lack of female participants. We plan to eliminate or at least reduce the effects of these problems in future experiments and to build a better system so that English learners can learn more efficiently.
