THE JOURNAL OF THE ACOUSTICAL SOCIETY OF JAPAN
Online ISSN : 2432-2040
Print ISSN : 0369-4232
An Experiment on Male to Female Voice Conversion
Teruo Yasuhiro, Kazuhiko Ozeki

1976 Volume 32 Issue 6 Pages 362-368

Abstract
This paper describes a method of male-to-female voice conversion as an application of speech analysis and synthesis by linear prediction. The method was demonstrated at the open house of the NHK Technical Research Laboratories in 1975, where a synthesized female voice was presented whose original was a sentence from a weather forecast announcement spoken by a male announcer. The average formant frequencies of female voices are approximately 1.2 times as high as those of male voices, as shown in Fig. 2, and the average bandwidth of the first formant of female voices is approximately 1.3 times as wide as that of male voices, as shown in Fig. 3. In this experiment, both the pole frequencies and the bandwidths of the input speech spectra were multiplied by 1.3 simply by setting the sampling frequency of the D/A converter to 1.3 times that of the A/D converter. It is known that the pitch frequency of female voices is approximately twice as high as that of male voices, and that an optimal pitch frequency region exists for given formant frequencies. Therefore, we tried several multiplying factors for the pitch frequency between 1.7 and 2.5 and chose 2.1 as the best on the basis of an informal listening test. To soften the shrillness of the synthesized voice, we designed a filter to compensate for the difference between the glottal waveforms of female and male voices; its input-output relation is given by (9). In a standard case, in which the shorter of the rise time and the fall time of the glottal waveform of female voices is twice as long as that of male voices, the difference between the glottal waveforms can be compensated by a filter with the frequency characteristics shown in Fig. 4. The spectra of the vowel segments of the male voice used in this experiment have dips around 2 kHz, which correspond to a rise time (or fall time) of 0.5 ms.
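The sampling-rate step described above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it uses the standard mapping between a z-plane pole and its center frequency and bandwidth, and shows that the same pole reproduced at 1.3 times the sampling frequency has both values scaled by 1.3. The A/D rate of 10 kHz, the example formant (500 Hz, 60 Hz bandwidth), and the function names are assumptions chosen for illustration; none of them come from the paper.

```python
import math


def formant_to_pole(freq_hz, bandwidth_hz, fs_hz):
    """Return (radius, angle) of the z-plane pole for a given resonance."""
    radius = math.exp(-math.pi * bandwidth_hz / fs_hz)
    angle = 2.0 * math.pi * freq_hz / fs_hz
    return radius, angle


def pole_to_formant(radius, angle, fs_hz):
    """Return (center frequency, bandwidth) in Hz of a z-plane pole at rate fs_hz."""
    freq_hz = angle * fs_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(radius) * fs_hz / math.pi
    return freq_hz, bandwidth_hz


FS_AD = 10_000.0        # assumed A/D sampling frequency (hypothetical value)
RATIO = 1.3             # D/A-to-A/D sampling frequency ratio stated in the abstract
FS_DA = RATIO * FS_AD   # D/A sampling frequency

# Hypothetical male formant estimated by linear prediction at the A/D rate.
radius, angle = formant_to_pole(500.0, 60.0, FS_AD)

# The pole itself is unchanged; only the playback rate differs.
f_in, b_in = pole_to_formant(radius, angle, FS_AD)
f_out, b_out = pole_to_formant(radius, angle, FS_DA)

print(f"input : F = {f_in:.1f} Hz, B = {b_in:.1f} Hz")
print(f"output: F = {f_out:.1f} Hz, B = {b_out:.1f} Hz")
```

Because the pole positions are untouched, every pole of the predictor is scaled by the same factor, which is why a single change of the D/A sampling frequency shifts all formant frequencies and bandwidths together.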
The optimum value of the constant τ of the compensating filter appearing in (10), determined by an informal listening test, was also approximately 0.5 ms. The synthesized voice has excessive amplitude in one part, as shown in Fig. 7. To remove this defect, a saturating operation was applied to the intensity of the driving signal. With this method we obtained an almost satisfactory female voice without any phoneme-specific processing.
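The abstract does not specify the form of the saturating operation. As one possible reading, the sketch below applies a hard ceiling to a sequence of per-frame driving-signal intensities. The function name `saturate`, the ceiling value, and the frame intensities are all hypothetical and are given only to illustrate the idea.

```python
def saturate(intensities, ceiling):
    """Limit each driving-signal intensity to at most `ceiling` (hard saturation)."""
    return [min(value, ceiling) for value in intensities]


# Hypothetical per-frame intensities; the third frame is excessively loud.
frame_intensities = [0.2, 0.9, 1.6, 0.5]
CEILING = 1.0  # assumed ceiling, not a value from the paper

limited = saturate(frame_intensities, CEILING)
print(limited)
```

Only frames above the ceiling are changed, so the rest of the utterance keeps its original intensity contour.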
© 1976 Acoustical Society of Japan