Journal of the Acoustical Society of Japan (日本音響学会誌)
Online ISSN : 2432-2040
Print ISSN : 0369-4232
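Phonemic Feature and Vocal Feature of Voice: Synthesis of Speech Sounds Using an Acoustic Model of the Vocal Tract
Noriko Umeda, Ryunen Teranishi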

1966, Volume 22, Issue 4, pp. 195-203

Abstract

This paper describes a simple device that acoustically simulates the human vocal tract, and reports results obtained from synthesized speech sounds produced by the device. It is simpler and easier to handle than an electrical vocal tract simulator. The acoustic models of the vocal tract are box-shaped and made of transparent acrylic resin. The vocal tract length of the man's model is 17.5 cm; that of the woman's model is 14 cm, about 80% of the man's; and those of the children's models are about 11 cm and 9 cm. The height is 2.5 cm. The cross-sectional areas of the models are varied by moving 1-cm-thick plastic strips that are inserted closely from one side. The models also have a nasal branch. Glottal sounds are fed into one end of a model (the glottis) and radiated from the other end (the mouth). Various vowels and other sustained sounds are produced according to the configuration of the model at the time. The driving unit of a horn speaker (NEC-555M, Japan) was used as the sound source. Because the acoustic impedance at the human glottis is very high, a bundle of steel wires, each 1.5 mm in diameter and 14 mm in length, was packed tightly into the throat of the loudspeaker. Consequently, the cross-sectional area of the throat is about 1.3 cm^2. Observation of sustained speech sounds reveals two features. One is the phonemic feature, that is, the feature that distinguishes one phoneme from another. The other is the feature that contributes to naturalness, that is, the feature that distinguishes not only males, females, and children but also individuals from one another. We have succeeded in clarifying these features physically. When the length of the vocal tract is reduced gradually from 17.5 cm, the configuration for each phoneme can be reduced in proportion without spoiling the phonemic feature. As for the cross-sectional areas, only relative values are required.
We can therefore normalize the vocal tract configuration of every phoneme with respect to the vocal tract length and the cross-sectional areas. Using the normalized configurations yields normalized spectra of the phonemes. The relation between the vocal tract length and the fundamental frequency of the voice, which serves to distinguish speakers by sex and age, can also be normalized. That is, if the normalized pitch is defined as the ratio of the frequency whose wavelength is four times the tract length to the fundamental frequency of the voice, natural synthesized speech sounds are obtained when the normalized pitch ranges from 2.5 to 5.0. The glottal waveform referred to here is a sawtooth. When the decay time T_3 (msec) of the sawtooth waveform is sufficiently short compared with one period T, the first zero of the glottal spectrum appears at 1/T_3 (kc, i.e. kHz). The glottal sound spectrum has a great influence on the characteristics of each individual's voice. The shorter the decay time, the sharper the voice becomes. A longer decay time (about 0.6-1.0 msec) is better for a female voice, while a shorter one (about 0.2-0.5 msec) is better for a child's voice.
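
The two numerical relations stated in the abstract, the normalized pitch and the first zero of the glottal spectrum, can be illustrated with a minimal Python sketch. It is not the authors' procedure. The speed of sound (340 m/s) is an assumption, since the abstract does not state one, and the function names are hypothetical. The sketch takes the quarter-wavelength frequency as c / (4L) for tract length L, divides it by the fundamental frequency to get the normalized pitch, and inverts the stated 2.5 to 5.0 range to find the fundamental frequencies that would fall inside it for each model length.

```python
# Assumed value; the abstract does not give the speed of sound.
SPEED_OF_SOUND_M_PER_S = 340.0


def quarter_wave_frequency_hz(tract_length_cm, c=SPEED_OF_SOUND_M_PER_S):
    """Frequency whose wavelength is four times the tract length."""
    tract_length_m = tract_length_cm / 100.0
    return c / (4.0 * tract_length_m)


def normalized_pitch(tract_length_cm, f0_hz, c=SPEED_OF_SOUND_M_PER_S):
    """Ratio of the quarter-wavelength frequency to the fundamental f0."""
    return quarter_wave_frequency_hz(tract_length_cm, c) / f0_hz


def natural_f0_range_hz(tract_length_cm, low=2.5, high=5.0,
                        c=SPEED_OF_SOUND_M_PER_S):
    """Fundamental frequencies giving a normalized pitch in [low, high]."""
    f_quarter = quarter_wave_frequency_hz(tract_length_cm, c)
    return f_quarter / high, f_quarter / low


def first_glottal_zero_khz(decay_time_ms):
    """First spectral zero at 1/T_3 (kHz), valid when T_3 is much
    shorter than one period of the sawtooth."""
    return 1.0 / decay_time_ms


# Model lengths taken from the abstract (man, woman, two children's models).
for label, length_cm in [("man", 17.5), ("woman", 14.0),
                         ("child A", 11.0), ("child B", 9.0)]:
    f0_low, f0_high = natural_f0_range_hz(length_cm)
    print(f"{label}: f0 from {f0_low:.0f} Hz to {f0_high:.0f} Hz")

# Decay times quoted in the abstract, in msec.
for t3_ms in (0.2, 0.5, 0.6, 1.0):
    print(f"T_3 = {t3_ms} msec: first zero near "
          f"{first_glottal_zero_khz(t3_ms):.2f} kHz")
```

Because the upper and lower limits of the normalized pitch differ by a factor of two, the admissible fundamental frequency for any given model length spans exactly one octave, whatever speed of sound is assumed.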

© 1966 Acoustical Society of Japan