Most schemes for the recognition of connected speech are based on segmenting the speech signal into separate units and then recognizing those units as individual syllables. Because the acoustical properties of these units differ considerably from those of monosyllables uttered in isolation, however, it is to be expected that the units, when taken out of connected speech and presented in isolation, are perceptually different from the corresponding isolated monosyllables. If such perceptual differences are shown to exist, they may serve as important evidence that schemes for automatic recognition of connected speech need to incorporate systematic removal of contextual variations. From this point of view, investigations were made on the perceptual properties of vowels segmented from isolated monosyllables and from connected speech, as well as those of monosyllables and larger units segmented from connected speech, and the following results were obtained:

1. In the case of an isolated C-V syllable, the perception of the consonant was found to be affected by the systematic removal of the initial portion of the syllable, until finally only the vowel was perceived. The perception of the vowel, however, remained unaffected (Figs. 2 and 3).

2. On the other hand, the perception of a vowel in connected speech was found to be seriously impaired by the removal of its environment. In the case of the vowel /a/, for example, only 12 out of 32 samples from clearly pronounced connected speech were judged correctly more than 50% of the time when taken out of their environments and presented in isolation. The average score of correct identification for the 32 samples was only 57% (Figs. 4 and 5). The scores for the vowels /i/, /e/, /o/ and /u/ ranged from 52% to 70%. These perceptual confusions of vowels in connected speech were found to be highly correlated with their acoustical properties in the F_1-F_2 plane (Fig. 6).

3. When monosyllabic segments were taken out of connected speech and presented in isolation, only 92 out of 219 samples were identified correctly, corresponding to a score as low as 42%. Because of the preceding consonantal environment, however, the score for the vowels in this case rose to about 80%. In the case of bisyllabic segments, the scores of correct identification for the first and the second syllables were 62% and 76% respectively (Fig. 7). In the case of trisyllabic segments, the score for the middle syllable improved further to 95%, and the score for the middle vowel was as high as 97% (Figs. 8, 9, 10 and 11).

These experimental results indicate that the perception of vowels or monosyllables in connected speech is seriously impaired by the complete removal of their environments, and that at least two syllables, one preceding and one following, are necessary to provide a perceptual environment for their correct identification.