In vowel perception, the so-called "context effect" is caused by short-term memory (STM) of the preceding and following vowels. That is, when a vowel is identified in context, the phoneme boundary shifts because of the contrast effect caused by STM. The purpose of this paper is to investigate how the boundary shift is influenced by differences in the context of a vowel. Here, the phoneme boundary is determined quantitatively by psychological experiments using V_1CV_2 and V_1V_2 syllables. The results are summarized as follows: (1) STM of the preceding vowel gradually decays with time. Accordingly, the contrast effect on the following vowel decreases; that is, the boundary shift becomes smaller as STM decays. The retention curve of STM was therefore obtained by estimating the amount of boundary shift of V_2 caused by V_1 in identifying V_1V_2 (Fig. 2). Individual differences in STM appear in the retention curve obtained from the stimulus continuum. (2) In identifying V_1bV_2, the phoneme boundary of V_2 shifts temporarily towards V_1 because of the contrast effect caused by STM of V_1 (Fig. 6). In addition, the boundary shift depends on the quality of V_1. The amount of boundary shift increases monotonically as V_1 is displaced from the phoneme boundary. However, the amount decreases or saturates when V_1 enters the range of another vowel (Fig. 7). Furthermore, to investigate the two-dimensional boundary shift, the phoneme boundaries in the /e/-/a/-/u/ ranges were represented in F_1-F_2 space. When V_1 is /e/ and V_2 is a vowel in the vicinity of the boundary of /u/, the contrast effect occurs mainly between /e/ and /a/. Consequently, the boundary of /u/ shifts towards low F_1 (Fig. 8). From the viewpoint of distinctive features, /e/ is more similar to /a/ than to /u/. The acute of the distinctive feature is extracted from V_1 and V_2. Thus, the acute of V_1 is contrasted with that of V_2. (3) The effect of the difference in vowel position was investigated using V_1bV_2 and V_2bV_1 syllables. The boundary shift of V_2 in V_2bV_1 is more marked than that in V_1bV_2 (Fig. 9). That is, the effect of V_1 in V_2bV_1 is more pronounced than that in V_1bV_2. (4) In identifying V_1CV_2, C tends to reduce the contrast effect occurring between V_1 and V_2 if C bears the distinctive feature related to the contrast between V_1 and V_2 (Table 2).
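The notion of a quantitatively determined phoneme boundary and its shift can be illustrated with a minimal sketch. It assumes, since the abstract does not say how the boundary was computed, that the boundary is the point on the stimulus continuum where the proportion of one response category crosses 50%, found by linear interpolation. The function name `boundary` and all response proportions below are hypothetical and serve only as an illustration.

```python
def boundary(steps, proportions):
    """Stimulus value at which the response proportion crosses 0.5.

    Assumption: the boundary is the 50% crossover, located by linear
    interpolation between the two neighbouring stimuli that straddle 0.5.
    """
    points = list(zip(steps, proportions))
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p0 != p1 and (p0 - 0.5) * (p1 - 0.5) <= 0:
            return x0 + (0.5 - p0) * (x1 - x0) / (p1 - p0)
    raise ValueError("responses never cross 0.5")


# Hypothetical identification data on a 7-step vowel continuum:
# proportion of responses assigned to the category at the high end.
steps = [1, 2, 3, 4, 5, 6, 7]
isolated = [0.00, 0.05, 0.20, 0.50, 0.80, 0.95, 1.00]    # V_2 heard alone
in_context = [0.00, 0.00, 0.10, 0.30, 0.70, 0.90, 1.00]  # V_2 heard after V_1

# Boundary shift attributed to the context, in continuum steps.
shift = boundary(steps, in_context) - boundary(steps, isolated)
print(f"boundary shift: {shift:+.2f} steps")
```

The sign of the shift depends on how the continuum is oriented relative to V_1. Repeating the same calculation at several V_1-to-V_2 intervals would give one point per interval of a retention curve in the sense of result (1).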
A recognition system composed of the following three steps is proposed in our research on automatic speech recognition: the first step is the extraction of acoustic parameters; the second is the transformation of the acoustic parameters into a series of features that distinguish the kind of phoneme in each part of the speech; and the third is the transformation of the series of features into a string of characters or symbols that has linguistic meaning as a word or a short sentence. The use of linguistic information is considered to be effective in the third step. In the first and second steps, the local peaks in the short-time spectra, analyzed by a filter bank composed of 29 single-peak filters of low selectivity, are treated as the acoustic parameters. Experiments on many vowel samples uttered in isolation and in continuous utterances by 31 male adults were carried out to investigate the effectiveness of the local peaks as acoustic parameters for recognition, and they verified the usefulness of the local peaks for the discrimination of vowels. The use of the spectral local peaks is based on the conjecture that the local peaks may play a significant role in the processing of the speech signal after frequency analysis by the cochlea, and on the expectation that the variation of the features with time may be handled easily by means of the local peaks. The formant frequencies may, of course, have properties similar to those of the spectral local peaks, but it is implausible that the formant frequencies are extracted exactly in the auditory organ. According to the results of the investigation of the characteristics of the local peaks, the spectral local peaks are considered sufficient for use in the preprocessor of a speech recognition system that exploits linguistic information, such as a word dictionary.
This paper describes discrimination experiments on the vowels and consonants in the names of twenty Japanese cities uttered by 5 male adults, from which the standard patterns for the discrimination of phoneme groups were made using the static properties of the spectral local peaks. The speech samples are frequency-analyzed by a filter bank composed of 29 single-peak filters of Q ≈ 6. The center frequencies of the filters are spaced at intervals of 1/6 octave from 250 Hz to 6300 Hz. Three major spectral local peaks, P1, P2 and Pe3, are picked out every 10 ms from the six largest local peaks of the frequency spectrum obtained with the filter bank, by applying two peak-processing rules. The frequencies of these local peaks are treated as the acoustic parameters. The set of acoustic parameters is transformed into a code expressing the phoneme according to the domain on the P1-P2 and P2-Pe3 planes in which the set falls. A series of codes is thus obtained from an utterance. The average recognition score for vowels was 80%. The scores for the transformation of the consonant parts into the corresponding phoneme groups were more than 80%, except for voiced plosives (47%). These scores are not lower than those for the discrimination of speech segments by the human auditory sense. The results are considered sufficient for use in the preprocessor of a speech recognition system that uses linguistic information.
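The filter-bank layout and the first stage of peak picking can be sketched as follows. With 29 filters spaced 1/6 octave apart starting at 250 Hz, the highest center frequency comes out near 6350 Hz, which matches the stated upper limit of 6300 Hz if that figure is read as a nominal value. The function names, the definition of a local peak as a channel exceeding both neighbours, and the toy spectrum are assumptions made for illustration. The two peak-processing rules that select P1, P2 and Pe3 are not given in the abstract and are not reproduced here.

```python
def center_frequencies(f_low=250.0, n_filters=29, steps_per_octave=6):
    """Center frequencies in Hz, spaced 1/steps_per_octave octave apart."""
    return [f_low * 2 ** (k / steps_per_octave) for k in range(n_filters)]


def local_peaks(levels):
    """Indices of interior channels whose level exceeds both neighbours.

    Assumption: this is only one plausible definition of a local peak.
    """
    return [i for i in range(1, len(levels) - 1)
            if levels[i] > levels[i - 1] and levels[i] > levels[i + 1]]


def strongest_peaks(levels, n=6):
    """Up to n local peaks with the largest levels, in ascending channel order."""
    ranked = sorted(local_peaks(levels), key=lambda i: levels[i], reverse=True)
    return sorted(ranked[:n])


freqs = center_frequencies()

# Hypothetical channel levels (dB) for one 10 ms frame: a flat floor
# with three artificial peaks at channels 3, 10 and 17.
frame = [10.0] * 29
frame[3], frame[10], frame[17] = 40.0, 35.0, 30.0

candidates = strongest_peaks(frame)
for i in candidates:
    print(f"channel {i:2d}: {freqs[i]:7.1f} Hz, {frame[i]:.1f} dB")
```

The candidate list produced this way corresponds to the "six largest local peaks" from which the paper's own rules would then choose three.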