This research studies a spoken-word recognition system composed of three steps: the first extracts the acoustic parameters, the second transforms the acoustic parameters into a string of features, and the third transforms the string of features into a string of characters or other symbols that represents a word or short sentence. Linguistic information is considered to be effective in the third step. In the first two steps, the local peaks in the short-time spectra, analyzed by a filter bank composed of 29 single-peak filters of low selectivity, are treated as the acoustic parameters. Experiments on vowel samples uttered in isolation and in continuous utterance by 31 male adults were carried out to investigate the effectiveness of the local peaks, and these experiments confirmed their usefulness for recognition. Discrimination experiments on the vowels and consonants in 20 Japanese city names uttered by 5 male adults were then carried out using the static properties of the spectral local peaks. The discrimination scores were above 80% except for voiced stops (47%) and for some phonemes described in this paper. For the semivowels, the liquid, the unvoiced fricative /h/, the stop consonants and the choked sound, the dynamic properties play an important part in transforming the speech segments into phonemic symbols. Discrimination experiments on these phonemes were therefore carried out using the changes in the local peaks and the variation over time in the total power of the speech segments, as described in this paper. The speech samples are frequency-analyzed by a filter bank composed of 29 single-peak filters with Q ≈ 6. The center frequencies of the filters are spaced every 1/6 octave from 250 Hz to 6300 Hz.
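As a rough illustration of the filter bank layout just described, the center frequencies can be computed as a geometric series. This is only a sketch: the function names are hypothetical, the exact base-2 spacing is an assumption (the text gives rounded end points), and the bandwidth is derived from the usual definition Q = center frequency / bandwidth, which the abstract does not spell out.

```python
def center_frequencies(f_low=250.0, n_filters=29, steps_per_octave=6):
    """Center frequencies spaced 1/steps_per_octave octave apart (assumed base-2)."""
    return [f_low * 2.0 ** (k / steps_per_octave) for k in range(n_filters)]


def bandwidths(freqs, q=6.0):
    """Bandwidth of each filter, assuming Q = center frequency / bandwidth."""
    return [f / q for f in freqs]


freqs = center_frequencies()
widths = bandwidths(freqs)

# Print the channel index, center frequency and assumed bandwidth in Hz.
for k, (f, b) in enumerate(zip(freqs, widths), start=1):
    print(f"filter {k:2d}: fc = {f:7.1f} Hz, bandwidth = {b:6.1f} Hz")
```

With 29 filters there are 28 steps of 1/6 octave, so the highest computed center frequency is about 6350 Hz, which is consistent with the rounded 6300 Hz quoted in the text.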
Three major spectral local peaks, P1, P2 and P3, are picked every 10 ms from the six largest local peaks of the frequency spectrum obtained with the filter bank, by applying two-peak processing rules. The phonemes were discriminated by use of the changes in these local peaks and the variation over time in the total power of the speech segments. The standard patterns for the discrimination of phonemes were made from the speech samples of 20 city names uttered by 5 male adults. The discrimination scores for the phonemes in these speech samples were as follows: /w/: 65%, /j/: 80%, /r/: 68%, /h/: 60% (in word-initial position) and 87% (in other positions), /p, t, k/: 97% and /Q/: 100%. Using the above-mentioned standard patterns, a discrimination experiment was carried out with 146 other city names, and the following scores were obtained: /w/: 42%, /j/: 71%, /r/: 74% and /h/: 27% (in word-initial position) and 88% (in other positions). These results suggest that this method of feature extraction and transformation into phonemic symbols is effective in the speech recognition system. Some recognition experiments were also carried out. In the first experiment, the 20 city names from which the standard patterns were made were used, and 96% of 100 samples were correctly recognized. In the second experiment, the 20 city names uttered by 3 other male adults were used, and 86% of 60 samples were correctly recognized. The recognition score is expected to increase with improvements in the linguistic processing of the recognition system.
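The candidate-selection part of the peak picking described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: the function name is hypothetical, a local peak is assumed to be an interior channel strictly larger than both neighbours, and the rules that reduce the six candidates to the three major peaks are not given in the abstract, so they are not implemented here.

```python
def largest_local_peaks(spectrum, n_peaks=6):
    """Return channel indices of the n_peaks largest local peaks of one frame.

    Assumption: a local peak is an interior channel whose output is strictly
    greater than that of both neighbouring channels.
    """
    peaks = [
        i
        for i in range(1, len(spectrum) - 1)
        if spectrum[i - 1] < spectrum[i] > spectrum[i + 1]
    ]
    # Keep the largest peaks by amplitude, then report them in frequency order.
    peaks.sort(key=lambda i: spectrum[i], reverse=True)
    return sorted(peaks[:n_peaks])


# Toy frame of filter outputs (made-up values, far fewer than 29 channels).
frame = [0, 3, 1, 5, 2, 4, 0, 1, 0]
candidates = largest_local_peaks(frame)
top_two = largest_local_peaks(frame, n_peaks=2)
```

In the system described, one such frame would be analysed every 10 ms, and the resulting peak tracks together with the total power would supply the dynamic features used for phoneme discrimination.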