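A Study of the Phonemic Information Carried by the Static Properties of Local Peaks in the Speech Spectrum: Extraction of Phonemic Information from Spoken Words Using Spectral Local Peaks (Part 1)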

松岡 孝栄; 城戸 健一

doi:10.20697/jasj.32.1_12

Abstract

Our research on automatic speech recognition proposes a recognition system composed of three steps: first, the extraction of acoustic parameters; second, the transformation of the acoustic parameters into a series of features that distinguish the phoneme in each part of the speech; and third, the transformation of that series of features into a string of characters or other symbols that has linguistic meaning as a word or a short sentence. Linguistic information is considered effective in the third step. In the first and second steps, the local peaks in the short-time spectra, analyzed by a filter bank composed of 29 single-peak filters of low selectivity, are treated as the acoustic parameters. Experiments on many vowel samples uttered in isolation and in continuous utterances by 31 adult male speakers were carried out to investigate how effective the local peaks are as acoustic parameters for recognition, and they verified the usefulness of the local peaks for discriminating vowels. The use of the spectral local peaks rests on a speculation that the local peaks may play a significant role in the processing of the speech signal after frequency analysis by the cochlea, and on an expectation that the variation of the features with time may be handled easily by means of the local peaks. The formant frequencies may, of course, have properties similar to those of the spectral local peaks, but it is unlikely that the formant frequencies are extracted exactly in the auditory organ. According to the results of the investigation of their characteristics, the spectral local peaks are considered sufficient for use in the preprocessor of a speech recognition system that exploits linguistic information, such as a word dictionary.
This paper describes discrimination experiments on the vowels and consonants in the names of twenty Japanese cities uttered by 5 adult male speakers; the standard patterns for discriminating phoneme groups were made from these utterances by use of the static properties of the spectral local peaks. The speech samples are frequency-analyzed by a filter bank composed of 29 single-peak filters of Q ≈ 6. The center frequencies of the filters are spaced at intervals of 1/6 octave from 250 Hz to 6300 Hz. Every 10 ms, three major spectral local peaks, P1, P2 and Pe3, are picked out from the six largest local peaks of the frequency spectrum obtained with the filter bank by applying two peak-processing rules. The frequencies of these local peaks are treated as the acoustic parameters. Each set of acoustic parameters is transformed into a code expressing the phoneme according to the domain on the P1-P2 and P2-Pe3 planes in which the set falls. A series of codes is thus obtained from an utterance. The average score for the recognition of vowels was 80%. The scores for the transformation of the consonant parts into the corresponding phoneme groups were more than 80%, except for voiced plosives (47%). These scores are not lower than those for the discrimination of speech segments by the human auditory sense, and the results are considered sufficient for use in the preprocessor of a speech recognition system that uses linguistic information.
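The front end described above (29 channels spaced 1/6 octave apart from 250 Hz, and the six largest local peaks taken from each 10 ms frame) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the strict-inequality definition of a local peak, and the exclusion of the two edge channels are all assumptions made for illustration. The abstract does not specify the two peak-processing rules or the domain boundaries on the P1-P2 and P2-Pe3 planes, so the sketch stops at the six candidate peaks.

```python
# Illustrative sketch only; names and peak definition are assumptions,
# not taken from the paper.

NUM_FILTERS = 29        # number of filter-bank channels (from the abstract)
F_START_HZ = 250.0      # lowest center frequency (from the abstract)
STEPS_PER_OCTAVE = 6    # 1/6-octave spacing (from the abstract)


def center_frequencies(n=NUM_FILTERS, f0=F_START_HZ, steps=STEPS_PER_OCTAVE):
    """Center frequencies spaced 1/steps octave apart, starting at f0."""
    return [f0 * 2.0 ** (k / steps) for k in range(n)]


def local_peaks(frame):
    """Channel indices whose level strictly exceeds both neighbours.

    Assumption: edge channels are never counted as peaks, and flat
    plateaus are ignored.
    """
    return [
        i
        for i in range(1, len(frame) - 1)
        if frame[i] > frame[i - 1] and frame[i] > frame[i + 1]
    ]


def largest_local_peaks(frame, max_peaks=6):
    """Up to max_peaks local peaks chosen by level, in ascending channel order."""
    by_level = sorted(local_peaks(frame), key=lambda i: frame[i], reverse=True)
    return sorted(by_level[:max_peaks])


if __name__ == "__main__":
    freqs = center_frequencies()
    # Synthetic 29-channel frame with three bumps; real input would be
    # the filter-bank output levels for one 10 ms frame.
    frame = [0.0] * NUM_FILTERS
    frame[3], frame[10], frame[20] = 10.0, 8.0, 5.0
    for i in largest_local_peaks(frame):
        print(f"channel {i}: {freqs[i]:.0f} Hz, level {frame[i]}")
```

With exact 1/6-octave spacing the 29th center frequency comes out near 6350 Hz, which is consistent with the nominal 6300 Hz upper limit given in the abstract.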
