Recognition of Spoken Words Using a Dictionary and Phonological Rules

板橋 秀一; 城戸 健一

doi:10.20697/jasj.27.9_473

Abstract

Speech is not merely a physical phenomenon but also one of the forms in which linguistic events are expressed. It is therefore natural and necessary for automatic speech recognition to take the linguistic aspects of speech into account. Linguistic information is provided by meaning, grammar, the dictionary, the connection rules of phonological units, and so on. The first two have not yet been studied sufficiently and so cannot be used for automatic speech recognition. It therefore seems reasonable to limit the present object of study to the automatic recognition of spoken words. From this standpoint, the authors have studied an automatic spoken word recognition system that uses some of the linguistic rules and a dictionary, as shown in Figs. 3 and 4. The speech signal is digitally filtered into four frequency bands every 10 ms. These bands were chosen by considering the formants of vowels and nasals and the noise components of consonants. The logarithm of the variance of the output of band $M_1$, together with LT, $M_1L$, etc. in Fig. 1, are used as parameters, which are then transformed into distinctive features. Let $X^k_i = \{X^k_{ir}\}_{r=1}^{9}$ denote the parameters obtained every 10 ms that should be categorized as feature plus ($k = +$) or minus ($k = -$), where $i$ indicates the material number ($i = 1, \dots, n$) and $r$ indexes the nine parameters. The nine distinctive features are represented by linear combinations of these parameters, $F(X^k_i) = \sum_{r=1}^{9} C_r X^k_{ir}$. The coefficients $C_r$ are determined so as to maximize the ratio of the variance between the two classes $\{F(X^+_i)\}$ and $\{F(X^-_i)\}$ to the sum of the variances within each class. Phonemes are classified into two groups according to the sign of each of the nine distinctive features, as shown in Tab. 1. The average error rate of feature extraction is 10.5% for 13 words (7 seconds of speech) spoken by a male talker.
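The criterion used to choose the coefficients can be written out explicitly. The block below is a sketch that assumes the criterion is the standard Fisher discriminant ratio, which the abstract describes but does not name. The symbols $m^{k}$, $s_{k}^{2}$, $\mu^{k}$ and $S_W$ are notation introduced here for illustration and do not appear in the source.

```latex
% Assumption: the criterion is read as the standard Fisher discriminant ratio.
% m^{k}: class mean of F;  s_k^2: class variance of F  (k = + or -).
% \mu^{k}: mean parameter vector of class k.
% S_W: sum of the two within-class covariance matrices of the parameters.
\begin{align}
m^{k} &= \frac{1}{n}\sum_{i=1}^{n} F(X^{k}_{i}), \qquad
s_{k}^{2} = \frac{1}{n}\sum_{i=1}^{n}\bigl(F(X^{k}_{i}) - m^{k}\bigr)^{2},
\qquad k \in \{+,-\} \\
J(C) &= \frac{\bigl(m^{+} - m^{-}\bigr)^{2}}{s_{+}^{2} + s_{-}^{2}} \\
C &\propto S_W^{-1}\bigl(\mu^{+} - \mu^{-}\bigr)
\end{align}
```

Under this reading, and provided $S_W$ is invertible, the last line gives the coefficient vector $C = (C_1, \dots, C_9)$ that maximizes $J(C)$ up to an arbitrary scale factor. The sign of $F$, after subtracting a suitable threshold, then assigns a frame to the plus or minus class of that feature.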
The series of values of the nine distinctive features is segmented first with reference to a certain amount of change in feature value, and second by applying rules that depend on the result of the primary segmentation, the context, the duration of the segment, and the phoneme connection rules. The input feature matrix is made from the representative features of each segment. Meanwhile, each item of the dictionary of 54 words, which is represented as a series of phonemes, is transformed into a series of features, which is then transformed into a standard feature matrix by applying phonological rules such as devocalization. The distance between the input and standard feature matrices is calculated for each item of the dictionary, and the item at minimum distance from the input is taken as the recognized output (see Fig. 3). According to our experiments, the recognition rate for 13 words spoken by a male talker is 42.0% with the segmentation rule only, 59.5% with the segmentation and phoneme connection rules, and 92.3% with the dictionary in addition to those rules. Of 53 words spoken by the same talker, 79.2% are recognized correctly. Next, we examined the performance of the recognition system equipped with a duration dictionary, which contains the typical duration of the phonemes in each word (see Fig. 4). The segmentation is performed according to each item of the duration dictionary; the item at minimum distance from the input feature matrix is taken as the recognized output. With this system, 92.3% of 52 words uttered by the same talker as above (the talker used for the standard durations) are recognized correctly. The average recognition rate for 10 words spoken by each of another nine male talkers is 70.0%. These results make clear the effectiveness of using a word dictionary and some of the linguistic rules for automatic spoken word recognition.
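The minimum-distance matching step can be illustrated with a minimal sketch. The abstract does not define the distance measure, so the sign-mismatch count used here is an assumption, as are the function names `matrix_distance` and `recognize`, the toy dictionary, and the use of three features per segment instead of nine.

```python
# Illustrative sketch only: the feature values, the toy dictionary, and the
# distance measure below are assumptions, not the authors' definitions.

def matrix_distance(input_matrix, standard_matrix):
    """Count mismatched feature signs; each unmatched segment counts as a full mismatch."""
    n_features = len(input_matrix[0])
    # zip stops at the shorter sequence, so only aligned segments are compared.
    mismatches = sum(
        a != b
        for seg_in, seg_std in zip(input_matrix, standard_matrix)
        for a, b in zip(seg_in, seg_std)
    )
    length_penalty = abs(len(input_matrix) - len(standard_matrix)) * n_features
    return mismatches + length_penalty


def recognize(input_matrix, dictionary):
    """Return the dictionary item whose standard matrix is nearest to the input."""
    return min(dictionary, key=lambda word: matrix_distance(input_matrix, dictionary[word]))


# Hypothetical dictionary: each word is a sequence of segments, and each
# segment is a tuple of feature signs (+1 or -1).
toy_dictionary = {
    "word_a": [(+1, -1, -1), (-1, +1, -1), (+1, +1, -1)],
    "word_b": [(-1, -1, +1), (-1, +1, +1), (+1, -1, +1)],
    "word_c": [(+1, +1, +1), (-1, -1, -1)],
}

# Input that differs from word_a by a single feature sign in the second segment.
observed = [(+1, -1, -1), (-1, +1, +1), (+1, +1, -1)]
best = recognize(observed, toy_dictionary)
```

Because every dictionary item is scored and the nearest one is always returned, a single feature extraction error need not cause a word error. This is consistent with the reported gain in recognition rate once the dictionary is added to the segmentation and phoneme connection rules.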
