日本音響学会誌
Online ISSN : 2432-2040
Print ISSN : 0369-4232
Volume 34, Issue 3
Showing 1-11 of 11 articles from the selected issue
  • 藤崎 博也
    Article type: Article
    1978, Volume 34, Issue 3, pp. 117-121
    Published: 1978/03/01
    Released online: 2017/06/02
    Journal Free Access
  • 石坂 謙三, フラナガン ジェームズ L.
    Article type: Article
    1978, Volume 34, Issue 3, pp. 122-131
    Published: 1978/03/01
    Released online: 2017/06/02
    Journal Free Access
    In earlier work, we derived a dynamic model of vocal-cord vibration in which a single vocal cord is described by two mechanical resonators coupled by a stiffness, the so-called two-mass model of the vocal cords. This simplified model reveals the essential features of the self-exciting oscillation mechanism of vocal-cord vibration and duplicates the principal features of vocal-cord behavior in the human. In the original two-mass model, we estimated that the longitudinal component of vocal-cord motion has only a secondary influence upon the glottal flow and hence upon the self-exciting oscillation mechanism; the longitudinal motion was therefore neglected. In the present study we examined this earlier estimate more rigorously. We modified the two-mass model to include an additional longitudinal motion parallel to the direction of the glottal flow. The formulation also involves the rate of air volume displaced by the vibrating masses. Computer simulation was then carried out on the dynamic two-dimensional motion of the vocal-cord masses, as shown in Fig. 5. This motion corresponds with observations made on natural larynges. According to the results, the longitudinal component of displacement influences the oscillation frequency only slightly (less than one Hz) and hardly contributes to the realistic behavior of the glottal opening. We therefore conclude that the longitudinal motion is not essential for the realistic self-exciting oscillation of the vocal cords. The dynamic model of the vocal cords and vocal tract can generate synthetic speech with high naturalness. Voice quality and the prosodic features of speech depend strongly upon the acoustic properties of the glottal excitation source. The acoustic properties of the glottal flow, U_g, and the resulting synthetic vowels /e/ and /a/ are shown in Figs. 6 and 8, respectively. In this self-oscillating model, the conventional assumption of linear separability of the sound source and the vocal tract is not made. To indicate the influence of the coupling between them, the acoustic properties of the glottal flow without coupling, U^*_g, and the resulting synthetic vowels are also shown in the figures. Fig. 7 shows the difference between the waveforms of U_g and U^*_g, plotted on a 10-times enlarged scale. However, stronger coupling usually occurs in consonants, in which the constriction in the vocal tract is much smaller than that for vowels, and the vocal-cord behavior can be substantially influenced through the interaction. The vocal-cord model can intrinsically produce the intricacy of natural behavior from relatively simple, physiologically based parameters, namely subglottal pressure, rest area of the glottal opening, vocal-cord tension, and vocal-tract shape. Some of these behaviors are demonstrated by examples of the synthesis of vowel-consonant-vowel syllables; such an example is shown in Fig. 9 for /epa/. Finally, we describe the automatic generation of glottal turbulent noise and glottal stops with the vocal-cord model, without any additional control parameter.
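    The mechanical core of such a model is a pair of coupled mass-spring-damper resonators. The following is a minimal sketch of that structural skeleton only, in Python, with hypothetical parameter values and a constant force standing in for the aerodynamic driving force; it deliberately omits the nonlinear glottal-flow coupling and collision terms that produce the actual self-excited oscillation.

      import numpy as np

      # Hypothetical lumped parameters (illustrative values, cgs units)
      m1, m2 = 0.125, 0.025      # masses [g]
      k1, k2 = 80e3, 8e3         # stiffnesses [dyn/cm]
      r1, r2 = 20.0, 20.0        # damping constants [dyn*s/cm]
      kc = 25e3                  # coupling stiffness between the two masses [dyn/cm]
      F1, F2 = 2000.0, 0.0       # constant stand-in for the aerodynamic force [dyn]

      dt, T = 1e-5, 0.03         # time step and duration [s]
      n = int(T / dt)
      x1 = x2 = v1 = v2 = 0.0
      traj = np.zeros((n, 2))

      for i in range(n):
          # Coupled equations of motion of the two resonators
          a1 = (-k1 * x1 - r1 * v1 - kc * (x1 - x2) + F1) / m1
          a2 = (-k2 * x2 - r2 * v2 - kc * (x2 - x1) + F2) / m2
          v1 += a1 * dt
          v2 += a2 * dt
          x1 += v1 * dt
          x2 += v2 * dt
          traj[i] = (x1, x2)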
  • 桐谷 滋
    Article type: Article
    1978, Volume 34, Issue 3, pp. 132-139
    Published: 1978/03/01
    Released online: 2017/06/02
    Journal Free Access
    Articulatory movements in the production of VCV and CVC sequences in Japanese were observed by the x-ray microbeam method, and the contextual variations in the articulations of the medial vowels and consonants were examined. The speech materials were meaningless words of the types /C_1VC_2ae/ (V = i, e, a, o, u; C_1, C_2 = m, t, k, s) and /mV_1CV_2ai/ (V_1, V_2 = i, e, a, o, u; C = m, t, k). It was observed that perturbations of vowel articulations by the consonantal context resulted in a large overlap in the range of variation of tongue configurations between the vowels /a/ and /o/. The range of variation of the vowel /u/ also partly overlapped with those of /a/ and /o/. However, the differences in the positions of the jaw and lips were consistently maintained among these vowels. For the vowels /i/ and /e/, there was a consistent difference in the position of the tongue blade. The distinction between the front vowels and the other vowels was always clearly observed. It was also observed that, in the CVC sequences, the perturbations of vowel articulations by the preceding consonants were greater than those by the following consonants, regardless of the type of vowel. In the VCV sequences, the variations in consonant articulations caused by the following vowels were greater than those caused by the preceding vowels. This asymmetric effect was clearly observed in the case of /k/ but was smaller for the other consonants.
  • 橋本 清, 谷本 益巳
    Article type: Article
    1978, Volume 34, Issue 3, pp. 140-148
    Published: 1978/03/01
    Released online: 2017/06/02
    Journal Free Access
    The representation of tongue shape with a quadratic-curve model was studied. The side profile of the tongue is given by the coordinate system in Fig. 1. The procedure for fitting a quadratic curve to the data points by means of the singular value decomposition technique is expressed in Eqs. (1), (2), (3), (4), and (5). This is the adaptive version of the quadratic-curve model, and most data can be approximated with it with considerable accuracy. However, with this model the accuracy declines in the case of such consonants as /t/, /d/, /n/, /s/, or /z/. Their features can instead be represented as the distortion of the tongue shape from a shape-fixed quadratic curve. Fig. 2 shows the distributions of the parameters of the adaptive quadratic curve, and the shape-fixed quadratic curve is defined as the curve with the average values of these parameters. The shape-fixed curve is fitted to the data of these consonants by means of the variable metric method illustrated in Fig. 4; Fig. 5 shows an example of the conversion. Each residual can then be calculated as the deformation from the curve. Fig. 6 shows two typical examples of the residuals successively calculated in the respective frames of /t/ and /d/. Fig. 7 shows the eigenvectors and the loci of the components in the a_1-a_2 space obtained from the principal component analysis of the tongue-blade residuals. Similar results are obtained from the analysis of the tongue-root residuals and from the combined analysis of the blade and root residuals, as shown in Figs. 8 and 9, respectively. From these results it is seen that the main part of the combined residuals of tongue blade and tongue root is accounted for by the blade residuals. The cross-correlation coefficients of seven representative points of the tongue blade with all other points of the tongue surface are plotted in Fig. 10, and the canonical correlation coefficients between blade and root are shown in Fig. 11. These two figures are interpreted as showing the close interrelation between the blade and root of the tongue. Next, principal component analysis was applied to the 14-dimensional tongue-surface vector defined by the coordinate system in Fig. 1, and the resulting eigenvectors are shown in Fig. 12. Fig. 13 shows the loci of the principal components of the tongue vector in contrast to the loci of the center of curvature of the shape-fixed curve; comparison of the two makes the underlying relationship between them obvious. Finally, the accuracy of these tongue models was compared in terms of S/D ratios and contribution rates (%), and the results are shown in Table 1. Fig. 14 shows typical examples of tongue profiles approximated by the various versions of the quadratic-curve model. From a practical point of view, the identification of the quadratic curves by acoustic analysis remains a subject for future research.
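    As an illustration of the kind of fitting described above, the sketch below fits a general quadratic (conic) curve a*x^2 + b*xy + c*y^2 + d*x + e*y + f = 0 to measured profile points by taking the null-space direction of the design matrix via singular value decomposition. The parameterization of the paper's Eqs. (1)-(5) is not reproduced here; this is only the generic SVD-based conic fit, with made-up data.

      import numpy as np

      def fit_quadratic_curve(x, y):
          """Fit a*x^2 + b*x*y + c*y^2 + d*x + e*y + f = 0 to points (x, y).

          Returns the coefficient vector [a, b, c, d, e, f] with unit norm.
          """
          D = np.column_stack([x**2, x * y, y**2, x, y, np.ones_like(x)])
          # The best-fitting coefficients (least squares, subject to unit norm)
          # are the right singular vector of D with the smallest singular value.
          _, _, vt = np.linalg.svd(D)
          return vt[-1]

      # Example with synthetic "tongue profile" points lying near a parabola-like arc
      x = np.linspace(-3.0, 3.0, 15)
      y = 0.4 * x**2 + 1.0 + 0.05 * np.random.randn(x.size)
      coef = fit_quadratic_curve(x, y)
      print(coef / coef[0])   # normalize so the x^2 coefficient is 1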
  • 鈴木 誠史
    Article type: Article
    1978, Volume 34, Issue 3, pp. 149-156
    Published: 1978/03/01
    Released online: 2017/06/02
    Journal Free Access

    It is considered that the wall vibration of the vocal tract, i.e., the wall impedance z_w = r_s + jωl_s in an equivalent circuit, contributes to the closed-tract resonance frequency F_{w0}, the formant bandwidths B_i, and the formant frequencies F_i (i = 1, 2, …). However, the value of z_w has not yet been settled, and z_w has frequently been simplified or omitted in the transformation between the cross-sectional area function of the vocal tract and the speech signal. The reason is that z_w had not been obtained by direct measurement, and values of z_w had been chosen to fit only one or two physical features of speech. Table 1 shows the values of z_w proposed or employed in previous works. The purpose of this article is to clarify the contribution of z_w to the physical features of speech and to obtain a reasonable z_w by a simple procedure. The speech production system is illustrated in Fig. 1. The vocal tract is divided into acoustic tubes of equal length, each represented by the equivalent circuit illustrated in Fig. 2 and Eq. (2). In the calculation of the transfer function of the vocal tract from the equivalent circuit, it is assumed that the glottis is closed; F_i is determined from the frequency at which the phase alternates. Seventeen area functions are prepared for the estimation of F_{i,p} and B_{i,p} (p = 1 or 3.9 atm). r_s is varied from 0 to 10000, and l_s from 0.2 to 3.8. Fig. 3 shows the relation between z_w and F_{w0} calculated as the resonance frequency of a Helmholtz resonator. The relation between B_{1,1} and z_w for a uniform tube is shown in Fig. 4, and Fig. 5 shows the same relation at 3.9 atm. dF, defined by Eq. (4), is the upward shift rate of the first formant frequency relative to the lossless vocal tract; Fig. 6 shows the relation between dF and z_w for three area functions. F_{i,p}, the formant frequency of speech uttered under p atm, is shifted upward in comparison with F_{i,1}, that in normal air, as represented in Eq. (5). This equation fits well with an experimental result by the authors, which shows that F_{w0} is 195 Hz. ΔF, the difference between F_{1,p} and F_{1,1}, is calculated for a uniform tube having various z_w and is shown in Fig. 7. The ranges of r_s and l_s can be conjectured from Figs. 3, 4, 5 and 7, but it is almost impossible to determine a reasonable z_w from them alone. F_{1,3.9} calculated by Eq. (5) with F_{w0} = 195 Hz is therefore compared with F_{1,3.9} calculated from the seventeen area functions with various z_w. The difference is evaluated by the mean square error, and z_w = 1400 + jω1.6 gives the least error (this z_w is called z_{ws} hereinafter). Fig. 8 shows the relation between F_{1,3.9} and F_{1,1} calculated by Eq. (5) and from the area functions with z_{ws}. z_{ws} is comparatively close to the z_w measured directly by Ishizaka et al. (see Table 1). On the other hand, the sweep-tone method shows that F_{w0} is in the range between 150 and 200 Hz; applying z_{ws} to the area function drawn in Fig. 3, F_{w0} becomes 177 Hz. The bandwidths B_i calculated from the seventeen area functions with z_{ws} are shown in Fig. 10, and B_1 in this figure fits well with the bandwidth obtained by the sweep-tone method. Table 3 shows F_i and B_i (i = 1, 2, 3) calculated from the area functions drawn in Fig. 10, using two kinds of z_w. It indicates that if the historical value z_w = 6500 + jω0.4 is used, a low F_1 and a wide B_1 are obtained. Fig. 11 shows the formant pattern as a function of the position of the constriction in a uniform tube, for two kinds of z_w. This figure suggests that an inadequate z_w will give the wrong place of articulation in the transformation from the speech signal to the area function. It is conclusively said that

    (View PDF for the rest of the abstract.)
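    A common simplified account of the closed-tract resonance mentioned above treats the air in the closed tract as a compliance resonating against the mass of the yielding walls, in the manner of a Helmholtz resonator. The sketch below computes that resonance for a uniform tube with assumed (not the paper's) geometry, in cgs units; it is only meant to show the order of magnitude of F_{w0} for wall mass values like those listed in Table 1.

      import math

      def closed_tract_wall_resonance(area_cm2, length_cm, l_s, rho=1.14e-3, c=3.5e4):
          """Lowest resonance of a hard-terminated uniform tube with yielding walls.

          area_cm2 : cross-sectional area of the tube [cm^2]
          length_cm: tube length [cm]
          l_s      : wall mass per unit wall area [g/cm^2] (the l_s of z_w = r_s + j*w*l_s)
          rho, c   : air density [g/cm^3] and sound speed [cm/s]
          """
          perimeter = 2.0 * math.sqrt(math.pi * area_cm2)   # circular cross-section assumed
          wall_area = perimeter * length_cm                 # vibrating wall surface
          volume = area_cm2 * length_cm
          # Acoustic compliance of the enclosed air and acoustic mass of the walls
          c_air = volume / (rho * c**2)
          m_wall = l_s / wall_area
          return 1.0 / (2.0 * math.pi * math.sqrt(m_wall * c_air))

      # With an assumed 17-cm tube of 5 cm^2 area and l_s = 1.6 g/cm^2, F_w0 comes
      # out near 190 Hz, comparable to the 150-200 Hz range cited above.
      print(closed_tract_wall_resonance(5.0, 17.0, 1.6))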

  • 中島 隆之, 鈴木 虎三, 大村 浩, 石崎 俊, 田中 和世
    Article type: Article
    1978, Volume 34, Issue 3, pp. 157-166
    Published: 1978/03/01
    Released online: 2017/06/02
    Journal Free Access
    Disregarding the nasal tract, the vocal organ in speech production is regarded as a tube passing from the lungs to the lips (see Fig. 1). From the assumption that the most remarkable loss appears in the glottal portion, the total vocal-tract loss is represented by means of a lossless, infinitely long, uniform acoustic tube below the glottis. The speech production process in the vocal tract can be simulated by Kelly's ladder-form circuit, as shown in Fig. 2 (a). Following Itakura (1971) and Wakita (1972), the partial autocorrelation coefficients k are extracted with the self-control system shown in Fig. 3 (a). Figure 2 (a) can be transformed into the equivalent circuit shown in Fig. 3 (b), neglecting the loss near the lip portion (r_0 → -1). Comparing Fig. 3 (a) with Fig. 3 (b), it is clear that the k-parameter extraction process corresponds formally to an inverse tracing of the speech production process. To confirm this relation, synthesized speech generated with a given vocal-tract shape and an impulse-train excitation as the voice source was analyzed. By matching the partial autocorrelation coefficients to the reflection coefficients r_i (i = 1, 2, …) counted from the lip side, the reflection coefficients are converted into area functions, as shown in Fig. 4. From the experiments, it was concluded that the vocal-tract shape can be estimated perfectly by this method, except when the vocal-tract resonance is quite sharp compared with that of actual speech (that is, when the loss at the glottis is extremely small). The next problem is how to separate the vocal-tract impulse response from the speech wave. Two hypotheses were adopted for the separation. The first is that, since the gross frequency transmission characteristics of the vocal tract are flat, the gross gradient and bending of the speech spectrum are due to the glottal wave and the radiation characteristics. The second is that the power spectrum of the glottal wave, including the radiation characteristics, is smooth and has no sharp resonance. Figure 6 shows the proposed model of the vocal-cord wave (including radiation characteristics), with unknown parameters ε_i (i = 1, …, 5) (Nakajima and Suzuki, 1976). The unknown parameters of this model are estimated from the speech wave by the following technique: for example, the parameter of the second-order critically damped system corresponding to the inverse of the first stage in Fig. 6 is calculated from the first and second delayed autocorrelation coefficients of the speech wave (Eqs. (1)-(4)). When the power spectrum of the sound source and radiation characteristics is expressed with this model, the vocal-tract impulse response is extracted by inverse filtering with the estimated vocal-cord wave model, and its gross power spectrum is assured to be flat; the pole frequencies and bandwidths are not affected. The principle of this method is illustrated in Fig. 5. Experimental results on natural speech by an adult man and a child are shown in Figs. 7 and 8, respectively. In Section 5, an adaptive speech analysis system is described, which automatically selects suitable analysis methods on the basis of a voiced/unvoiced/plosive decision on the input speech wave. For voiced sounds, the vocal-tract shape is estimated. For unvoiced sounds, the acoustic-tube shape equivalent to the power spectrum of the LPC analysis is obtained. For plosive sounds, a shorter analysis window and frame interval than usual are used. Finally, examples of the analyzed results are illustrated (see Fig. 10). It is shown that the system is useful for observing speech from both the power-spectrum and the articulatory domain, and that the obtained patterns are useful for automatic speech recognition.
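    A minimal sketch of the reflection-coefficient-to-area conversion described above: PARCOR coefficients are computed from one frame of speech by the Levinson-Durbin recursion and then turned into section areas of an acoustic tube, starting from an arbitrary lip-side area. The sign convention relating the k parameters to r_i differs between formulations, so the sign used here (and the unit lip area) is an assumption of this sketch, not a statement of the paper's convention.

      import numpy as np

      def parcor(frame, order):
          """PARCOR (reflection) coefficients via the Levinson-Durbin recursion."""
          r = np.array([np.dot(frame[:len(frame) - i], frame[i:]) for i in range(order + 1)])
          a = np.zeros(order + 1)
          a[0] = 1.0
          e = r[0]
          k = np.zeros(order)
          for m in range(1, order + 1):
              acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
              k[m - 1] = -acc / e
              a[1:m + 1] += k[m - 1] * a[m - 1::-1][:m]
              e *= (1.0 - k[m - 1] ** 2)
          return k

      def areas_from_reflection(k, lip_area=1.0):
          """Convert reflection coefficients into tube-section areas, lips -> glottis."""
          areas = [lip_area]
          for ki in k:
              areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))   # sign convention assumed
          return np.array(areas)

      # Usage: one windowed frame of speech (here a random stand-in for a real frame)
      frame = np.random.randn(300)
      k = parcor(frame * np.hamming(300), order=10)
      print(areas_from_reflection(k))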
  • 藤崎 博也, 杉藤 美代子
    Article type: Article
    1978, Volume 34, Issue 3, pp. 167-176
    Published: 1978/03/01
    Released online: 2017/06/02
    Journal Free Access
    For the purpose of elucidating the relationship between word accent types and the contours of fundamental frequency (F_0 contours), a model has been presented by one of the authors for the process of generating an F_0 contour from "voicing" and "accent" commands, and has been applied to analyze the F_0 contours of the word accent types of the Tokyo dialect. The present study was conducted to test the model's validity for the Kinki dialect, which possesses accent types not found in the Tokyo dialect (Table 1), and also to examine the perceptual significance of the parameters of the model. The speech materials were utterances of the two-mora word [ame] pronounced in all four accent types of the Osaka dialect (Table 2) by a male informant. Extraction of the F_0 contours (Fig. 1) and of their parameters was carried out with a digital computer. Using a functional model for generating the F_0 contour (Figs. 2 and 3), parameters were extracted from six utterances of each accent type by finding the best match between the observed and generated F_0 contours (Table 3). The close agreement between the observed and generated contours proved the model's validity for the Kinki dialect (Fig. 4). While the magnitude and rate of the responses to the voicing and accent commands are considered to characterize the laryngeal functions of a speaker, the timing parameters of the accent command, i.e., its onset and end, are found to be specific to each accent type and can clearly separate the four accent types (Fig. 5). The perceptual relevance of these timing parameters was examined by identification tests of accent types using 40 synthetic speech stimuli consisting of both typical stimuli of the four accent types and intermediate stimuli, generated by systematically varying the timing parameters of the accent command. The subjects were 10 speakers of the Osaka dialect and two speakers of the Tokyo dialect. The perceptual boundary between two accent types was determined for each subject (Fig. 6); it was quite clear-cut and almost coincided across all subjects (Fig. 7), indicating the perceptual importance of these timing parameters in the identification of accent types. Further experiments using stimuli with systematic shifts in the timing of the formant frequency patterns indicated that the relative timing of the accent command and the segmental features of a particular phoneme is quite important for the identification of a specific accent type (Figs. 8 and 9), but not necessarily for the other types. These results indicate that the perception of word accent requires the specification of certain features over temporal units smaller than the mora, which is commonly accepted as the suprasegmental unit of spoken Japanese.
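    The command-response formulation widely associated with this model generates ln F_0(t) as a baseline value plus the responses of two smoothing mechanisms to the voicing (phrase) and accent commands. The sketch below uses the commonly published response functions and entirely hypothetical parameter values; the exact response functions and parameter names of the 1978 formulation may differ.

      import numpy as np

      def phrase_response(t, alpha=3.0):
          """Response of the phrase (voicing) control mechanism to an impulse at t = 0."""
          return np.where(t >= 0.0, alpha**2 * t * np.exp(-alpha * t), 0.0)

      def accent_response(t, beta=20.0, gamma=0.9):
          """Response of the accent control mechanism to a step at t = 0."""
          return np.where(t >= 0.0,
                          np.minimum(1.0 - (1.0 + beta * t) * np.exp(-beta * t), gamma),
                          0.0)

      def f0_contour(t, fb, phrase_cmds, accent_cmds):
          """phrase_cmds: list of (onset, magnitude); accent_cmds: list of (onset, end, magnitude)."""
          ln_f0 = np.full_like(t, np.log(fb))
          for t0, ap in phrase_cmds:
              ln_f0 += ap * phrase_response(t - t0)
          for t1, t2, aa in accent_cmds:
              ln_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
          return np.exp(ln_f0)

      # Hypothetical two-mora utterance: one voicing command and one accent command
      t = np.linspace(0.0, 0.6, 601)
      f0 = f0_contour(t, fb=90.0, phrase_cmds=[(-0.05, 0.5)], accent_cmds=[(0.10, 0.30, 0.4)])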
  • 佐藤 泰雄, 藤崎 博也
    Article type: Article
    1978, Volume 34, Issue 3, pp. 177-185
    Published: 1978/03/01
    Released online: 2017/06/02
    Journal Free Access
    In order to realize reliable automatic recognition of phonemes in connected speech, effective means are required to cope with the variations in their acoustic characteristics due to the idiosyncrasy of speakers and to coarticulation. This paper describes a new scheme for carrying out the segmentation and recognition of connected vowels and semivowels, based on a speaker-adaptive model of the coarticulatory process. The process of coarticulation between adjoining phonemes in connected vowels can be modeled in the domain of formant frequencies by a smoothing system which converts the stepwise-varying target values corresponding to the successive vowels into the actual formant trajectory (Fig. 1). For the characteristics of this system, those of a critically damped second-order linear system are generally valid, as shown by the example of the word /ie/ (Fig. 2), but further elaborations, taking the continuity and coupling of resonance modes into consideration, are required in the case of combinations of front and back vowels, as shown by the example of the word /ai/ (Fig. 3). As its input, the proposed scheme (Fig. 4) uses the trajectories of the first three formant frequencies, extracted pitch-synchronously from the short-term frequency spectra of speech and converted to sample values at uniform intervals by interpolation. Since highly accurate recognition of initial vowels is possible by established techniques for the recognition of sustained vowels, their formant frequencies can be used to estimate the target values of the other vowels of the same speaker. The estimation is based on the average relationships found among the formant frequencies of all five vowels of many speakers, and by this estimation the coarticulatory model can be adapted to an arbitrary speaker. The model can then be used for determining the underlying targets from the observed formant trajectories by the method of analysis-by-synthesis, thereby accomplishing successive segmentation and recognition of each phoneme in connected vowels. The validity of the scheme was proved by an overall correct recognition rate of 98.7% (Table 1) for a total of 445 utterances consisting of vowel dyads, triads, and quadruplets by three male speakers. The scheme can be extended to the recognition of semivowels. It has been found that the formant targets of the semivowels /j/ and /w/ are quite close to those of the vowels /i/ and /u/, respectively, but that their command durations are significantly different (Fig. 7). The utilization of speech-rate information, represented by the command duration of the immediately following vowel, is necessary for the accurate separation of /j/, /i/, and /ij/ when the speech rate varies over a wide range (Fig. 8). If the speech-rate information is given, the rate of correct recognition of these categories is 97.5% for a total of 270 utterances of 15 words containing semivowels, vowels, and vowel-semivowel combinations in the same context.
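    The smoothing system described above can be illustrated by convolving a stepwise target trajectory with the unit-area impulse response of a critically damped second-order system. The sketch below does only that for a single formant; the target values, time constant, and segment durations are hypothetical, and the refinements needed for front-back vowel combinations are not modeled.

      import numpy as np

      def smooth_targets(targets_hz, durations_s, a=60.0, fs=1000.0):
          """Convolve a stepwise formant-target function with h(t) = a^2 * t * exp(-a*t).

          targets_hz : target formant frequency of each successive vowel [Hz]
          durations_s: duration of each target segment [s]
          a          : reciprocal time constant of the critically damped system [1/s]
          fs         : sampling rate of the trajectory [Hz]
          """
          # Stepwise target function
          step = np.concatenate([np.full(int(d * fs), f) for f, d in zip(targets_hz, durations_s)])
          # Impulse response of the critically damped second-order system (unit area)
          t = np.arange(0, 0.2, 1.0 / fs)
          h = a**2 * t * np.exp(-a * t) / fs
          # Pre-pad with the first target so the system starts settled at it
          pad = len(t)
          padded = np.concatenate([np.full(pad, step[0]), step])
          traj = np.convolve(padded, h)[pad:pad + len(step)]
          return step, traj

      # Hypothetical three-vowel sequence of first-formant targets
      step, traj = smooth_targets([750.0, 300.0, 350.0], [0.15, 0.15, 0.15])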
  • 三輪 譲二, 新津 善弘, 牧野 正三, 城戸 健一
    Article type: Article
    1978, Volume 34, Issue 3, pp. 186-193
    Published: 1978/03/01
    Released online: 2017/06/02
    Journal Free Access
    This paper describes the outline and performance of a newly developed spoken word recognition system. In the system, the spectral local peaks and gross parameters of the speech spectrum are utilized for phoneme recognition, and a word dictionary written in phonemic symbols is used for the last step of word recognition. The use of spectral local peaks and of the word dictionary is based on previously proposed ideas [1] and experiments [2], and the use of the gross parameters of the spectrum is newly added to improve segmentation and phoneme recognition. The schematic diagram of the system is shown in Fig. 1. The input spoken word is first frequency-analyzed by a 29-channel filter bank composed of single-tuned filters of Q = 6, whose center frequencies are arranged every 1/6 octave from 250 Hz to 6300 Hz. A least-squares fit line is computed from the logarithmic analyzed spectrum every 10 ms, and the modified spectrum is computed as the difference between the analyzed spectrum and the fit line. By using the fit line, differences in the slope of the speech spectrum caused by the individuality of speakers can be neglected. From the modified speech spectrum, three major local peaks are extracted, and the newly defined acoustical parameters V, G and H, which express the gross pattern of the spectrum, are computed. The power W is also computed from the original analyzed spectrum. The smoothed versions of W, V, G and H are named W_s, V_s, G_s and H_s. Consonant segments are extracted from the speech by using the dynamic characteristics of V_s, W_s and V. In the consonant segments, consonants or consonant groups are recognized by using the peaks and the power. Nasals are recognized by using the peaks, independently of the consonant segments. Semivowels are recognized by using the dynamic characteristics of G_s and H_s, and the peaks. If several phonemes are recognized in the same segment, only one phoneme is retained according to priority. Vowels are recognized in the remaining segments by using the peaks. Fig. 4 shows an example of phoneme recognition. A phonemic sequence is constructed from the results of the phoneme recognition, and some errors in the sequence are corrected by using phoneme connection rules. The similarity of the phonemic sequence to every item of the word dictionary is computed by a dynamic programming algorithm that takes into account the probabilities of added, omitted and erroneous phonemes. The dictionary item having the maximum similarity to the sequence is chosen as the output of the word recognition. Recognition experiments were carried out with the system. In the experiments, one item of the word dictionary corresponded to one word and was written in an orthographic form easily converted from Japanese kana letters by simple rules. Table 4 shows the scores of the experiments. The score of the word recognition was found to be 83.2% for 166 city names uttered by 15 male speakers. In experiments using a smaller number of words, the scores were found to be 93.6% and 96.3% for 51 and 20 city names, respectively.
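    The normalization step described above can be sketched as follows: fit a straight line to the 29-channel log spectrum by least squares, subtract it to obtain the modified spectrum, and take the three largest local maxima as the spectral local peaks. The channel layout is taken from the abstract; everything else (the input values and the peak-picking rule) is only an assumption of this sketch.

      import numpy as np

      def local_peaks_of_modified_spectrum(log_spectrum_db, n_peaks=3):
          """Remove the least-squares spectral slope and return the major local peaks.

          log_spectrum_db: 29 filter-bank outputs in dB (channels every 1/6 octave,
                           250 Hz to 6300 Hz), one 10-ms frame.
          """
          ch = np.arange(len(log_spectrum_db))
          slope, intercept = np.polyfit(ch, log_spectrum_db, 1)   # least-squares fit line
          modified = log_spectrum_db - (slope * ch + intercept)   # modified spectrum
          # Local maxima: channels higher than both neighbours
          is_peak = (modified[1:-1] > modified[:-2]) & (modified[1:-1] > modified[2:])
          peak_channels = np.where(is_peak)[0] + 1
          # Keep the n_peaks largest local maxima, returned in channel order
          top = peak_channels[np.argsort(modified[peak_channels])[::-1][:n_peaks]]
          return np.sort(top), modified

      # Usage with a made-up frame of 29 dB values
      frame = 30.0 - 0.5 * np.arange(29) + 6.0 * np.sin(np.arange(29) / 2.5)
      peaks, modified = local_peaks_of_modified_spectrum(frame)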
  • 好田 正紀, 中津 良平, 鹿野 清宏, 伊藤 憲三
    Article type: Article
    1978, Volume 34, Issue 3, pp. 194-203
    Published: 1978/03/01
    Released online: 2017/06/02
    Journal Free Access
    Recently, research on Speech Understanding Systems (SUS) has attracted great interest as a new approach to continuous speech recognition. The concept of an SUS has the following three features. (1) The contents of the conversation are restricted to some defined area. (2) Emphasis is placed on understanding the meaning and content of the input speech rather than on recognizing each word or phrase. (3) The recognition of the input speech is performed through question-answering between a computer and a user. This paper describes the SUS which the authors studied from 1974 to 1976 and which can operate in on-line mode. The task to be performed with the system is a reservation service for train seats, covering 28 stations and 181 trains. Table 2 shows the seven items of a reservation. The vocabulary of the input speech consists of 112 words. The system consists of three parts, as shown in Fig. 1: the acoustic processor, the linguistic processor and the audio response unit. Figure 2 illustrates the computer system on which the question-answering system is implemented. The acoustic processor and the audio response unit are implemented on a NEAC 3200/70, and the linguistic processor on a PF U-400. The use of high-speed speech processors connected to the NEAC 3200/70 and high-speed data transmission between these computers make on-line processing possible. The detailed construction of the system is shown in Fig. 3. In the acoustic processor, feature extraction and phoneme recognition are executed, and the results are represented in the form of a phoneme lattice. In the linguistic processor, the meaning and content of the input speech are grasped through word recognition, syntactic analysis and inference; then, corresponding to the recognition results, the sentences for the response are composed. The audio response unit synthesizes these sentences as the response to the user. Input speech to the system must have short pauses of more than 0.5 sec between adjacent phrases, but apart from this constraint a user may speak freely to the system without being restricted by the order of the reservation items or by the grammar. A model of conversation was prepared so that the computer and the user can carry out smooth and natural question-answering. Table 3 shows the seven states of the conversation model, for each of which particular response sentences are prepared, and Figure 4 shows the transitions among these states. Inference using the timetable is executed during the transitions among states, which is useful for reducing the number of question-answering cycles. The output speech from the system is synthesized using words or phrases as units; for this purpose, 23 kinds of sentence patterns and 460 kinds of words or phrases to be inserted into these sentences are prepared. The performance of the system was tested by on-line question-answering experiments. Eight male speakers tried to make 320 seat reservations in total (40 reservations per speaker), and 99.1% of all the reservations were successfully completed. The average number of question-answering cycles needed to complete a reservation, excluding the first input, was 3.21. The detailed analysis of the contents of the question-answering is shown in Table 5, which reveals that the number of re-inputs due to rejection or misrecognition was small. These results show that the system operates fairly well in the on-line question-answering mode. The average time for acoustic and linguistic processing is 5.0 times real time. Figure 6 shows an example of the time chart of the processing.
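    The conversation model described above can be pictured as a loop that keeps track of which of the seven reservation items have been filled and keeps asking for the missing ones until the reservation is complete. The sketch below is a deliberately generic slot-filling loop with hypothetical item names and callback functions; it does not reproduce the seven states of Table 3, the transitions of Fig. 4, or the timetable inference.

      # Hypothetical reservation slots (stand-ins for the seven items of Table 2)
      SLOTS = ["date", "departure_station", "arrival_station", "train",
               "class", "number_of_seats", "smoking_or_not"]

      def run_reservation_dialogue(understand, ask, confirm):
          """Minimal slot-filling loop.

          understand(utterance) -> dict of slot values recognized in one input
          ask(missing_slots)    -> prompt the user and return the next utterance
          confirm(filled)       -> present the completed reservation to the user
          """
          filled = {}
          utterance = ask(SLOTS)                       # initial open question
          while True:
              filled.update(understand(utterance))     # merge newly recognized items
              missing = [s for s in SLOTS if s not in filled]
              if not missing:
                  confirm(filled)
                  return filled
              utterance = ask(missing)                 # ask only for what is still missing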
  • 関口 芳廣, 重永 実
    Article type: Article
    1978, Volume 34, Issue 3, pp. 204-213
    Published: 1978/03/01
    Released online: 2017/06/02
    Journal Free Access
    We have constructed a continuous speech recognition system for various kinds of Japanese sentences. We explain the procedure with reference to the flow diagram shown in Fig. 1. (1) The parameters shown in Fig. 2 are extracted from the input speech wave, and the input speech wave is transformed into a phoneme string. (2) This input phoneme string is transformed into a condensed phoneme string (Fig. 3). (3) The characteristic phoneme string, which contains vowels and /s/ continuing over 50 ms as well as silences, is extracted from the input phoneme string (Fig. 3). (4) Candidate words are predicted by syntactic and semantic information. (5) Furthermore, the candidate words are restricted by a few phonemes at the beginning of the condensed phoneme string. (6) The input characteristic phoneme string is compared with the characteristic phoneme strings of the candidate words, and some words are selected. (7) The input condensed phoneme string is compared with the items of the word dictionary for the candidate words, and some words are selected. Table 2 shows some items of the two dictionaries. (8) In this way, words following the preceding word are selected in turn, and several word strings are formed (Fig. 4). (9), (10) The above procedures are repeated until all input phonemes are processed. (11) Each candidate word string is compared with the input condensed phoneme string. (12) As the final output, the word string having the highest reliability is taken. (13) Pragmatic analysis is carried out on the output word string, and the current subject of conversation is decided. (14) Then the words unrelated to the subject are removed from the dictionaries. The vocabulary contains 99 words, and it is possible to deal with sentences concerning both statistics and landscape. Japanese sentences such as those shown in Fig. 8 were spoken by four adult males. The results are shown in Fig. 9 and Table 3: the system recognized 28 sentences out of 36, and 76 blocks out of 86 (we call a part of a sentence uttered in one breath a block). In this paper, we also discuss some problems concerning the system's performance. The findings are as follows. (1) Some learning process is necessary in order to identify vowels and nasals satisfactorily. (2) Acoustic information is very effective for restricting the number of candidate words; for example, at the beginning of sentences 91 words are usually reduced to 5 (5.5%). (3) The use of syntactic and semantic information reduces the number of candidate words to 20-30% of the number appearing when only acoustic information is used (Fig. 11). (4) The restriction of the word dictionary by pragmatic information is also very effective (Table 5). (5) Misrecognition is mostly due to the appearance of undesirable silences owing to a decrease in the amplitude of the speech wave. The advantages of this system are as follows. (1) Phoneme identification is fairly reliable, and the recognition score for short blocks is very good. (2) Even when the input phoneme string contains some errors, the recognition score is much better than that obtained by the matching method using dynamic programming. (3) Many kinds of Japanese sentences can be dealt with by this system, because the syntactic restrictions are loose. (4) So far, the semantic information used is simple, but semantically unreasonable sentences seldom appear. (5) Candidate words are sufficiently restricted by acoustic and linguistic information.
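    Dynamic-programming matching of a recognized phoneme string against dictionary items, as referred to in the comparison above, is essentially a weighted edit distance in which additions, omissions and substitutions of phonemes carry separate costs. The sketch below is that generic algorithm with arbitrary example costs and made-up strings; it is not the scoring actually used in the system.

      import numpy as np

      def dp_similarity(observed, dictionary_item, add_cost=1.0, omit_cost=1.0, sub_cost=1.2):
          """Weighted edit distance between two phoneme strings (lower is more similar)."""
          n, m = len(observed), len(dictionary_item)
          d = np.zeros((n + 1, m + 1))
          d[:, 0] = np.arange(n + 1) * add_cost    # extra phoneme in the input: addition
          d[0, :] = np.arange(m + 1) * omit_cost   # word phoneme missing in the input: omission
          for i in range(1, n + 1):
              for j in range(1, m + 1):
                  match = 0.0 if observed[i - 1] == dictionary_item[j - 1] else sub_cost
                  d[i, j] = min(d[i - 1, j] + add_cost,
                                d[i, j - 1] + omit_cost,
                                d[i - 1, j - 1] + match)
          return d[n, m]

      def recognize_word(observed, dictionary):
          """Return the dictionary item with the smallest DP distance to the input string."""
          return min(dictionary, key=lambda item: dp_similarity(observed, item))

      # Usage with hypothetical phoneme strings
      dictionary = ["sapporo", "sendai", "nagoya", "okayama"]
      print(recognize_word("sendei", dictionary))   # -> "sendai"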