日本音響学会誌
Online ISSN : 2432-2040
Print ISSN : 0369-4232
Volume 27, Issue 9
Showing 1-9 of the 9 articles in the selected issue
  • 藤崎 博也
    Article type: Main text
    1971, Vol. 27, No. 9, pp. 421-424
    Published: 1971/09/10
    Released online: 2017/06/02
    Journal, free access
  • 沢島 政行
    Article type: Main text
    1971, Vol. 27, No. 9, pp. 425-434
    Published: 1971/09/10
    Released online: 2017/06/02
    Journal, free access
    To view the articulatory movements of speech organs inside the body, we have developed a new technique that uses a specially designed fiberscope. There are three types of fiberscope: the standard model, the thinner model, and the wide-angle model. The first two were designed for observing the larynx, and the third mainly for observing the pharynx and the velum. All of them are inserted through the nasal passage so that the articulatory organs can move freely during observation. A fiberscope basically consists of the image guide, the light guide, the objective lens, and the eyepiece (Fig. 1). The image guide is a bundle of aligned ("coherent") glass fibers that transmits the image from one end, coupled to the objective lens, to the other end, coupled to the eyepiece, while the light guide conducts light for illumination from a light source to the object. The two bundles form the flexible cable of the scope. The diameter of each glass fiber is 9 microns for the image guide and 22 microns for the light guide. The control unit has an angle lever to which a thin wire is attached; the wire runs to the tip of the flexible cable for remote bending control of the tip portion. A cine-camera can be attached to the eyepiece by means of an adapter. The standard model, an improved version of the model we first reported in 1968, has an outside diameter of 5.5 mm at the tip. The objective lens gives an image field angle of 44 degrees, and the object-to-lens distance ranges from 15 to 50 mm. A 300 W xenon lamp gives sufficient illumination of the glottis for motion pictures at a rate of, for example, 64 frames per second, giving an image size of 6×6 mm^2 on the film. Photographic emulsion of ASA 500 is used. The thinner model, which was designed more recently, has an outside diameter of 4.4 mm at the tip. The image size on the film is approximately 4×4 mm^2 when the same adapter as for the standard model is used. Its image resolution is somewhat inferior to that of the standard model. The wide-angle model has an outside diameter comparable with that of the standard model; its objective lens gives a field angle of 65 degrees and a lens-to-object distance range of 7 mm to infinity. The image size on the film is designed to be the same as for the thinner model. Before the scope is inserted, surface anesthesia is applied to the nasal cavity and the epipharynx. Positioning of the scope (Fig. 2 and 8) is quite easy and does not cause the subject any discomfort or disturbance in performing natural utterances. By visual inspection and some quantitative measurement of the photographic images of the larynx, frame by frame, we can analyze the opening and closing gestures of the glottis as well as the presence or absence of vocal fold vibration during consonant articulations (Fig. 4). When the vocal pitch is controlled, an apparent change in the distance between the arytenoid and the epiglottis, and up and down movements of the larynx, are usually observable (Fig. 5). Combining the transillumination technique (photoelectric glottography) with fiberscopic observation (Fig. 6) provides useful data for detailed analysis of rapid changes in the glottal conditions (Fig. 7). Some phonetic data on laryngeal adjustments in speech have been reported elsewhere. Use of the wide-angle model for viewing the pharynx and the velum is now at the stage of preliminary experiment, and the results are quite promising. A brief review is also presented of other techniques employed for observing articulatory movements of the speech organs. The techniques mentioned are: ordinary cineradiography, the new technique of computer-controlled tracking of moving objects using an x-ray microbeam, the photoelectric (transillumination) method, ultrasonic measurement, electrical glottography, and dynamic palatography.
  • 比企 静雄
    Article type: Main text
    1971, Vol. 27, No. 9, pp. 435-444
    Published: 1971/09/10
    Released online: 2017/06/02
    Journal, free access
    For the purpose of relating linguistic units to acoustical events through neurophysiological parameters, strictly taking into account the constraints of the speech organs, a model of the speech production mechanism is studied. The model is based on the anatomical structures and physiological nature of the speech organs and on the nature of the neuromuscular commands that control them. Three-dimensional movement of the tongue is simulated by combining the deformation of an ellipsoid representing the tongue body, caused by the contraction of the 3 intrinsic muscles, the shift and inclination of the ellipsoid, caused by the contraction of the 4 extrinsic muscles, and its rotation, caused by jaw opening. Muscles involved in lip movements are classified into 4 muscle groups according to the effect of the contraction of each muscle on the shape of the lips, and lip movement is simulated by specifying the degree of contraction of each of the 4 muscle groups and calculating their effects on the changes in the shape of the lips. Electromyographic data from those muscles are used to estimate the magnitudes of the neurophysiological parameters that control the contraction of each muscle. It is shown that the vocal tract model makes it possible to coordinate observations made at different levels: 3-dimensional changes in the shape of the tongue and lips obtained by plaster cast, changes in the mid-sagittal contour of the vocal tract obtained by X-ray photograph, the jaw opening measured by an optical device, and changes in the formant frequencies of the resulting speech sound, especially for vowels.
  • 藤崎 博也, 須藤 寛
    Article type: Main text
    1971, Vol. 27, No. 9, pp. 445-452
    Published: 1971/09/10
    Released online: 2017/06/02
    Journal, free access
    Prosodic features in speech can be interpreted as responses of the underlying mechanisms to a set of linguistic commands. This paper presents a quantitative model of the mechanisms that generate the fundamental frequency contours of word accent in standard Japanese. All types of word accent in standard Japanese are characterized by the existence of a transition in subjective pitch, either upward or downward, at the end of the initial mora, and by the fact that no more than one downward transition is allowed within a word. Table 1 lists the patterns of subjective pitch of all possible accent types for words of up to 5 morae. These binary patterns, however, never manifest as such in the fundamental frequency contours. Analysis of utterances by a number of speakers (Fig. 1) indicates that the logarithmic fundamental frequency contours of the same word accent, when normalized both in time and in frequency, are essentially similar (Fig. 2 and Fig. 3). These observations lead to the model of Fig. 4, based on the following assumptions: (1) Each type of word accent can be characterized by a unique logarithmic contour. (2) Commands for voicing and accent take the form of binary inputs to the system. (3) Separate mechanisms exist for voicing and accent, which can be approximated by linear systems that convert the binary commands into the respective control signals (Fig. 5). (4) These control signals are combined and applied to the mechanism of glottal oscillation, whose fundamental frequency is an exponential function of the control signal. (5) The glottal mechanism shows hysteresis specified by the onset and cessation of the oscillation (Fig. 6). To investigate the validity of the model, fundamental frequency contours of various utterances of isolated words were extracted by a computer program (Fig. 7) and were analyzed by the method of Analysis-by-Synthesis (Fig. ). A few examples comparing the extracted fundamental frequency contour with its closest approximation obtained by the A-b-S procedure are shown in Fig. 9.
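    Assumptions (2) to (4) above can be illustrated with a minimal sketch. The abstract only says that the voicing and accent mechanisms are "linear systems"; the critically damped second-order step response used here, and every name and parameter value (`step_response`, `rate`, `f_min`, the command timings), are assumptions made for illustration and are not taken from the paper. The hysteresis of assumption (5) is left out.

```python
import numpy as np

def step_response(t, rate):
    # Assumed form: unit step response of a critically damped
    # second-order linear system; zero for t <= 0.
    t = np.maximum(t, 0.0)
    return 1.0 - (1.0 + rate * t) * np.exp(-rate * t)

def binary_command_response(t, onset, offset, amplitude, rate):
    # A binary (on/off) command is a step up at `onset` and a step
    # down at `offset`; the system is linear, so responses superpose.
    return amplitude * (step_response(t - onset, rate)
                        - step_response(t - offset, rate))

def f0_contour(t, f_min, voicing, accent):
    # Control signals from the two mechanisms are summed, and the
    # fundamental frequency is an exponential function of the sum.
    control = (binary_command_response(t, *voicing)
               + binary_command_response(t, *accent))
    return f_min * np.exp(control)

# Hypothetical command timings (seconds), amplitudes, and rates.
t = np.linspace(0.0, 1.0, 101)
f0 = f0_contour(t, 100.0,
                voicing=(0.0, 0.8, 0.3, 3.0),
                accent=(0.2, 0.5, 0.4, 20.0))
```

    Because the sum is formed before the exponential, the two responses add in the logarithmic frequency domain, which matches the abstract's observation that contours are similar on a logarithmic scale.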
  • 藤崎 博也, 川島 崇子
    Article type: Main text
    1971, Vol. 27, No. 9, pp. 453-462
    Published: 1971/09/10
    Released online: 2017/06/02
    Journal, free access
    Following a brief review of previous experimental studies (Fig. 1) and theories concerning the perception of speech, experiments on the identification and discrimination of synthetic vowels are described. The experiments are designed with special emphasis on the accuracy of stimulus parameters and an extensive selection of experimental conditions. The results clearly indicate the influence of categorical judgment in the discrimination of synthetic vowels, disproving the dichotomy of continuous versus categorical modes in speech perception (Fig. 2). On the basis of the assumptions that immediate categorical judgment is inherent in the perception of speech sounds, and that categorical judgment overrides comparative judgment whenever feasible in the process of discrimination, a model is presented for the mechanisms and processes involved in the discrimination of speech sounds (Fig. 3). A quantitative expression for the discriminability is derived assuming statistical variations of the phoneme boundary and of the short-term memory for stimulus timbre (Eqs. 1 to 3). To check the validity of the model and the underlying assumptions, a further experiment is performed using stimuli generated and compiled by a digital computer, allowing higher accuracy and resolution in the measurement of discriminability. The discrimination curve obtained in the experiment is then compared with theoretical curves derived from the model, and the parameters characterizing various psychological processes are extracted by the method of analysis-by-synthesis (Fig. 4). The close agreement between the measured and the theoretical curves demonstrates the essential validity of the model as compared to other models (Fig. 5).
  • 板倉 文忠, 斎藤 収三
    Article type: Main text
    1971, Vol. 27, No. 9, pp. 463-472
    Published: 1971/09/10
    Released online: 2017/06/02
    Journal, free access
    To achieve high-quality speech compression, it is necessary to extract feature parameters, such as driving source parameters and spectral parameters, as accurately as possible, and to reproduce these features of the original speech. The purpose of this paper is to give a brief review of speech compression systems and to propose a new method of statistical speech signal processing for extracting the spectral envelope and the driving source characteristics, together with its application to a speech compression system. First, the theory of maximum likelihood spectral estimation is discussed, and it is shown that the spectral parameters are extracted by solving a linear system of equations with the short-time autocorrelation as its coefficients. Second, a modified autocorrelation method for accurate pitch determination is proposed, in which the ill effects of the spectral envelope appearing in the short-time autocorrelation are removed by convolving the autocorrelation with the Fourier coefficients of the inverse spectral envelope. Finally, we deal with the problems of speech synthesis from the speech parameters, and the results of a computer simulation of this method are described. The quality of the speech output was measured by means of an articulation test, from which the articulation reference equivalents are determined. It is shown that the analysis-synthesis speech compression system can compress speech information to 5,000 bits/sec at the expense of a 3 dB articulation reference equivalent.
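    The step described above, solving a linear system whose coefficients are the short-time autocorrelation, can be sketched as follows. This is the standard autocorrelation (normal-equation) formulation of linear prediction, assumed here as an illustration; the abstract does not give the paper's exact equations, and the function names and the direct use of `numpy.linalg.solve` are choices made for this sketch.

```python
import numpy as np

def short_time_autocorrelation(frame, order):
    # r[k] = sum_n frame[n] * frame[n + k] for lags k = 0..order.
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[:n - k], frame[k:])
                     for k in range(order + 1)])

def predictor_coefficients(frame, order):
    # Solve R a = r, where R[i, j] = r[|i - j|] is a Toeplitz matrix
    # built from the autocorrelation and r = (r[1], ..., r[order]).
    # The solution a predicts x[n] from x[n-1], ..., x[n-order].
    r = short_time_autocorrelation(frame, order)
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[lags], r[1:order + 1])
```

    For a frame `x` and a hypothetical model order of 10, `predictor_coefficients(x, 10)` returns ten coefficients that together describe the spectral envelope of that frame. A practical system would usually exploit the Toeplitz structure with a recursive solver instead of a general one.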
  • 板橋 秀一, 城戸 健一
    Article type: Main text
    1971, Vol. 27, No. 9, pp. 473-482
    Published: 1971/09/10
    Released online: 2017/06/02
    Journal, free access
    Speech is not merely a physical phenomenon but also one of the forms in which a linguistic event is expressed. It is therefore natural and necessary for automatic speech recognition to take the linguistic aspect of speech into account. Linguistic information is given by the meaning, the grammar, the dictionary, the connecting rules of phonological units, and so on. The former two have not yet been studied enough to be used for automatic speech recognition, so it seems reasonable to limit our present object of study to the automatic recognition of spoken words. From this standpoint, the authors have studied an automatic spoken word recognition system that uses some of the linguistic rules and the dictionary, as shown in Figs. 3 and 4. The speech signal is digitally filtered into four frequency bands every 10 ms. These bands have been determined by considering the formants of vowels and nasals and the noise components of consonants. The logarithm of the variance of the output of band M_1, together with LT, M_1L, etc. in Fig. 1, are used as parameters, which are then transformed into distinctive features. Let X^k_i={X^k_(ir)}^9_(r=1) denote the parameters obtained every 10 ms that should be categorized as feature plus (k=+) or minus (k=-), where i indicates the material number (i=1, ..., n) and r indexes the nine parameters. The nine distinctive features are represented by linear combinations of these parameters of the form F(X^k_i)=Σ^9_(r=1)C_rX^k_(ir). The coefficients C_r are determined so as to maximize the ratio of the variance between the two classes {F(X^+_i)} and {F(X^-_i)} to the sum of the variances within each class. Phonemes are classified into two groups according to the signs of the nine distinctive features, as shown in Tab. 1. The average error rate of feature extraction is 10.5% for 13 words (7 seconds of speech) spoken by a male talker. The series of values of the nine distinctive features is segmented first with reference to a certain amount of change in feature value, and second by applying rules that depend on the result of the primary segmentation, the context, the duration of the segment, and the phoneme connection rules. The input feature matrix is made from the representative features of each segment. Separately, each item of the 54-word dictionary, which is represented as a series of phonemes, is transformed into a series of features, which is then transformed into a standard feature matrix by applying phonological rules such as devocalization. The distance between the input and standard feature matrices is calculated for each item of the dictionary, and the item at minimum distance from the input is taken as the recognized output (see Fig. 3). In our experiments, the recognition rate is 42.0% with the segmentation rule only, 59.5% with the segmentation and phoneme connection rules, and 92.3% with the dictionary in addition to those rules, for 13 words spoken by a male talker. Of 53 words spoken by the same talker, 79.2% are recognized correctly. Next, we examined the performance of the recognition system equipped with a duration dictionary, which contains the typical duration of the phonemes in each word (see Fig. 4). The segmentation is performed according to the item of the duration dictionary; the item at minimum distance from the input feature matrix is taken as the recognized output. Of 52 words uttered by the same talker as above, for the standard duration, 92.3% are recognized correctly. The average recognition rate for 10 words spoken by each of another nine male talkers is 70.0%. The effectiveness of using a word dictionary and some of the linguistic rules for automatic spoken word recognition is thus made clear.
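    The criterion used above for the coefficients C_r, maximizing the between-class variance of F relative to the sum of the within-class variances, is the criterion of Fisher's linear discriminant. The sketch below uses the well-known closed-form solution for two classes; treating the paper's procedure as exactly this solution is an assumption, and the function name and array layout (one row per 10 ms frame, one column per parameter) are hypothetical.

```python
import numpy as np

def fisher_coefficients(x_plus, x_minus):
    # x_plus, x_minus: arrays of shape (n_frames, n_parameters) holding
    # the parameter vectors labeled feature-plus and feature-minus.
    x_plus = np.asarray(x_plus, dtype=float)
    x_minus = np.asarray(x_minus, dtype=float)
    mean_diff = x_plus.mean(axis=0) - x_minus.mean(axis=0)
    # Sum of the within-class covariance matrices.
    s_within = (np.atleast_2d(np.cov(x_plus, rowvar=False))
                + np.atleast_2d(np.cov(x_minus, rowvar=False)))
    # The direction maximizing the variance ratio is proportional to
    # s_within^{-1} (mean_plus - mean_minus).
    return np.linalg.solve(s_within, mean_diff)

def feature_value(x, coefficients):
    # F(X) = sum_r C_r * X_r for each row of x.
    return np.asarray(x, dtype=float) @ coefficients
```

    With coefficients obtained this way, the plus class has the larger mean feature value, so a frame can be labeled by comparing `feature_value` against a threshold placed between the two class means.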
  • 迫江 博昭, 千葉 成美
    Article type: Main text
    1971, Vol. 27, No. 9, pp. 483-490
    Published: 1971/09/10
    Released online: 2017/06/02
    Journal, free access
    We consider the pattern matching method with time-normalizing ability to be one of the most effective methods for spoken word recognition, and one that can well be applied to the recognition of continuously spoken words. Speech can be expressed as a vector-valued time function (1), (2) by appropriate feature extraction. The effect of speaking rate variation can then be regarded as a non-linear transformation of the time axis (3), and can be compensated by minimizing (4) (Fig. 1). Based on these considerations, we evaluate the time-normalized similarity S(A, B) by (5). S(A, B) is calculated efficiently using the dynamic programming technique (15), (16), (17) (Fig. 3). With this pattern matching scheme, continuously spoken words can be separated into word units by determining the subpattern A^l that is most similar to the stored reference pattern, that is, for which S(A^l, B) is minimum (19), (20) (Fig. 4). Based on this segmentation scheme, three methods for recognizing continuously spoken words are proposed. Method-a is a direct application of the segmentation scheme. In Method-b, recognition is carried out by evaluating the similarity S(A^l, B_mB_n) between concatenated reference patterns (17) and the input pattern, and the amount of computation is considerably reduced by using the segmentation scheme and the DP technique (Fig. 5). Method-c is a modification of Method-b in which, after each word is recognized, the matching window is shifted so that it can cover the timing difference of the next word (Fig. 6). These methods were extensively examined by computer simulation. An average recognition rate of 99.8 per cent was obtained for 2400 utterances of Japanese 1-digit numbers by five speakers, and 99.6 per cent for a total of 500 continuously spoken 2-digit numbers by five male speakers using Method-c (Table 1, 2, 3).
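    The time-normalized matching described above can be sketched with a basic dynamic programming recurrence. The abstract does not reproduce equations (5) and (15) to (17), so the local path constraint, the Euclidean frame distance, the normalization by the summed pattern lengths, and the optional `window` parameter below are assumptions for illustration and may differ from the paper's definitions.

```python
import numpy as np

def time_normalized_distance(a, b, window=None):
    # a, b: feature sequences (one row per frame); 1-D input is treated
    # as a sequence of scalar features. Smaller values mean more similar.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)
    # The window must at least span the length difference.
    window = max(window, abs(n - m))
    g = np.full((n + 1, m + 1), np.inf)
    g[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessors.
            g[i, j] = d + min(g[i - 1, j], g[i - 1, j - 1], g[i, j - 1])
    return g[n, m] / (n + m)
```

    Under this sketch, the same pattern spoken at two different rates, for example `[0, 0, 1, 1, 2, 2, 3, 3]` and `[0, 1, 2, 3]`, is aligned at no cost, while a genuinely different pattern yields a positive distance. Recognition then picks the reference pattern with the smallest distance.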
  • 比企 静雄
    Article type: Main text
    1971, Vol. 27, No. 9, pp. 491-494
    Published: 1971/09/10
    Released online: 2017/06/02
    Journal, free access