This paper proposes a blind calculation method for the poles common to multiple signal transmission paths. In the field of room acoustics, the poles correspond to the mode frequencies that are determined by room size and shape, and they do not change when source and receiver locations change. Information on these acoustic poles is useful for many applications, including echo cancellation and sound field equalization in a room. Conventional pole estimation methods require a priori measurement of the room transfer functions. This paper proposes a new method for the blind calculation of the poles, where the poles are calculated solely from the observed signals. Simulation results show that the proposed algorithm provides precise estimates of the common poles.
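As a small illustration of why poles carry room-mode information, the sketch below converts a single discrete-time pole into a mode frequency and a 60 dB decay time. It is not the blind estimation method proposed in the abstract; the function name `pole_to_mode`, the 8 kHz sampling rate, and the example pole are assumptions made only for illustration.

```python
import cmath
import math


def pole_to_mode(pole, fs):
    """Map a z-plane pole to (mode frequency in Hz, 60 dB decay time in s)."""
    radius = abs(pole)
    if not 0.0 < radius < 1.0:
        raise ValueError("expected a stable pole strictly inside the unit circle")
    # The pole angle sets the resonance frequency.
    freq_hz = abs(cmath.phase(pole)) * fs / (2.0 * math.pi)
    # The pole radius sets the decay: the envelope is radius**n = exp(-decay * t).
    decay_per_second = -fs * math.log(radius)
    t60 = math.log(1000.0) / decay_per_second  # time for a 60 dB amplitude drop
    return freq_hz, t60


# Hypothetical pole: a 100 Hz mode at an assumed 8 kHz sampling rate.
fs = 8000.0
pole = 0.999 * cmath.exp(1j * 2.0 * math.pi * 100.0 / fs)
freq_hz, t60 = pole_to_mode(pole, fs)
```

Because the angle and radius depend only on the pole itself, the same mode frequency and decay would be recovered for any source and receiver position that shares that pole.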
Several adaptive algorithms for robust echo cancellation use nonlinear reference and/or error functions. Most of them require time-variant threshold estimators, e.g., noise level estimators or double-talk detectors, since their nonlinearities have to be adjusted in response to changes in near-end noise or speech signal levels. We propose a new frequency domain adaptive algorithm, the gradient-limited fast least-mean-squares (GL-FLMS), in which the coefficients are updated by using a nonlinear function of the error scaled by the reference magnitude, i.e., the error-to-reference ratio (ERR). When the acoustic coupling level between loudspeaker and microphone is bounded, the ERR is also bounded during single-talk, but it may increase during double-talk. The GL-FLMS limits unexpected increases in the ERR with fixed thresholds and prevents divergence of the coefficients, while still allowing the updates needed to adapt when a large reference signal produces a large error during single-talk.
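The idea of limiting the ERR with a fixed threshold can be sketched for a single frequency bin as follows. This is a minimal sketch, not the published GL-FLMS: the hard magnitude limiter, the normalized update form, and the names `limit_err` and `update_bin` are all assumptions chosen for illustration.

```python
def limit_err(error_bin, ref_bin, threshold):
    """Error-to-reference ratio for one bin, with its magnitude clipped."""
    ref_mag = abs(ref_bin)
    if ref_mag == 0.0:
        return 0j  # no reference energy, so no usable gradient
    err = error_bin / ref_mag
    err_mag = abs(err)
    if err_mag > threshold:
        err *= threshold / err_mag  # keep the phase, limit the magnitude
    return err


def update_bin(weight, error_bin, ref_bin, step, threshold):
    """One coefficient update for one bin, driven by the limited ERR."""
    ref_mag = abs(ref_bin)
    if ref_mag == 0.0:
        return weight
    direction = ref_bin.conjugate() / ref_mag  # unit-magnitude gradient direction
    return weight + step * direction * limit_err(error_bin, ref_bin, threshold)


# Hypothetical single-bin example: echo path gain h, reference x, no near-end signal.
h = 0.5 + 0.2j
x = 2.0 - 1.0j
e = h * x  # error when the coefficient starts at zero
w_unlimited = update_bin(0j, e, x, step=1.0, threshold=10.0)
w_limited = update_bin(0j, e, x, step=1.0, threshold=0.1)
```

In this sketch the ERR magnitude equals the residual coupling gain, so a threshold above the largest plausible coupling leaves single-talk updates untouched, while a much larger ERR, as might occur during double-talk, moves the coefficient by at most `step * threshold`.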
Morphological measurements of the hypopharynx were conducted to investigate the correlation between fine structures of the vocal tract and speaker characteristics. The hypopharynx includes the laryngeal tube and the bilateral cavities of the piriform fossa. MRI data during sustained phonation of the five Japanese vowels by four subjects were obtained to analyze intra- and inter-speaker variation of the hypopharynx. Morphological analysis on the mid-sagittal and transverse planes revealed that the shape of the hypopharynx was relatively stable regardless of vowel type, in contrast to the relatively large inter-speaker variation; these results were confirmed quantitatively by a simple similarity method. The small intra-speaker variation of the hypopharynx was further confirmed by morphological analysis of high-quality MRI data for one of the subjects, obtained by using the “phonation-synchronized method” and a “custom laryngeal coil.” Furthermore, the acoustical effects of the individual variation of the hypopharynx were estimated by using a transmission line model. The vocal tract area function above the hypopharynx for one of the subjects was combined with the hypopharyngeal cavities of the other subjects, and the resulting transfer functions were calculated. The results show that the inter-speaker variation of the hypopharynx affects spectra in the frequency range beyond approximately 2.5 kHz.
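A transmission line calculation of this kind can be sketched by cascading chain matrices of short tube sections. The version below is a simplified illustration, not the model used in the study: it assumes lossless sections, an ideal open end at the lips, and no side branches (so the piriform fossae are omitted), and the constants and function names are hypothetical.

```python
import cmath
import math

RHO = 1.14  # air density in kg/m^3 (assumed value)
C = 350.0   # speed of sound in m/s (assumed value)


def section_matrix(area, length, freq):
    """Chain (ABCD) matrix of one lossless uniform tube section."""
    k = 2.0 * math.pi * freq / C  # wavenumber
    z0 = RHO * C / area           # characteristic acoustic impedance
    cos_kl = cmath.cos(k * length)
    sin_kl = cmath.sin(k * length)
    return ((cos_kl, 1j * z0 * sin_kl),
            (1j * sin_kl / z0, cos_kl))


def matmul2(m, n):
    """Product of two 2x2 matrices stored as nested tuples."""
    return ((m[0][0] * n[0][0] + m[0][1] * n[1][0],
             m[0][0] * n[0][1] + m[0][1] * n[1][1]),
            (m[1][0] * n[0][0] + m[1][1] * n[1][0],
             m[1][0] * n[0][1] + m[1][1] * n[1][1]))


def transfer_function(sections, freq):
    """Volume-velocity gain from glottis to lips, with zero pressure at the lips."""
    total = ((1.0, 0.0), (0.0, 1.0))
    for area, length in sections:  # (area in m^2, length in m), glottis to lips
        total = matmul2(total, section_matrix(area, length, freq))
    return 1.0 / total[1][1]  # U_lips / U_glottis when P_lips = 0


# Hypothetical uniform tube, 17.5 cm long, split into two equal sections.
uniform_tube = [(3.0e-4, 0.0875), (3.0e-4, 0.0875)]
gain_250 = abs(transfer_function(uniform_tube, 250.0))
gain_450 = abs(transfer_function(uniform_tube, 450.0))
```

Swapping the first few sections for a narrower, speaker-specific set of areas while keeping the remaining sections fixed mirrors the comparison described above, in which only the hypopharyngeal part differs between the computed transfer functions.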
When a portion of a sound is replaced by a noise burst, its duration is perceived to be shorter than that of its intact counterpart. To test the robustness of this shrinking effect by noise replacement and to validate the hypothesis that duration can be estimated as a function of accumulated perceptual evidence for the target sound, the shrinking effect was investigated with tonal stimuli in two contextual temporal structures. Two experiments were conducted using (1) a tone with an envelope pattern copied from a naturally spoken word, and (2) an isochronous sequence of four tones. In most cases, the noise replacement caused the perceived duration of the target tone to shrink relative to that of its intact counterpart. However, a reversal/prolongation tendency by noise was observed for the stimulus with a deviation slightly shorter than an isochronous structure in the second experiment. Although this reversal tendency partially supports the hypothesis that a noise merely enhances a contextual effect (the contextual enhancement hypothesis), the shrinking effect observed under the other conditions was difficult to explain by that hypothesis. The shrinking effect could be explained in the framework of traditional neural counting mechanisms with one additional mechanism that controls the degree of gate opening depending on the perceptual evidence for the target sound.
The duration of sounds generally tends to be perceived as shorter when a portion is replaced by a noise burst. However, a reversal/prolongation tendency can occur if a compelling isochronous context is functioning. To test the robustness of the durational shrinkage as well as to investigate what aspect is the core feature providing the isochronism, three experiments are conducted using (1) a non-isochronous sequence of four tones, (2) a four-tone sequence whose interonset intervals fluctuate randomly, and (3) a four-tone sequence whose interonset intervals are fixed to be isochronous irrespective of adjustment by human participants in the experiment. In most cases, the noise replacement causes the perceived duration of the target tone to shrink compared to that of its intact counterpart. Furthermore, the reduction of isochronous context results in the reduction of the reversal tendency, although the shrinking effect cannot be observed clearly either. The effects of noise replacement and context are discussed in relation to the contribution of local cues provided by the perceptual evidence as well as the contribution of a global cue provided by an isochronous interonset interval.
An improved backward prediction coder featuring two-stage vector quantization (VQ) of shape codevectors is presented. Efficient two-stage VQ is achieved using the wavelet coefficients of excitation signals; i.e., wavelet coefficients are calculated by applying a discrete wavelet transform to excitation signals, and the results are divided into an approximation group and a detail group. The data lengths of both approximation and detail coefficients are half that of conventional two-stage VQ systems. Simulation results show that the proposed coder achieves a better weighted signal-to-noise ratio (WSNR) than conventional coders and, in terms of reconstructed speech quality, ranks between the FS-1016 Code Excited Linear Prediction (CELP) coder and the Vector Sum Excited Linear Predictive Coding (VSELP) coder.
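The split into an approximation group and a detail group can be illustrated with a one-level discrete wavelet transform. The sketch below uses the Haar wavelet purely as an assumption, since the abstract does not name the wavelet, and the excitation values are made up.

```python
import math


def haar_dwt(signal):
    """One-level Haar DWT: return (approximation, detail), each half as long."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the signal from both coefficient groups."""
    s = 1.0 / math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend(((a + d) * s, (a - d) * s))
    return signal


# Made-up excitation vector of length 8.
excitation = [0.5, -1.0, 0.25, 0.75, -0.5, 0.0, 1.0, -0.25]
approx, detail = haar_dwt(excitation)
rebuilt = haar_idwt(approx, detail)
```

Each group has half the length of the input, which is the property that lets each stage of a two-stage quantizer work on shorter vectors.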
The detection threshold for distortions due to time jitter was measured in a two-alternative forced-choice paradigm with switching sounds. Music signals with random jitter were simulated in the digital domain. The size of the jitter was controlled arbitrarily so that the detection threshold could be estimated. Professional audio engineers, sound engineers, audio critics, and semi-professional musicians participated as listeners. The listeners were allowed to use their own listening environments and their favorite sound materials. The results showed that the detection threshold for random jitter was several hundred nanoseconds for well-trained listeners under their preferred listening conditions.
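Random sampling jitter can be simulated digitally by evaluating a signal at perturbed sample instants. The sketch below uses a pure sine tone and Gaussian timing errors as simplifying assumptions; the study itself used music signals, and the function name and parameter values here are hypothetical.

```python
import math
import random


def jittered_sine(freq, fs, num_samples, jitter_rms, seed=0):
    """Return (ideal, jittered, offsets) for a sine sampled with timing errors."""
    rng = random.Random(seed)  # fixed seed keeps the example repeatable
    ideal, jittered, offsets = [], [], []
    for n in range(num_samples):
        t = n / fs
        # Gaussian timing error in seconds for this sample instant.
        dt = rng.gauss(0.0, jitter_rms) if jitter_rms > 0.0 else 0.0
        ideal.append(math.sin(2.0 * math.pi * freq * t))
        jittered.append(math.sin(2.0 * math.pi * freq * (t + dt)))
        offsets.append(dt)
    return ideal, jittered, offsets


# Assumed example: 10 kHz tone, 44.1 kHz sampling, 500 ns RMS jitter.
ideal, jittered, offsets = jittered_sine(10000.0, 44100.0, 1000, 500e-9)
errors = [j - i for i, j in zip(ideal, jittered)]
```

For a unit-amplitude tone, each sample error is bounded by 2πf times the timing offset, so the distortion grows with both the jitter size and the signal frequency.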
By providing a phase reversal at the focal region, another phase reversal due to diffraction is compensated, and as a result an enhancement of the second harmonic generation is expected. To test this, the reflection of a focused beam at a water surface placed in the focal region is observed experimentally. The focusing source, which employs a LiNbO3 plate with a ferroelectric inversion layer, also receives the reflected second harmonic sound. The experimental result is compared with a theoretical calculation based on the Khokhlov-Zabolotskaya-Kuznetsov equation, in which the phase reversal for both the fundamental and second harmonic components is assumed to occur at the free surface. The experimental result agrees reasonably well with the predicted increase in the second harmonic amplitude by a factor of 2.0. Since this growth rate is sensitive to the velocity dispersion that occurs in different liquid media, such a measurement of the second harmonic component may be useful for estimating the dispersion.
The present status, progress, and usage of Japanese speech databases are described. Database projects in Japan started in the early 1980s. The first was by the Japan Electronic Industry Development Association (JEIDA), which aimed at creating a speech database to evaluate the performance of existing speech input/output machines and systems. Several database projects have been undertaken since then, including the one initiated by the Advanced Telecommunication Research Institute (ATR), and an enormous amount of spontaneous speech data is now available. A survey was recently conducted on the usage of existing speech databases among industry and university institutions in Japan where speech research is actively pursued. It revealed that ATR’s continuous speech database is the most frequently used, followed by the equivalent version from the Acoustical Society of Japan.