A very high quality speech analysis, modification, and synthesis system, STRAIGHT, has now been implemented in C and runs in real time. This article first provides a brief summary of the STRAIGHT components and then introduces the underlying principles that enable real-time operation. In STRAIGHT, the built-in extended pitch-synchronous analysis, which does not require analysis window alignment, plays an important role in the real-time implementation. A detailed description of the processing steps, which are based on the so-called “just-in-time” architecture, is presented. Other issues related to real-time implementation and performance measures are also discussed. The software will be available to researchers upon request.
In this paper, we propose a low-bit-rate audio codec using a new analysis method named mel-scaled linear predictive analysis (mel-LP analysis). In mel-LP analysis, a spectral envelope is estimated on a mel- or Bark-frequency scale so as to improve the spectral resolution in the low-frequency band. This analysis requires about twice the computation of standard LPC analysis. Our codec using mel-LP analysis consists of five key parts: time-frequency transformation, flattening of MDCT coefficients using the mel-LP spectral envelope, power normalization, perceptual weighting estimation, and multistage VQ. In subjective experiments, we evaluated the performance of the codec through 7-level paired comparison tests. The results show that the codec using mel-LP analysis performs well at low bit rates, particularly at 16 kbps. Sound quality was improved for pop songs, piano music, and male speech.
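The effect of estimating an envelope on a mel-like scale can be illustrated with a small sketch. It assumes the frequency axis is warped by the phase response of a first-order all-pass filter, a common way to approximate mel or Bark scales; the abstract does not state how the warping is realized, so the function `warped_frequency` and the value of `ALPHA` are hypothetical and chosen only for illustration.

```python
import math

def warped_frequency(omega, alpha):
    """Map a normalized angular frequency omega (0..pi) through the phase
    response of a first-order all-pass filter with warping factor alpha."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

# alpha = 0.0 leaves the axis unchanged; a positive alpha stretches the low band.
ALPHA = 0.5  # hypothetical value, not taken from the paper

for k in range(5):
    omega = k * math.pi / 4
    print(f"{omega:.3f} -> {warped_frequency(omega, ALPHA):.3f}")
```

With a positive warping factor, every frequency strictly between 0 and pi maps to a higher warped frequency, so the low band occupies a larger share of the warped axis. This is the sense in which resolution improves at low frequencies.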
An improved method of single-channel noise reduction by blind source separation (BSS) is discussed in this paper. We develop a noise suppression method and system consisting of two adaptive filters for the first noise-reduced speech and the estimated noise-dominant signal. Initially, we reduce the noise level by the weighted noise subtraction (WNS) method and obtain the first noise-reduced speech. The square of the complement of the estimated noise degree is used as the weighting factor during the subtraction. The adaptive filtering uses the least-mean-squares (LMS) algorithm, which is based on the steepest descent method. When the input signal-to-noise ratio (SNR) varies substantially, performing the specified number of LMS iterations for each SNR is time-consuming. Therefore, we propose a function that estimates the number of iterations required for a given value of the noise degree. The proposed iteration number reduces the computational time and minimizes the signal regeneration problem. Moreover, processing with an appropriate block length makes the algorithm efficient. The experimental results confirm the improved performance of the proposed WNS+BSS method.
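Two ingredients named in the abstract above can be sketched in a few lines: the weighting factor (the square of the complement of the noise degree) and a plain LMS update. This is a minimal sketch, not the authors' implementation. It assumes the noise degree lies in [0, 1], and the names `wns_weight` and `lms_filter`, the step size `mu`, and the test system `true_w` are all hypothetical. The function that maps the noise degree to an iteration count is not given in the abstract, so `iterations` is simply passed in.

```python
import random

def wns_weight(noise_degree):
    """Square of the complement of the noise degree (assumed to lie in [0, 1])."""
    return (1.0 - noise_degree) ** 2

def lms_filter(x, d, num_taps, mu, iterations=1):
    """Adapt FIR weights so that filtering x approximates d (plain LMS)."""
    w = [0.0] * num_taps
    for _ in range(iterations):
        for n in range(num_taps - 1, len(x)):
            window = [x[n - k] for k in range(num_taps)]  # newest sample first
            y = sum(wk * xk for wk, xk in zip(w, window))  # filter output
            e = d[n] - y                                   # instantaneous error
            # Steepest-descent step on the squared error.
            w = [wk + mu * e * xk for wk, xk in zip(w, window)]
    return w

# Toy system identification with a hypothetical 3-tap system.
random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
true_w = [0.5, -0.3, 0.1]
d = [sum(true_w[k] * x[n - k] for k in range(3) if n - k >= 0)
     for n in range(len(x))]
w = lms_filter(x, d, num_taps=3, mu=0.05, iterations=2)
print([round(v, 3) for v in w])
```

In this noise-free toy case the adapted weights approach `true_w`. Fewer iterations cost less computation but leave a larger residual error, which is the trade-off that an iteration-count estimate is meant to manage.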
In this paper, we describe a speech recognition interface system for digital TV (DTV) control. TV systems are currently undergoing digitalization and will become more multifunctional, leading to more complex TV operations. Thus, it is necessary for everyone to be able to use TVs easily, and a speech recognition interface is an important key technology. A speech recognition system designed for home use, particularly for digital TV, must be simple and robust to environmental noise and speaker variations. To provide robustness to noise, we developed a noise reduction technique for household noise and an echo-canceling technique for TV sound. To achieve robustness to speaker variations, we developed new speaker adaptation techniques, which are incorporated in the system. These technologies result in a significant improvement in the recognition performance of the DTV interface.
In this paper, we provide an overview of a new immersive, cost-effective stereo echo canceller that we have developed recently. To achieve immersive stereo hands-free communication, we expanded the frequency range from 0.1–7 kHz to 0.1–20 kHz and revised the echo reduction processing. To achieve a cost-effective canceller, we revised the adaptive algorithm to reduce the required memory and implemented the entire signal processing on a single fixed-point digital signal processor (DSP). The experiments indicate that the new stereo echo canceller delivers near-end speech and background sound more naturally in double-talk situations.
We have developed the field recording, recognition and reproduction (FIR3) system to record a sound field for later reproduction with the goal of reconstructing the sound information of a room in another space at another time. In this system, a surrounding microphone array is used to record a sound field. A method for detecting sound source positions using this microphone array is discussed in this paper. First, the microphone array properties were examined. On the basis of the results of this examination, we developed a method in which the multiple signal classification (MUSIC) algorithm and the spatial smoothing technique are integrated and named it “rearrangement and presmoothing for MUSIC” (RAP-MUSIC). Measurement in an actual room showed that, using this method, source positions in a reverberant room can be accurately detected.
In this paper, we present and discuss an educational system in the fields of acoustics and speech science using a series of physical models of the human vocal tract. Because education in acoustics is relevant for several fields related to speech communication, it hosts students from a variety of educational backgrounds. Moreover, we believe that an education in acoustics is important for students of different ages: college, high school, middle school, and even elementary school students. Because of the varied student populations, we developed an educational system that instructs students intuitively and effectively and consists of the following models: lung models, an artificial larynx, Arai’s models (cylinder and plate type models), Umeda and Teranishi’s model (a variable-shape model), and head-shaped models. These models effectively demonstrate several principal aspects of speech production, such as phonation, source-filter theory, the relationship between vocal-tract shape/tongue movement and vowel quality, and nasalization of vowels. We have confirmed that combining the models in an effective way provides a comprehensive education in the acoustics of speech production. The examinations and questionnaire surveys conducted before and after using our proposed system revealed that the learners’ understanding improves with the use of the system. The system is also effective for voice and articulatory training in speech pathology and language learning.
Medical Doppler ultrasound systems provide a time-varying spectrum display and a stereo Doppler audio output based on blood flow. We examined the digital signal processing of Doppler audio from the viewpoint of cost and size reduction. We developed a new direction separation system for Doppler audio processing that is interlocked with the spectrum Doppler image processing to handle aliasing. We set target performance values, developed three kinds of signal processing systems, and evaluated each of them. The evaluation showed that the complex IIR filter system offers an excellent response and a low calculation load. We confirmed by simulation that the Doppler audio signal processing handles aliasing and solves the problem found in conventional systems.
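Direction separation with a complex-coefficient IIR filter can be illustrated with a toy example. It assumes the Doppler signal is available as a complex (quadrature) sequence in which one flow direction appears at positive frequencies and the other at negative frequencies. The one-pole filter below is a hypothetical stand-in, not the filter evaluated in the paper, and it does not model the aliasing handling; the names `complex_one_pole`, `tone`, and `mean_power` are illustrative only.

```python
import cmath
import math

def complex_one_pole(x, radius, centre):
    """y[n] = (1 - radius) * x[n] + radius * exp(j*centre) * y[n-1].

    The complex pole makes the response asymmetric about zero frequency,
    peaking near the angular frequency `centre`."""
    pole = radius * cmath.exp(1j * centre)
    y, prev = [], 0j
    for sample in x:
        prev = (1.0 - radius) * sample + pole * prev
        y.append(prev)
    return y

def tone(freq, n):
    """Complex exponential at angular frequency freq (radians per sample)."""
    return [cmath.exp(1j * freq * k) for k in range(n)]

def mean_power(x):
    return sum(abs(v) ** 2 for v in x) / len(x)

N = 2000
f = math.pi / 4
# A filter centred on +pi/2 favours the positive-frequency half of the spectrum.
fwd_gain = mean_power(complex_one_pole(tone(+f, N), 0.8, math.pi / 2))
rev_gain = mean_power(complex_one_pole(tone(-f, N), 0.8, math.pi / 2))
print(fwd_gain > rev_gain)
```

A single pole gives only modest separation between the two halves of the spectrum. A practical system would presumably use a higher-order design, but the recursion shows why a complex IIR structure can keep the per-sample calculation load low.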