This paper describes speech processing work in which articulator movements are used in conjunction with the acoustic speech signal and/or linguistic information. By ``articulator movements,'' we mean the changing positions of human speech articulators such as the tongue and lips, which may be recorded by electromagnetic articulography (EMA), amongst other articulography techniques. Specifically, we provide an overview of: i) inversion mapping techniques, where we estimate articulator movements from a given new speech waveform automatically; ii) statistical voice conversion and speech synthesis techniques which use articulator movements as part of the process to generate synthetic speech, and also make it intuitively controllable via articulation; and iii) automatic prediction (or synthesis) of articulator movements from any given new text input.
This paper presents a real-time robust formant tracking system for speech using a real-time phase equalization-based autoregressive exogenous model (PEAR) with electroglottography (EGG). Although linear predictive coding (LPC) analysis is a popular method for estimating formant frequencies, its estimation accuracy is known to degrade for speech with a high fundamental frequency (F0), because the harmonic structure of the glottal source spectrum deviates further from the Gaussian noise assumption in LPC as F0 increases. In contrast, PEAR, which employs phase equalization and LPC with an impulse train as the glottal source signal, estimates formant frequencies robustly even for speech with high F0. However, PEAR has higher computational complexity than LPC. In this study, to reduce this computational complexity, a novel formulation of PEAR was derived, which enabled us to implement PEAR for a real-time robust formant tracking system. In addition, since PEAR requires the timings of glottal closures, a stable detection method using EGG was devised. We developed the real-time system on a digital signal processor and showed that, for both synthesized and natural vowels, the proposed method can estimate formant frequencies more robustly than LPC over a wider range of F0.
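As a point of reference for the LPC baseline mentioned above, the following is a minimal sketch of conventional autocorrelation-method LPC formant estimation. It is not PEAR and does not use EGG. The function names, the two-formant synthetic vowel, the frame position, and the peak-picking on the all-pole envelope are all illustrative assumptions rather than details from the paper.

```python
import cmath
import math


def lpc_autocorrelation(frame, order):
    """Return [1, a1, ..., ap] for A(z) = 1 + sum(a_k z^-k) via Levinson-Durbin."""
    n = len(frame)
    r = [sum(frame[i] * frame[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def envelope_peaks(a, fs, step_hz=5.0):
    """Frequencies (Hz) of interior local maxima of the envelope 1/|A(e^{jw})|."""
    freqs, mags = [], []
    f = 0.0
    while f <= fs / 2.0:
        w = 2.0 * math.pi * f / fs
        denom = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
        freqs.append(f)
        mags.append(1.0 / abs(denom))
        f += step_hz
    return [freqs[i] for i in range(1, len(mags) - 1)
            if mags[i - 1] < mags[i] >= mags[i + 1]]


def resonator(x, freq, bw, fs):
    """Two-pole resonator used only to build a synthetic test vowel."""
    r = math.exp(-math.pi * bw / fs)
    c1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    c2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        y0 = s + c1 * y1 + c2 * y2
        y.append(y0)
        y2, y1 = y1, y0
    return y


def synthetic_vowel(fs=8000, f0=100.0,
                    formants=((700.0, 100.0), (1200.0, 100.0)), n=800):
    """Impulse train at f0 passed through cascaded resonators (assumed values)."""
    period = int(round(fs / f0))
    x = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bw in formants:
        x = resonator(x, freq, bw, fs)
    return x


def estimate_formants(signal, fs, order=4, start=400, length=320):
    """Hamming-window one frame, fit LPC, and pick envelope peaks."""
    frame = signal[start:start + length]
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1)))
                for i, s in enumerate(frame)]
    return envelope_peaks(lpc_autocorrelation(windowed, order), fs)
```

Calling `estimate_formants(synthetic_vowel(), 8000)` returns candidate formant frequencies for the low-F0 test vowel. Raising `f0` in `synthetic_vowel` is one way to explore how this baseline behaves as the harmonics become sparser.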
Recent optical wireless acoustic sensors have demonstrated the possibility of simultaneously sensing massive numbers of audio channels in real time. Although this technology has enabled the deployment of large-scale applications, it raises new challenges from the computational perspective. In this regard, Graphics Processing Units (GPUs) provide significant parallel computational power. However, not all existing algorithms can be implemented on a GPU in a straightforward way. This paper discusses signal processing schemes and implementation strategies to achieve real-time broadband beamforming using a single GPU card. The experiments presented here show our prototype implementation handling over 120 audio channels in real time. The experimental results further highlight the particular advantages of using a video camera-based approach to improve the beamforming performance.
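To make the beamforming operation concrete, here is a minimal CPU sketch of time-domain delay-and-sum beamforming for a linear array. It is not the GPU implementation described above. The far-field plane-wave model, the integer-sample delays, the speed of sound, and every function name are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_delays(mic_x, angle_deg, fs):
    """Integer sample delays for a far-field plane wave on a linear array.

    mic_x holds microphone positions in metres along the array axis.
    """
    s = math.sin(math.radians(angle_deg))
    raw = [x * s / SPEED_OF_SOUND * fs for x in mic_x]
    base = min(raw)  # shift so that every delay is non-negative
    return [int(round(d - base)) for d in raw]


def simulate_array(source, delays):
    """Build one channel per microphone by delaying the source (zero-padded)."""
    return [[source[n - d] if n >= d else 0.0 for n in range(len(source))]
            for d in delays]


def delay_and_sum(channels, delays):
    """Advance each channel by its steering delay, then average the channels."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    m = len(channels)
    return [sum(ch[i + d] for ch, d in zip(channels, delays)) / m
            for i in range(n)]


def mean_power(x):
    """Average squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)
```

Steering toward the true arrival angle aligns the channels so they add coherently, whereas steering elsewhere leaves them misaligned and lowers the output power. Each output sample depends only on a fixed set of input samples, which is the kind of independence that lends itself to parallel hardware.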
We investigated a new communication-aid system focused on bone conduction through a tooth, for listening to and recording voices. In this paper, we develop a tooth-conduction microphone (TCM) and evaluate the articulation of tooth-conducted voice (TCV). Because the TCM has the shape of one's dental mold, it is wearable like a mouthpiece. Moreover, it can extract tooth vibration during phonation as TCV. To evaluate the articulation of TCV, we adopted monosyllable articulation for subjective assessment and linear predictive coding cepstral distance for objective assessment. The articulation results show that TCV is not sufficiently clear compared to air-conducted voice. However, TCV is confirmed to be robust to environmental noise, because the accuracy rate does not decrease when the TCV is recorded under high ambient noise.
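The objective measure named above can be sketched as follows. This is a minimal example that assumes LPC coefficients are already available in the form A(z) = 1 + a1 z^-1 + ... + ap z^-p, converts them to cepstral coefficients with the standard recursion, and uses one commonly used dB form of the cepstral distance. The number of cepstral coefficients and the function names are assumptions, and the paper may define the distance differently.

```python
import math


def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n of the all-pole model 1/A(z), with a = [1, a1, ..., ap]."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]


def lpc_cepstral_distance_db(a_ref, a_test, n_ceps=12):
    """Cepstral distance in dB between two LPC models (one common definition)."""
    c_ref = lpc_to_cepstrum(a_ref, n_ceps)
    c_test = lpc_to_cepstrum(a_test, n_ceps)
    squared = sum((x - y) ** 2 for x, y in zip(c_ref, c_test))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)
```

In an evaluation of this kind, `a_ref` would come from a frame of the reference (for example, air-conducted) voice and `a_test` from the time-aligned frame of the voice under test, with the distance averaged over frames.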
In the financial industry, face-to-face conversation is essential for sales. As with call-center monitoring, there is a significant need to monitor these conversations for compliance checks. In certain business scenarios, there is a need to record an employee's speech while protecting the customers' confidentiality and privacy. In this paper, we propose a small-scale microphone array system specially designed to record only the agent's speech. For the suppression of the customer's speech, we used CSP-based post-filtering. However, with a small number of microphones, it is difficult to suppress unwanted speech completely, because post-filtering that uses correlations between the multiple channels is often affected by spatial aliasing between speakers. We introduced a weighted CSP to attenuate bins susceptible to the interfering speaker. We also introduced flooring after the post-filtering to mask residuals. This combination helps prevent the customer's speech from being transcribed.
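For readers unfamiliar with CSP (cross-power spectrum phase, also known as GCC-PHAT), the following minimal sketch computes CSP coefficients between two channels and reads the inter-channel delay from the peak. It shows only the basic coefficient. The weighted CSP, the post-filter, and the flooring described above are not reproduced, and the naive DFT, the circular-shift test signal, and all function names are illustrative assumptions.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(N^2) discrete Fourier transform, adequate for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def csp(x1, x2, eps=1e-12):
    """CSP coefficients: inverse DFT of the phase-only cross-power spectrum."""
    n = len(x1)
    spec1, spec2 = dft(x1), dft(x2)
    cross = [a.conjugate() * b for a, b in zip(spec1, spec2)]
    whitened = [c / (abs(c) + eps) for c in cross]  # keep phase, drop magnitude
    return [sum(whitened[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def estimate_delay(x1, x2):
    """Lag (in samples) by which x2 trails x1, taken from the CSP peak."""
    coeffs = csp(x1, x2)
    n = len(coeffs)
    peak = max(range(n), key=lambda t: coeffs[t])
    return peak if peak <= n // 2 else peak - n


def make_delayed_pair(n=64, delay=3, seed=0):
    """Hypothetical test data: a broadband frame and its circularly delayed copy."""
    rng = random.Random(seed)
    x1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    x2 = x1[-delay:] + x1[:-delay]
    return x1, x2
```

Because the magnitude is normalized away, the peak location depends only on the phase difference between channels, which is what ties a CSP peak to a speaker direction in a small array.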
Sensing of high-definition three-dimensional (3D) sound-space information is of crucial importance for realizing total 3D spatial sound technology. We have proposed a sensing method for 3D sound-space information using symmetrically and densely arranged microphones. This method is called SENZI (Symmetrical object with ENchased Zillion microphones). In the SENZI method, signals recorded by the microphones are simply weighted and summed to synthesize a listener's head-related transfer functions (HRTFs), reflecting the direction in which the listener is facing even after recording. The SENZI method is being developed as a real-time system using a spherical microphone array and field-programmable gate arrays (FPGAs). In the SENZI system, 252 electret condenser microphones (ECMs) were almost uniformly distributed on a rigid sphere. The deviations of the microphone frequency responses were compensated for using the transfer function of the rigid sphere. To avoid the degradation of the accuracy of the synthesized sound space by microphone internal noise, particularly in the low-frequency region, we analyzed the effect of the signal-to-noise ratio (SNR) of the microphones on the accuracy of synthesized sound-space information by controlling the condition numbers of the matrix constructed from transfer functions. On the basis of the results of these analyses, a compact SENZI system was implemented. Results of experiments indicated that 3D sound-space information was well expressed using the system.
This paper describes data-glove-driven vocal tract configuration methods. Rather than mapping hand gestures directly to sounds with a data glove, intuitive manipulation of the data glove was applied to configure the vocal tract shape. Two manipulation methods were proposed and then evaluated in terms of the vocal tract area function, resulting formant frequencies and ease of manipulation. It was revealed that although both methods were capable of producing the resulting formant frequencies with reasonable accuracy for steady vowel production, the method with three fingers enabled users to easily configure the vocal tract shape. Moreover, the effect of training in manipulating the data glove to configure the vocal tract shape for continuous vowels was evaluated in terms of their sound spectrograms and the distribution of the resulting formant frequencies. An experiment to evaluate the effectiveness of training showed that beginners were able to produce continuous vowels within about three training sessions.