Acoustical Science and Technology

INVITED REVIEW

The use of articulatory movement data in speech synthesis applications: An overview — Application of articulatory movements using machine learning algorithms —

Korin Richmond, Zhenhua Ling, Junichi Yamagishi

2015, Vol. 36, No. 6, pp. 467-477
Published: 2015
Released online: 2015/11/01

DOI: https://doi.org/10.1250/ast.36.467

Journal, free access

This paper describes speech processing work in which articulator movements are used in conjunction with the acoustic speech signal and/or linguistic information. By "articulator movements," we mean the changing positions of human speech articulators such as the tongue and lips, which may be recorded by electromagnetic articulography (EMA), amongst other articulography techniques. Specifically, we provide an overview of: i) inversion mapping techniques, which automatically estimate articulator movements from a new speech waveform; ii) statistical voice conversion and speech synthesis techniques, which use articulator movements as part of the process of generating synthetic speech and also make it intuitively controllable via articulation; and iii) automatic prediction (or synthesis) of articulator movements from new text input.

—Special Issue on Applied System—

PAPERS

Real-time robust formant estimation system using a phase equalization-based autoregressive exogenous model

Hiroki Oohashi, Sadao Hiroya, Takemi Mochida

2015, Vol. 36, No. 6, pp. 478-488
Published: 2015
Released online: 2015/11/01

DOI: https://doi.org/10.1250/ast.36.478

Journal, free access

This paper presents a real-time robust formant tracking system for speech using a real-time phase equalization-based autoregressive exogenous model (PEAR) with electroglottography (EGG). Although linear predictive coding (LPC) analysis is a popular method for estimating formant frequencies, its estimation accuracy is known to degrade for speech with a high fundamental frequency (F₀), since the harmonic structure of the glottal source spectrum deviates further from the Gaussian noise assumption in LPC as F₀ increases. In contrast, PEAR, which employs phase equalization and LPC with an impulse train as the glottal source signal, estimates formant frequencies robustly even for speech with high F₀. However, PEAR has higher computational complexity than LPC. In this study, to reduce this computational complexity, a novel formulation of PEAR was derived, which enabled us to implement PEAR in a real-time robust formant tracking system. In addition, since PEAR requires the timings of glottal closures, a stable detection method using EGG was devised. We developed the real-time system on a digital signal processor and showed that, for both synthesized and natural vowels, the proposed method can estimate formant frequencies more robustly than LPC over a wider range of F₀.
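
As a point of reference for the LPC baseline mentioned in this abstract, the following is a minimal sketch of conventional autocorrelation-method LPC formant estimation. It is not the PEAR method and not the authors' implementation. The function names (`lpc_autocorr`, `formants_from_lpc`, `synth_vowel`), the sampling rate, the formant values, and the white-noise excitation are all assumptions made for illustration.

```python
import numpy as np


def lpc_autocorr(x, order):
    """Return LPC coefficients [1, a1, ..., ap] (autocorrelation method)."""
    x = x * np.hamming(len(x))
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations: R a = -r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))


def formants_from_lpc(a, fs):
    """Convert the upper-half-plane roots of the LPC polynomial to Hz."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]  # keep one root of each conjugate pair
    return np.sort(np.angle(roots) * fs / (2 * np.pi))


def synth_vowel(formants, bandwidths, fs, n, seed=0):
    """Hypothetical test signal: white noise through an all-pole filter."""
    poles = []
    for f, bw in zip(formants, bandwidths):
        radius = np.exp(-np.pi * bw / fs)
        theta = 2 * np.pi * f / fs
        poles += [radius * np.exp(1j * theta), radius * np.exp(-1j * theta)]
    a = np.real(np.poly(poles))  # denominator coefficients, a[0] == 1
    x = np.random.default_rng(seed).standard_normal(n)
    y = np.zeros(n)
    for i in range(n):
        acc = x[i]
        for k in range(1, len(a)):
            if i - k >= 0:
                acc -= a[k] * y[i - k]
        y[i] = acc
    return y


fs = 8000                       # assumed sampling rate in Hz
true_formants = [700.0, 1200.0]  # assumed resonances in Hz
signal = synth_vowel(true_formants, [80.0, 100.0], fs, n=16000)
est = formants_from_lpc(lpc_autocorr(signal, order=4), fs)
print(est)
```

Because the excitation here is white noise, the Gaussian source assumption holds and the estimates should land near the assumed resonances. The abstract's point is that this no longer holds well when the source is a harmonic series with high F₀.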

GPU-based real-time beamforming for large arrays of optical wireless acoustic sensors

Gabriel Pablo Nava, Hoang Duy Nguyen, Yusuke Hioka, Yutaka Kamamoto, T ...

2015, Vol. 36, No. 6, pp. 489-499
Published: 2015
Released online: 2015/11/01

DOI: https://doi.org/10.1250/ast.36.489

Journal, free access

Recent optical wireless acoustic sensors have demonstrated that massive numbers of audio channels can be sensed simultaneously in real time. Although this technology has enabled the deployment of large-scale applications, it raises new challenges from the computational perspective. In this regard, graphics processing units (GPUs) provide significant parallel computational power. However, not all existing algorithms can be implemented on a GPU in a straightforward way. This paper discusses signal processing schemes and implementation strategies to achieve real-time broadband beamforming using a single GPU card. The experiments presented here show our prototype implementation handling over 120 audio channels in real time. The experimental results further highlight the particular advantages of using a video camera-based approach to improve the beamforming performance.
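
To make "broadband beamforming" concrete, here is a minimal CPU sketch of a generic frequency-domain delay-and-sum beamformer in NumPy. It is not the GPU scheme described in the paper. The array geometry, the speed of sound, the far-field plane-wave model, and all function names are assumptions chosen for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed


def arrival_delays(mic_positions, direction):
    """Delay (s) at each mic, relative to the origin, for a far-field
    plane wave arriving from the unit vector `direction`."""
    return -(mic_positions @ direction) / SPEED_OF_SOUND


def apply_delays(signals, delays, fs):
    """Delay each row of `signals` by `delays` seconds using an FFT
    phase shift (the shift is circular)."""
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    phase = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(np.fft.rfft(signals, axis=1) * phase, n=n, axis=1)


def delay_and_sum(signals, mic_positions, direction, fs):
    """Steer toward `direction`: undo the arrival delays, then average."""
    delays = arrival_delays(mic_positions, direction)
    return apply_delays(signals, -delays, fs).mean(axis=0)


def unit(angle_deg):
    """Unit vector in the x-y plane at the given angle from the x axis."""
    a = np.deg2rad(angle_deg)
    return np.array([np.cos(a), np.sin(a), 0.0])


fs = 16000
n_mics = 8
mic_positions = np.zeros((n_mics, 3))
mic_positions[:, 0] = 0.05 * np.arange(n_mics)  # 5 cm linear array on x

# Simulate one broadband source arriving from 60 degrees.
source = np.random.default_rng(0).standard_normal(4096)
true_dir = unit(60.0)
mics = apply_delays(np.tile(source, (n_mics, 1)),
                    arrival_delays(mic_positions, true_dir), fs)

on_target = delay_and_sum(mics, mic_positions, true_dir, fs)
off_target = delay_and_sum(mics, mic_positions, unit(120.0), fs)
power_on = np.mean(on_target ** 2)
power_off = np.mean(off_target ** 2)
print(power_on, power_off)
```

Each frequency bin and each channel is processed independently before the final average, which is the kind of structure that tends to map well onto parallel hardware. A real-time system would additionally need block-wise (streaming) processing, which this sketch omits.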

Evaluation of tooth-conduction microphone for communication under noisy environment

Yusuke Torikai, Dai Kuze, Junko Kurosawa, Yasuhiro Oikawa, Yoshio Yama ...

2015, Vol. 36, No. 6, pp. 500-506
Published: 2015
Released online: 2015/11/01

DOI: https://doi.org/10.1250/ast.36.500

Journal, free access

We investigated a new communication-aid system based on bone conduction through a tooth, for listening to and recording voices. In this paper, we developed a tooth-conduction microphone (TCM) and evaluated the articulation of tooth-conducted voice (TCV). Because the TCM has the shape of the wearer's dental mold, it can be worn like a mouthpiece. Moreover, it can extract tooth vibration during phonation as TCV. To evaluate the articulation of TCV, we adopted monosyllable articulation for subjective assessment and linear predictive coding cepstral distance for objective assessment. The articulation results show that TCV is not sufficiently clear compared to air-conducted voice. However, it is confirmed that TCV is robust to environmental noise, because the accuracy rate does not decrease when the TCV is recorded under high ambient noise.

Effective speech suppression using a two-channel microphone array for privacy protection in face-to-face sales monitoring

Osamu Ichikawa, Takashi Fukuda, Ryuki Tachibana

2015, Vol. 36, No. 6, pp. 507-515
Published: 2015
Released online: 2015/11/01

DOI: https://doi.org/10.1250/ast.36.507

Journal, free access

In the financial industry, face-to-face conversation is essential for sales. As with call-center monitoring, there is a significant need to monitor these conversations for compliance checks. In certain business scenarios, an employee's speech must be recorded while protecting the customers' confidentiality and privacy. In this paper, we propose a small-scale microphone array system specially designed to record only the agent's speech. To suppress the customer's speech, we used CSP-based post-filtering. However, with a small number of microphones, it is difficult to suppress unwanted speech completely, because post-filtering that uses correlations between multiple channels is often affected by spatial aliasing between speakers. We introduced weighted CSP to attenuate the bins susceptible to the interfering speaker. We also introduced flooring after the post-filtering to mask residuals. This combination helps prevent the customer's speech from being transcribed.

Sound-space recording and binaural presentation system based on a 252-channel microphone array

Shuichi Sakamoto, Satoshi Hongo, Takuma Okamoto, Yukio Iwaya, Yôi ...

2015, Vol. 36, No. 6, pp. 516-526
Published: 2015
Released online: 2015/11/01

DOI: https://doi.org/10.1250/ast.36.516

Journal, free access

Sensing of high-definition three-dimensional (3D) sound-space information is of crucial importance for realizing total 3D spatial sound technology. We have proposed a sensing method for 3D sound-space information using symmetrically and densely arranged microphones. This method is called SENZI (Symmetrical object with ENchased Zillion microphones). In the SENZI method, signals recorded by the microphones are simply weighted and summed to synthesize a listener's head-related transfer functions (HRTFs), reflecting the direction in which the listener is facing even after recording. The SENZI method is being developed as a real-time system using a spherical microphone array and field-programmable gate arrays (FPGAs). In the SENZI system, 252 electret condenser microphones (ECMs) were almost uniformly distributed on a rigid sphere. The deviations of the microphone frequency responses were compensated for using the transfer function of the rigid sphere. To avoid degradation of the accuracy of the synthesized sound space by microphone internal noise, particularly in the low-frequency region, we analyzed the effect of the signal-to-noise ratio (SNR) of the microphones on the accuracy of synthesized sound-space information by controlling the condition numbers of a matrix constructed from transfer functions. On the basis of the results of these analyses, a compact SENZI system was implemented. Experimental results indicated that 3D sound-space information was well expressed using the system.

Data-glove-driven vocal tract configuration methods for vowel synthesis

Kohichi Ogata, Kohei Matsumura, Yusuke Matsuda

2015, Vol. 36, No. 6, pp. 527-536
Published: 2015
Released online: 2015/11/01

DOI: https://doi.org/10.1250/ast.36.527

Journal, free access

This paper describes data-glove-driven vocal tract configuration methods. Unlike direct mapping from hand gestures to sounds using a data glove, intuitive manipulation of the data glove was applied to configure the vocal tract shape. Two manipulation methods were proposed and then evaluated in terms of the vocal tract area function, the resulting formant frequencies, and ease of manipulation. Although both methods were capable of producing the resulting formant frequencies with reasonable accuracy for steady vowel production, the method with three fingers enabled users to configure the vocal tract shape easily. Moreover, the effect of training in manipulating the data glove to configure the vocal tract shape for continuous vowels was evaluated in terms of sound spectrograms and the distribution of the resulting formant frequencies. An experiment on the effectiveness of training showed that beginners were able to produce continuous vowels within about three training sessions.

ACOUSTICAL LETTER

Development of dynamic crosstalk cancellation system for multiple-listener binaural reproduction

Hiroaki Kurabayashi, Makoto Otani, Masami Hashimoto, Mizue Kayama

2015, Vol. 36, No. 6, pp. 537-539
Published: 2015
Released online: 2015/11/01

DOI: https://doi.org/10.1250/ast.36.537

Journal, free access
