Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
Volume 34, Issue 2
FOREWORD
INVITED PAPER
INVITED REVIEW
  • Yoichi Yamashita
    2013 Volume 34 Issue 2 Pages 73-79
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    Speech conveys not only linguistic information but also supplemental information that cannot be inferred from written language, such as attitude, speaking style, intention, emotion, and mental state; this supplemental information is called para-linguistic or non-linguistic information. It plays an important role in smooth and natural communication through spoken language. This paper reviews recognition and synthesis techniques for speech communication, focusing on emotion and emphasis, as well as corpora that are indispensable to the development of current speech technologies.
    Download PDF (84K)
PAPERS
  • Masato Nakayama, Takanobu Nishiura, Yoichi Yamashita, Noboru Nakasako
    2013 Volume 34 Issue 2 Pages 80-88
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    Beamforming with a microphone array is an ideal candidate for distant-talking speech recognition. An adaptive beamformer can achieve beamforming with a small microphone array, but it has difficulty extracting speech from a distant moving talker and reducing moving noise sources, because it must rapidly train long multi-channel adaptive filters using noises observed with the microphone array. However, if the positions of both talkers and noise sources can be estimated, the adaptive filters may not need to be trained in real noisy environments. We therefore propose a multiple-null-steering beamformer based on both talker and noise localization that does not require adaptive training with observed noises. We confirmed the validity and effectiveness of the proposed method through computer simulations and evaluation experiments in real noisy environments.
    Download PDF (1967K)
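    The null-steering idea above can be illustrated with a minimal narrowband sketch: given direction estimates for one talker and two noise sources, the array weights are constrained to unit gain toward the talker and zero gain toward each noise direction, so no adaptive training on observed noise is required. This is a simplified free-field, far-field illustration, not the authors' implementation; the array geometry, frequency, and directions are assumed for the example.

    # Minimal narrowband null-steering beamformer sketch (not the authors' code).
    # Assumes a free-field, far-field model, a uniform linear array, and known
    # (hypothetical) directions for one talker and two noise sources.
    import numpy as np

    def steering_vector(theta_deg, n_mics=8, spacing=0.04, freq=2000.0, c=343.0):
        """Far-field steering vector of a uniform linear array."""
        theta = np.deg2rad(theta_deg)
        delays = np.arange(n_mics) * spacing * np.sin(theta) / c   # per-microphone delay [s]
        return np.exp(-2j * np.pi * freq * delays)

    # Hypothetical localization results: talker at 0 deg, noises at -50 and +35 deg.
    a_talker = steering_vector(0.0)
    a_noise = [steering_vector(-50.0), steering_vector(35.0)]

    # Constraints: unit gain toward the talker, nulls toward each noise direction.
    C = np.column_stack([a_talker] + a_noise)       # constraint matrix
    g = np.array([1.0, 0.0, 0.0])                   # desired responses
    w = C @ np.linalg.solve(C.conj().T @ C, g)      # minimum-norm constrained weights

    for label, a in [("talker", a_talker), ("noise 1", a_noise[0]), ("noise 2", a_noise[1])]:
        print(f"{label}: |response| = {abs(w.conj() @ a):.3f}")

    A broadband version would compute such weights per frequency bin of a short-time Fourier transform; real rooms also add reverberation that this free-field sketch ignores.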
  • Masako Fujimoto, Kikuo Maekawa
    2013 Volume 34 Issue 2 Pages 89-93
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    A case study on the correlation between phonation type and paralinguistic information in Japanese was carried out using a high-speed digital video imaging system. The results showed that ``breathy'' and ``creaky'' phonations corresponded to ``disappointment''- and ``suspicion''-related utterances, respectively. The influence of paralinguistic information stretches over segments including voiceless consonants, which means that the alteration due to paralinguistic information is not limited to voice quality but extends to the whole setting of the larynx. These findings are in accord with those of our articulatory study. They suggest that the domain of the phonatory and articulatory settings due to paralinguistic information is the whole utterance, rather than individual segments.
    Download PDF (583K)
  • Noriaki Katagiri, Goh Kawai
    2013 Volume 34 Issue 2 Pages 94-104
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    Within the context of English taught solely in English at Japan's secondary schools, no research quantifies the differences between native instructors (first language English, who may or may not speak Japanese) and non-native instructors (first language Japanese, second language English). We developed a video corpus of an English language classroom and examined the speech of three native instructors and one non-native instructor. The corpus contains 49 English lessons of 45 minutes each at a Japanese public high school with monolingual learners of English as a foreign language. The native and non-native instructors occasionally taught together. Almost all speech in the lessons was in English. We compared lexical tokens and types found in our transcriptions with a collection of typical classroom English dialogues and with a wordlist created from large bodies of written and spoken English. We obtained the distributions of words, and the words preferred by either native or non-native instructors. Results suggest that (a) native and non-native instructors share a core vocabulary of classroom English, (b) native instructors teach vocabulary depth via open-ended conversations, (c) non-native instructors teach vocabulary breadth via textbook explanations, and (d) native and non-native instructors differ in teaching roles but not in language ability.
    Download PDF (706K)
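    The token/type comparison described above can be sketched with a few lines of counting code; the transcript snippets and the tokenizer below are hypothetical stand-ins for the corpus transcriptions and reference wordlists used in the study.

    # Sketch of token/type counting and instructor comparison on transcribed speech.
    # The two transcript snippets are hypothetical, not data from the corpus.
    import re

    def tokenize(text):
        """Lowercase word tokens; a crude stand-in for the study's transcription conventions."""
        return re.findall(r"[a-z']+", text.lower())

    native = "OK everyone, open your textbooks. What do you think the writer means here?"
    nonnative = "Open your textbooks to page ten. This word means environment, OK?"

    tok_n, tok_nn = tokenize(native), tokenize(nonnative)
    types_n, types_nn = set(tok_n), set(tok_nn)

    print("native tokens/types:", len(tok_n), "/", len(types_n))
    print("non-native tokens/types:", len(tok_nn), "/", len(types_nn))
    print("shared vocabulary:", sorted(types_n & types_nn))
    print("native-only:", sorted(types_n - types_nn))
    print("non-native-only:", sorted(types_nn - types_n))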
  • Yutaka Kamamoto, Takehiro Moriya, Noboru Harada, Yusuke Hiwasaki
    2013 Volume 34 Issue 2 Pages 105-112
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    This paper presents a set of low-complexity tools for lossless coding of the G.711 bitstream based on linear prediction. One is an algorithm for quantizing the PARCOR/reflection coefficients, and the other is an estimation method for the optimal prediction order. Both tools are based on a criterion that minimizes the entropy of the prediction residual signal and can be implemented in fixed-point arithmetic with very low complexity. Since the proposed methods show efficient performance in terms of compression and complexity, they have been adopted in Recommendation ITU-T G.711.0, a new standard for lossless compression of G.711 (A-law/μ-law logarithmic PCM) payloads.
    Download PDF (661K)
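    The order-selection idea, picking the prediction order that minimizes an entropy-based estimate of the coded residual plus the coefficient side information, can be sketched as below. The floating-point Levinson-Durbin recursion, the Gaussian entropy proxy, and the test signal are generic assumptions for illustration; the tools standardized in G.711.0 operate in fixed point and differ in detail.

    # Sketch of prediction-order selection by minimizing an entropy-based estimate
    # of the residual code length (generic illustration, not the G.711.0 algorithm).
    import numpy as np

    def levinson(r, max_order):
        """Return prediction-error energies for orders 0..max_order (Levinson-Durbin)."""
        err, a, energies = r[0], np.zeros(max_order + 1), [r[0]]
        for p in range(1, max_order + 1):
            k = (r[p] - np.dot(a[1:p], r[p-1:0:-1])) / err   # reflection (PARCOR) coefficient
            a_new = a.copy()
            a_new[p] = k
            a_new[1:p] = a[1:p] - k * a[p-1:0:-1]
            a, err = a_new, err * (1.0 - k * k)
            energies.append(err)
        return np.array(energies)

    def choose_order(frame, max_order=12, coef_bits=7):
        """Pick the order minimizing estimated bits: residual entropy + coefficient side info."""
        r = np.array([np.dot(frame[:len(frame)-lag], frame[lag:]) for lag in range(max_order + 1)])
        var = levinson(r, max_order) / len(frame)             # residual variance per order
        bits = 0.5 * len(frame) * np.log2(2 * np.pi * np.e * np.maximum(var, 1e-12)) \
               + coef_bits * np.arange(max_order + 1)
        return int(np.argmin(bits))

    rng = np.random.default_rng(0)
    x = rng.standard_normal(320)                              # hypothetical signal frame
    for n in range(2, len(x)):                                # shape it into a 2nd-order AR signal
        x[n] += 1.3 * x[n-1] - 0.6 * x[n-2]
    print("selected prediction order:", choose_order(x))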
  • Kunitoshi Motoki
    2013 Volume 34 Issue 2 Pages 113-122
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    A method of computing the acoustic characteristics of a simplified three-dimensional vocal-tract model with wall impedance is presented. The acoustic field is represented in terms of both plane waves and higher order modes in tubes. This model is constructed using an asymmetrically connected structure of rectangular acoustic tubes, and can parametrically represent acoustic characteristics at higher frequencies where the assumption of plane wave propagation does not hold. The propagation constants of the higher order modes are calculated taking account of wall impedance. The resonance characteristics of the vocal-tract model are evaluated using the radiated acoustic power. Computational results show an increase in bandwidth and a small upward shift of peaks, particularly at lower frequencies, as already suggested by the one-dimensional model. It is also shown that the sharp peaks at higher frequencies are less sensitive to the values of wall impedance even though the attenuation of the higher order modes is larger than that of plane waves.
    Download PDF (860K)
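    For context, the one-dimensional plane-wave baseline that this model extends can be sketched by multiplying the acoustic chain (ABCD) matrices of concatenated tube sections to obtain a glottis-to-lips transfer function. The two-tube area function below is hypothetical, the walls are taken as rigid, and neither the higher-order modes nor the wall impedance treated in the paper appears in this sketch.

    # Plane-wave chain-matrix model of a concatenated-tube vocal tract (1-D baseline only).
    import numpy as np

    RHO, C = 1.2, 343.0                       # air density [kg/m^3], sound speed [m/s]
    # Hypothetical /a/-like area function: (length [m], cross-section area [m^2]) per section.
    tubes = [(0.09, 1.0e-4), (0.08, 7.0e-4)]

    def chain_matrix(freq):
        """Overall ABCD matrix of the tube sections, glottis end to lip end."""
        k = 2 * np.pi * freq / C
        M = np.eye(2, dtype=complex)
        for length, area in tubes:
            z0 = RHO * C / area               # characteristic acoustic impedance of the section
            T = np.array([[np.cos(k * length), 1j * z0 * np.sin(k * length)],
                          [1j * np.sin(k * length) / z0, np.cos(k * length)]])
            M = M @ T
        return M

    freqs = np.arange(50.0, 5000.0, 10.0)
    # With an open (zero-pressure) lip termination, U_lip / U_glottis = 1 / D(f).
    gain = np.array([1.0 / abs(chain_matrix(f)[1, 1]) for f in freqs])
    peaks = freqs[1:-1][(gain[1:-1] > gain[:-2]) & (gain[1:-1] > gain[2:])]
    print("estimated formant peaks [Hz]:", peaks[:4])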
  • Duy Khanh Ninh, Masanori Morise, Yoichi Yamashita
    2013 Volume 34 Issue 2 Pages 123-132
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    A minimum generation error (MGE) criterion has been proposed for model training in hidden Markov model (HMM)-based speech synthesis to minimize the error between generated and original static parameter sequences of speech. However, dynamic properties of speech parameters are ignored in the generation error definition. In this study, we incorporate these dynamic properties into MGE training by introducing the error component of dynamic features (i.e., delta and delta-delta parameters) into the generation error function. We propose two methods for setting the weight associated with the additional error component. In the fixed weighting approach, this weight is kept constant over the course of speech. In the adaptive weighting approach, it is adjusted according to the degree of dynamicity of speech segments. An objective evaluation shows that the newly derived MGE criterion with the adaptive weighting method results in comparable performance for the static feature and better performance for the delta feature compared with the baseline MGE criterion. Subjective listening tests exhibit a small but statistically significant improvement in the quality of speech synthesized by the proposed technique. The newly derived criterion improves the capability of HMMs in capturing dynamic properties of speech without increasing the computational complexity of the training process compared with the baseline criterion.
    Download PDF (438K)
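    The weighting scheme can be sketched generically as a generation-error measure that combines a static error with a weighted delta (dynamic) error, where the weight is either fixed or scaled by the local dynamicity of the natural trajectory. The trajectories and the adaptive rule below are hypothetical illustrations, not the paper's HMM training procedure.

    # Sketch of a generation error with a fixed or dynamicity-adaptive delta weight.
    import numpy as np

    def delta(c):
        """Simple delta (dynamic) features via a central difference over frames."""
        return np.gradient(c, axis=0)

    def generation_error(c_gen, c_nat, weight):
        """Per-frame static error plus weighted delta error."""
        e_static = np.sum((c_gen - c_nat) ** 2, axis=1)
        e_delta = np.sum((delta(c_gen) - delta(c_nat)) ** 2, axis=1)
        return e_static + weight * e_delta

    def adaptive_weight(c_nat, base=1.0):
        """Hypothetical adaptive rule: weight delta errors more where speech changes fast."""
        dynamicity = np.sum(delta(c_nat) ** 2, axis=1)
        return base * dynamicity / (np.mean(dynamicity) + 1e-12)

    rng = np.random.default_rng(1)
    T, D = 200, 24                                                 # frames, feature dimensions
    c_nat = np.cumsum(rng.standard_normal((T, D)) * 0.1, axis=0)   # "natural" trajectory
    c_gen = c_nat + rng.standard_normal((T, D)) * 0.05             # "generated" trajectory

    fixed = generation_error(c_gen, c_nat, weight=0.5).mean()
    adaptive = generation_error(c_gen, c_nat, weight=adaptive_weight(c_nat)).mean()
    print(f"mean error, fixed weight: {fixed:.4f}; adaptive weight: {adaptive:.4f}")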
  • Yasunari Obuchi, Ryu Takeda, Masahito Togami
    2013 Volume 34 Issue 2 Pages 133-141
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    In this paper, we propose a new noise suppression method that is best used as a preprocessor for time-lag speech recognition. Assuming that a time lag of a few seconds is acceptable in various speech recognition applications, the proposed method is realized as a combination of forward and backward estimation flows over time. Each estimation flow is based on the optimally modified log-spectral amplitude (OM-LSA) speech estimator, but a look-ahead estimation mechanism is additionally incorporated to make the estimation more robust. Evaluation experiments using various databases confirm that speech recognition accuracy is greatly improved by adding the proposed method to the existing system.
    Download PDF (959K)
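    The forward/backward structure can be sketched as two estimation passes over the same spectrogram, one in natural time order and one time-reversed (which the accepted time lag makes possible), whose spectral gains are then combined. The simple recursive noise tracker and Wiener-style gain below stand in for the OM-LSA estimator, and the spectra are synthetic.

    # Sketch of forward/backward noise-suppression passes combined into one gain.
    # The per-frame estimator is a crude stand-in for OM-LSA, not an implementation of it.
    import numpy as np

    def suppress(power, alpha=0.95, floor=0.1):
        """One pass: track noise recursively and compute a Wiener-style spectral gain."""
        noise = power[0].copy()
        gains = np.empty_like(power)
        for t, frame in enumerate(power):
            noise = alpha * noise + (1.0 - alpha) * np.minimum(frame, 4.0 * noise)
            snr = np.maximum(frame / (noise + 1e-12) - 1.0, 0.0)
            gains[t] = np.maximum(snr / (snr + 1.0), floor)
        return gains

    rng = np.random.default_rng(2)
    power = rng.exponential(1.0, size=(300, 128))   # hypothetical noisy power spectrogram
    power[100:200, 20:40] += 20.0                   # a burst of "speech" energy

    g_fwd = suppress(power)                         # forward pass over time
    g_bwd = suppress(power[::-1])[::-1]             # backward pass, re-reversed to normal time
    g = np.maximum(g_fwd, g_bwd)                    # keep the more speech-preserving gain
    enhanced = g * power
    print("average level change [dB]:", 10 * np.log10(enhanced.mean() / power.mean()))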
TECHNICAL REPORT
  • Takayuki Arai
    2013 Volume 34 Issue 2 Pages 142-146
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    We first compared a speech signal convolved with two types of reverberation, a normal reverberation and its time-reversed version, which have the same modulation transfer function. Results showed that the intelligibility of speech with the time-reversed reverberation was significantly lower than that with the normal reverberation. We then compared the results of human speech recognition (HSR) with those of automatic speech recognition (ASR) to see whether a similar tendency could be observed in both cases. Results showed a similar asymmetry in ASR, but we found that HSR was more tolerant even when the reverberation became longer. Finally, we discussed factors of asymmetric temporal properties in speech production and perception that current speech recognizers do not have.
    Download PDF (284K)
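    The stimulus construction compared above can be sketched directly: a decaying impulse response and its time-reversed copy have identical modulation-transfer-function magnitudes, because time reversal changes only the phase of the Fourier transform of the squared impulse response. The exponentially decaying impulse response and the noise stand-in for speech below are hypothetical.

    # Sketch: normal vs. time-reversed reverberation with the same MTF magnitude.
    import numpy as np

    fs = 16000
    rng = np.random.default_rng(3)

    t = np.arange(int(0.5 * fs)) / fs
    rt60 = 0.6                                                   # hypothetical reverberation time [s]
    h = rng.standard_normal(t.size) * np.exp(-6.9 * t / rt60)    # exponentially decaying impulse response
    h_rev = h[::-1]                                              # time-reversed impulse response

    def mtf_mag(ir):
        """Magnitude of the normalized Fourier transform of the squared impulse response."""
        e = ir ** 2
        return np.abs(np.fft.rfft(e)) / e.sum()

    print("max MTF magnitude difference:", np.max(np.abs(mtf_mag(h) - mtf_mag(h_rev))))

    speech = rng.standard_normal(fs)                             # noise stand-in for a speech signal
    normal = np.convolve(speech, h)                              # normal reverberation
    reversed_rev = np.convolve(speech, h_rev)                    # time-reversed reverberation
    print("stimulus lengths:", normal.size, reversed_rev.size)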
ACOUSTICAL LETTER