Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
Volume 34, Issue 2
FOREWORD
INVITED PAPER
INVITED REVIEW
  • Yoichi Yamashita
    2013 Volume 34 Issue 2 Pages 73-79
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    Speech conveys not only linguistic information but also supplemental information that cannot be inferred from written language, such as attitude, speaking style, intention, emotion, and mental state; this supplemental information is called para-linguistic or non-linguistic information. It plays an important role in smooth and natural communication through spoken language. This paper reviews recognition and synthesis techniques for speech communication, focusing on emotion and emphasis, as well as corpora that are indispensable to the development of current speech technologies.
    Download PDF (84K)
PAPERS
  • Masato Nakayama, Takanobu Nishiura, Yoichi Yamashita, Noboru Nakasako
    2013 Volume 34 Issue 2 Pages 80-88
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    Beamforming with a microphone array is an ideal candidate for distant-talking speech recognition. An adaptive beamformer can achieve beamforming with a small microphone array, but it has difficulty extracting speech from a distant moving talker and reducing moving noise sources, because it must rapidly train long multi-channel adaptive filters using noises observed with the microphone array. However, if the positions of both talkers and noise sources can be estimated, the adaptive filters may not need to be trained in real noisy environments. We therefore propose a multiple-null-steering beamformer based on both talker and noise localization that does not require adaptive training with observed noises. We confirmed the validity and effectiveness of the proposed method through computer simulations and evaluation experiments in real noisy environments.
    Download PDF (1967K)
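    The null-steering idea above can be illustrated with a minimal narrowband sketch: given direction estimates for one talker and two noise sources, the array weights are constrained to unit gain toward the talker and zero gain toward each noise direction, so no adaptive training on observed noise is required. This is a simplified free-field, far-field illustration, not the authors' implementation; the array geometry, frequency, and directions are assumed for the example.

    # Minimal narrowband null-steering beamformer sketch (not the authors' code).
    # Assumes a free-field, far-field model, a uniform linear array, and known
    # (hypothetical) directions for one talker and two noise sources.
    import numpy as np

    def steering_vector(theta_deg, n_mics=8, spacing=0.04, freq=2000.0, c=343.0):
        """Far-field steering vector of a uniform linear array."""
        theta = np.deg2rad(theta_deg)
        delays = np.arange(n_mics) * spacing * np.sin(theta) / c   # per-microphone delay [s]
        return np.exp(-2j * np.pi * freq * delays)

    # Hypothetical localization results: talker at 0 deg, noises at -50 and +35 deg.
    a_talker = steering_vector(0.0)
    a_noise = [steering_vector(-50.0), steering_vector(35.0)]

    # Constraints: unit gain toward the talker, nulls toward each noise direction.
    C = np.column_stack([a_talker] + a_noise)       # constraint matrix
    g = np.array([1.0, 0.0, 0.0])                   # desired responses
    w = C @ np.linalg.solve(C.conj().T @ C, g)      # minimum-norm constrained weights

    for label, a in [("talker", a_talker), ("noise 1", a_noise[0]), ("noise 2", a_noise[1])]:
        print(f"{label}: |response| = {abs(w.conj() @ a):.3f}")

    A broadband version would compute such weights per frequency bin of a short-time Fourier transform; real rooms also add reverberation that this free-field sketch ignores.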
  • Masako Fujimoto, Kikuo Maekawa
    2013 Volume 34 Issue 2 Pages 89-93
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    A case study on the correlation between phonation type and paralinguistic information in Japanese was carried out using a high-speed digital video imaging system. The results showed that ``breathy'' and ``creaky'' phonations corresponded to ``disappointment''- and ``suspicion''-related utterances, respectively. The influence of paralinguistic information stretches over segments including voiceless consonants, which means that the alteration due to paralinguistic information is not limited to voice quality but extends to the whole setting of the larynx. These findings are in accord with those of our articulatory study. They suggest that the domain of the phonatory and articulatory settings due to paralinguistic information is the whole utterance, rather than individual segments.
    Download PDF (583K)
  • Noriaki Katagiri, Goh Kawai
    2013 Volume 34 Issue 2 Pages 94-104
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    Within the context of English taught solely in English at Japan's secondary schools, no research quantifies the differences between native instructors (first language English, who may or may not speak Japanese) and non-native instructors (first language Japanese, second language English). We developed a video corpus of an English language classroom and examined the speech of three native instructors and one non-native instructor. The corpus contains 49 English lessons of 45 minutes each at a Japanese public high school with monolingual learners of English as a foreign language. The native and non-native instructors occasionally taught together. Almost all speech in the lessons was in English. We compared lexical tokens and types found in our transcriptions with a collection of typical classroom English dialogues and with a wordlist created from large bodies of written and spoken English. We obtained the distributions of words, and the words preferred by either native or non-native instructors. Results suggest that (a) native and non-native instructors share a core vocabulary of classroom English, (b) native instructors teach vocabulary depth via open-ended conversations, (c) non-native instructors teach vocabulary breadth via textbook explanations, and (d) native and non-native instructors differ in teaching roles but not in language ability.
    Download PDF (706K)
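    The token/type comparison described above can be sketched with a few lines of counting code; the transcript snippets and the tokenizer below are hypothetical stand-ins for the corpus transcriptions and reference wordlists used in the study.

    # Sketch of token/type counting and instructor comparison on transcribed speech.
    # The two transcript snippets are hypothetical, not data from the corpus.
    import re

    def tokenize(text):
        """Lowercase word tokens; a crude stand-in for the study's transcription conventions."""
        return re.findall(r"[a-z']+", text.lower())

    native = "OK everyone, open your textbooks. What do you think the writer means here?"
    nonnative = "Open your textbooks to page ten. This word means environment, OK?"

    tok_n, tok_nn = tokenize(native), tokenize(nonnative)
    types_n, types_nn = set(tok_n), set(tok_nn)

    print("native tokens/types:", len(tok_n), "/", len(types_n))
    print("non-native tokens/types:", len(tok_nn), "/", len(types_nn))
    print("shared vocabulary:", sorted(types_n & types_nn))
    print("native-only:", sorted(types_n - types_nn))
    print("non-native-only:", sorted(types_nn - types_n))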
  • Yutaka Kamamoto, Takehiro Moriya, Noboru Harada, Yusuke Hiwasaki
    2013 Volume 34 Issue 2 Pages 105-112
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    This paper presents a set of low-complexity tools for lossless coding of the G.711 bitstream based on linear prediction. One is an algorithm for quantizing the PARCOR/reflection coefficients, and the other is an estimation method for the optimal prediction order. Both tools are based on a criterion that minimizes the entropy of the prediction residual signal and can be implemented in fixed-point arithmetic with very low complexity. Since the proposed methods show efficient performance in terms of compression and complexity, they have been adopted in Recommendation ITU-T G.711.0, a new standard for lossless compression of G.711 (A-law/μ-law logarithmic PCM) payloads.
    Download PDF (661K)
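    The order-selection idea, picking the prediction order that minimizes an entropy-based estimate of the coded residual plus the coefficient side information, can be sketched as below. The floating-point Levinson-Durbin recursion, the Gaussian entropy proxy, and the test signal are generic assumptions for illustration; the tools standardized in G.711.0 operate in fixed point and differ in detail.

    # Sketch of prediction-order selection by minimizing an entropy-based estimate
    # of the residual code length (generic illustration, not the G.711.0 algorithm).
    import numpy as np

    def levinson(r, max_order):
        """Return prediction-error energies for orders 0..max_order (Levinson-Durbin)."""
        err, a, energies = r[0], np.zeros(max_order + 1), [r[0]]
        for p in range(1, max_order + 1):
            k = (r[p] - np.dot(a[1:p], r[p-1:0:-1])) / err   # reflection (PARCOR) coefficient
            a_new = a.copy()
            a_new[p] = k
            a_new[1:p] = a[1:p] - k * a[p-1:0:-1]
            a, err = a_new, err * (1.0 - k * k)
            energies.append(err)
        return np.array(energies)

    def choose_order(frame, max_order=12, coef_bits=7):
        """Pick the order minimizing estimated bits: residual entropy + coefficient side info."""
        r = np.array([np.dot(frame[:len(frame)-lag], frame[lag:]) for lag in range(max_order + 1)])
        var = levinson(r, max_order) / len(frame)             # residual variance per order
        bits = 0.5 * len(frame) * np.log2(2 * np.pi * np.e * np.maximum(var, 1e-12)) \
               + coef_bits * np.arange(max_order + 1)
        return int(np.argmin(bits))

    rng = np.random.default_rng(0)
    x = rng.standard_normal(320)                              # hypothetical signal frame
    for n in range(2, len(x)):                                # shape it into a 2nd-order AR signal
        x[n] += 1.3 * x[n-1] - 0.6 * x[n-2]
    print("selected prediction order:", choose_order(x))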
  • Kunitoshi Motoki
    2013 Volume 34 Issue 2 Pages 113-122
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    A method of computing the acoustic characteristics of a simplified three-dimensional vocal-tract model with wall impedance is presented. The acoustic field is represented in terms of both plane waves and higher order modes in tubes. This model is constructed using an asymmetrically connected structure of rectangular acoustic tubes, and can parametrically represent acoustic characteristics at higher frequencies where the assumption of plane wave propagation does not hold. The propagation constants of the higher order modes are calculated taking account of wall impedance. The resonance characteristics of the vocal-tract model are evaluated using the radiated acoustic power. Computational results show an increase in bandwidth and a small upward shift of peaks, particularly at lower frequencies, as already suggested by the one-dimensional model. It is also shown that the sharp peaks at higher frequencies are less sensitive to the values of wall impedance even though the attenuation of the higher order modes is larger than that of plane waves.
    Download PDF (860K)
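    For context, the one-dimensional plane-wave baseline that this model extends can be sketched by multiplying the acoustic chain (ABCD) matrices of concatenated tube sections to obtain a glottis-to-lips transfer function. The two-tube area function below is hypothetical, the walls are taken as rigid, and neither the higher-order modes nor the wall impedance treated in the paper appears in this sketch.

    # Plane-wave chain-matrix model of a concatenated-tube vocal tract (1-D baseline only).
    import numpy as np

    RHO, C = 1.2, 343.0                       # air density [kg/m^3], sound speed [m/s]
    # Hypothetical /a/-like area function: (length [m], cross-section area [m^2]) per section.
    tubes = [(0.09, 1.0e-4), (0.08, 7.0e-4)]

    def chain_matrix(freq):
        """Overall ABCD matrix of the tube sections, glottis end to lip end."""
        k = 2 * np.pi * freq / C
        M = np.eye(2, dtype=complex)
        for length, area in tubes:
            z0 = RHO * C / area               # characteristic acoustic impedance of the section
            T = np.array([[np.cos(k * length), 1j * z0 * np.sin(k * length)],
                          [1j * np.sin(k * length) / z0, np.cos(k * length)]])
            M = M @ T
        return M

    freqs = np.arange(50.0, 5000.0, 10.0)
    # With an open (zero-pressure) lip termination, U_lip / U_glottis = 1 / D(f).
    gain = np.array([1.0 / abs(chain_matrix(f)[1, 1]) for f in freqs])
    peaks = freqs[1:-1][(gain[1:-1] > gain[:-2]) & (gain[1:-1] > gain[2:])]
    print("estimated formant peaks [Hz]:", peaks[:4])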
  • Duy Khanh Ninh, Masanori Morise, Yoichi Yamashita
    2013 Volume 34 Issue 2 Pages 123-132
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    A minimum generation error (MGE) criterion has been proposed for model training in hidden Markov model (HMM)-based speech synthesis to minimize the error between generated and original static parameter sequences of speech. However, dynamic properties of speech parameters are ignored in the generation error definition. In this study, we incorporate these dynamic properties into MGE training by introducing the error component of dynamic features (i.e., delta and delta-delta parameters) into the generation error function. We propose two methods for setting the weight associated with the additional error component. In the fixed weighting approach, this weight is kept constant over the course of speech. In the adaptive weighting approach, it is adjusted according to the degree of dynamicity of speech segments. An objective evaluation shows that the newly derived MGE criterion with the adaptive weighting method results in comparable performance for the static feature and better performance for the delta feature compared with the baseline MGE criterion. Subjective listening tests exhibit a small but statistically significant improvement in the quality of speech synthesized by the proposed technique. The newly derived criterion improves the capability of HMMs in capturing dynamic properties of speech without increasing the computational complexity of the training process compared with the baseline criterion.
    Download PDF (438K)
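    The weighting scheme can be sketched generically as a generation-error measure that combines a static error with a weighted delta (dynamic) error, where the weight is either fixed or scaled by the local dynamicity of the natural trajectory. The trajectories and the adaptive rule below are hypothetical illustrations, not the paper's HMM training procedure.

    # Sketch of a generation error with a fixed or dynamicity-adaptive delta weight.
    import numpy as np

    def delta(c):
        """Simple delta (dynamic) features via a central difference over frames."""
        return np.gradient(c, axis=0)

    def generation_error(c_gen, c_nat, weight):
        """Per-frame static error plus weighted delta error."""
        e_static = np.sum((c_gen - c_nat) ** 2, axis=1)
        e_delta = np.sum((delta(c_gen) - delta(c_nat)) ** 2, axis=1)
        return e_static + weight * e_delta

    def adaptive_weight(c_nat, base=1.0):
        """Hypothetical adaptive rule: weight delta errors more where speech changes fast."""
        dynamicity = np.sum(delta(c_nat) ** 2, axis=1)
        return base * dynamicity / (np.mean(dynamicity) + 1e-12)

    rng = np.random.default_rng(1)
    T, D = 200, 24                                                 # frames, feature dimensions
    c_nat = np.cumsum(rng.standard_normal((T, D)) * 0.1, axis=0)   # "natural" trajectory
    c_gen = c_nat + rng.standard_normal((T, D)) * 0.05             # "generated" trajectory

    fixed = generation_error(c_gen, c_nat, weight=0.5).mean()
    adaptive = generation_error(c_gen, c_nat, weight=adaptive_weight(c_nat)).mean()
    print(f"mean error, fixed weight: {fixed:.4f}; adaptive weight: {adaptive:.4f}")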
  • Yasunari Obuchi, Ryu Takeda, Masahito Togami
    2013 Volume 34 Issue 2 Pages 133-141
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    In this paper, we propose a new noise suppression method that is best used as a preprocessor for time-lag speech recognition. Assuming that a time lag of a few seconds is acceptable in various speech recognition applications, the proposed method is realized as a combination of forward and backward estimation flows over time. Each estimation flow is based on the optimally modified log-spectral amplitude (OM-LSA) speech estimator, but a look-ahead estimation mechanism is additionally incorporated to make the estimation more robust. Evaluation experiments using various databases confirm that speech recognition accuracy is greatly improved by adding the proposed method to the existing system.
    Download PDF (959K)
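    The forward/backward structure can be sketched as two estimation passes over the same spectrogram, one in natural time order and one time-reversed (which the accepted time lag makes possible), whose spectral gains are then combined. The simple recursive noise tracker and Wiener-style gain below stand in for the OM-LSA estimator, and the spectra are synthetic.

    # Sketch of forward/backward noise-suppression passes combined into one gain.
    # The per-frame estimator is a crude stand-in for OM-LSA, not an implementation of it.
    import numpy as np

    def suppress(power, alpha=0.95, floor=0.1):
        """One pass: track noise recursively and compute a Wiener-style spectral gain."""
        noise = power[0].copy()
        gains = np.empty_like(power)
        for t, frame in enumerate(power):
            noise = alpha * noise + (1.0 - alpha) * np.minimum(frame, 4.0 * noise)
            snr = np.maximum(frame / (noise + 1e-12) - 1.0, 0.0)
            gains[t] = np.maximum(snr / (snr + 1.0), floor)
        return gains

    rng = np.random.default_rng(2)
    power = rng.exponential(1.0, size=(300, 128))   # hypothetical noisy power spectrogram
    power[100:200, 20:40] += 20.0                   # a burst of "speech" energy

    g_fwd = suppress(power)                         # forward pass over time
    g_bwd = suppress(power[::-1])[::-1]             # backward pass, re-reversed to normal time
    g = np.maximum(g_fwd, g_bwd)                    # keep the more speech-preserving gain
    enhanced = g * power
    print("average level change [dB]:", 10 * np.log10(enhanced.mean() / power.mean()))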
TECHNICAL REPORT
  • Takayuki Arai
    2013 Volume 34 Issue 2 Pages 142-146
    Published: February 01, 2013
    Released on J-STAGE: March 01, 2013
    JOURNAL FREE ACCESS
    We first compared a speech signal convolved with two types of reverberation, a normal reverberation and its time-reversed version, which have the same modulation transfer function. Results showed that the intelligibility of speech with the time-reversed reverberation was significantly lower than that with the normal reverberation. We then compared the results of human speech recognition (HSR) with those of automatic speech recognition (ASR) to see whether a similar tendency could be observed in both cases. Results showed a similar asymmetry in ASR, but we found that HSR was more tolerant even when the reverberation became longer. Finally, we discussed factors of asymmetric temporal properties in speech production and perception that current speech recognizers do not have.
    Download PDF (284K)
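    The stimulus construction compared above can be sketched directly: a decaying impulse response and its time-reversed copy have identical modulation-transfer-function magnitudes, because time reversal changes only the phase of the Fourier transform of the squared impulse response. The exponentially decaying impulse response and the noise stand-in for speech below are hypothetical.

    # Sketch: normal vs. time-reversed reverberation with the same MTF magnitude.
    import numpy as np

    fs = 16000
    rng = np.random.default_rng(3)

    t = np.arange(int(0.5 * fs)) / fs
    rt60 = 0.6                                                   # hypothetical reverberation time [s]
    h = rng.standard_normal(t.size) * np.exp(-6.9 * t / rt60)    # exponentially decaying impulse response
    h_rev = h[::-1]                                              # time-reversed impulse response

    def mtf_mag(ir):
        """Magnitude of the normalized Fourier transform of the squared impulse response."""
        e = ir ** 2
        return np.abs(np.fft.rfft(e)) / e.sum()

    print("max MTF magnitude difference:", np.max(np.abs(mtf_mag(h) - mtf_mag(h_rev))))

    speech = rng.standard_normal(fs)                             # noise stand-in for a speech signal
    normal = np.convolve(speech, h)                              # normal reverberation
    reversed_rev = np.convolve(speech, h_rev)                    # time-reversed reverberation
    print("stimulus lengths:", normal.size, reversed_rev.size)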
ACOUSTICAL LETTER