Journal of the Acoustical Society of Japan (E)

Preface to the special issue on “Fundamentals on Spoken Language Processing”

Kazuhiko Kakehi

1995 Volume 16 Issue 5 Pages 257-259
Published: 1995
Released on J-STAGE: February 17, 2011

DOI https://doi.org/10.1250/ast.16.257

JOURNAL FREE ACCESS

Fundamental frequency contour modeling using HMM and categorical multiple regression technique

Toshiaki Fukada, Yasuhiro Komori, Takashi Aso, Yasunori Ohora

1995 Volume 16 Issue 5 Pages 261-271
Published: 1995
Released on J-STAGE: February 17, 2011

DOI https://doi.org/10.1250/ast.16.261

JOURNAL FREE ACCESS

This paper proposes a novel statistical method for modeling fundamental frequency (F₀) contours, aimed at text-to-speech synthesis. In the proposed method, the F₀ contour of a sentence is constructed from statistical minor phrase models. These models integrate local models of normalized pitch patterns with global models of maxima and dynamic ranges. A hidden Markov model (HMM) is introduced to determine the normalized pitch patterns (pitch-HMM). To determine the maximum and the dynamic range, the categorical multiple regression technique (CMRT) is introduced. The HMM is a good statistical model that directly represents F₀ contours by several reliable states. Moreover, it makes it easy to take relative changes of F₀ (ΔF₀) and phonetic environments into account. CMRT is a statistical modeling technique that can deal with syntactic structures and acoustic events in a sentence simultaneously. Evaluation of the pitch-HMMs shows an accent type identification rate of 91% and an RMS error of 9.2 Hz. Evaluation of the maximum and dynamic range models gives multiple correlation coefficients of 0.901 and 0.835, respectively. Finally, the result of the subjective evaluation indicates that the proposed modeling is superior to the conventional modeling.

Rapid environment adaptation for speech recognition

Keizaburo Takagi, Hiroaki Hattori, Takao Watanabe

1995 Volume 16 Issue 5 Pages 273-281
Published: 1995
Released on J-STAGE: February 17, 2011

DOI https://doi.org/10.1250/ast.16.273

JOURNAL FREE ACCESS

This paper proposes a rapid environment adaptation algorithm based on spectrum equalization (REALISE). In practical speech recognition applications, differences between training and testing environments often seriously diminish recognition accuracy. These environmental differences can be classified into two types: differences in additive noise and differences in multiplicative noise in the spectral domain. The proposed method calculates the time alignment between a testing utterance and its closest reference pattern, and then calculates the noise differences between the two according to that alignment. All reference patterns are then adapted to the testing environment using these differences. Finally, the testing utterance is recognized using the adapted reference patterns. In a 250-word Japanese recognition task, in which the training and testing microphones were of two different types, REALISE improved recognition accuracy from 87% to 96%.
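
As a reader's aid, the reported accuracy gain can also be expressed as a relative reduction in error rate. The short sketch below is only arithmetic on the two figures quoted in the abstract; the function name `relative_error_reduction` is hypothetical, and the resulting percentage is a derived value that the abstract itself does not state.

```python
def relative_error_reduction(acc_before_pct: float, acc_after_pct: float) -> float:
    """Return the fractional reduction in error rate between two accuracies (in %)."""
    err_before = 100.0 - acc_before_pct  # error rate before adaptation
    err_after = 100.0 - acc_after_pct    # error rate after adaptation
    if err_before <= 0.0:
        raise ValueError("baseline accuracy must be below 100%")
    return (err_before - err_after) / err_before


# Accuracies quoted in the abstract: 87% before, 96% after.
print(f"{relative_error_reduction(87, 96):.1%}")  # prints "69.2%"
```

On these figures the error rate falls from 13% to 4%, which corresponds to a relative error reduction of roughly 69%.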

Speaker individualities in speech spectral envelopes

Tatsuya Kitamura, Masato Akagi

1995 Volume 16 Issue 5 Pages 283-289
Published: 1995
Released on J-STAGE: February 17, 2011

DOI https://doi.org/10.1250/ast.16.283

JOURNAL FREE ACCESS

The aim of the three psychoacoustic experiments described here was to clarify whether there are speaker individualities in spectral envelopes, in which frequency bands such individualities exist, and how the frequency bands carrying speaker individualities can be manipulated. The LMA analysis-synthesis system was used to prepare stimuli in which specific frequency bands were varied, and the frequency bands carrying speaker individualities were estimated experimentally. The results indicate that (1) speaker individualities exist in spectral envelopes, (2) these individualities lie mainly at frequencies higher than 22 ERB rate (2212 Hz), while vowel characteristics exist from 12 ERB rate (603 Hz) to 22 ERB rate, and (3) the voice quality can be controlled by replacing the higher frequency band of one talker with that of other talkers. The replacement point is the adjacent spectral local minimum below the spectral local maximum around 23 ERB rate in the spectral envelopes.
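
The abstract does not say which ERB-rate formula was used. The sketch below assumes the widely used Glasberg and Moore (1990) form, ERB rate = 21.4 log10(4.37 f / 1000 + 1) with f in Hz, which reproduces the Hz values quoted above to within rounding. The function names are illustrative, not taken from the paper.

```python
import math


def hz_to_erb_rate(freq_hz: float) -> float:
    """Convert frequency in Hz to ERB rate (assumed Glasberg and Moore 1990 form)."""
    return 21.4 * math.log10(4.37 * freq_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb_rate: float) -> float:
    """Invert the formula above: ERB rate back to frequency in Hz."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) * 1000.0 / 4.37


# Band edges quoted in the abstract.
for erb in (12, 22):
    print(erb, round(erb_rate_to_hz(erb)))
```

Under this assumption, 12 ERB rate maps to about 603 Hz and 22 ERB rate to about 2212 Hz, matching the figures given in the abstract.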

Electromyographic study of focus and accent in Japanese

Kikuo Maekawa, Shigeru Kiritani, Hajime Hirose

1995 Volume 16 Issue 5 Pages 291-298
Published: 1995
Released on J-STAGE: February 17, 2011

DOI https://doi.org/10.1250/ast.16.291

JOURNAL FREE ACCESS

In this paper, the physiological mechanism underlying the control of voice fundamental frequency (F₀) was investigated at the phrasal level. A new method of correlation analysis between cricothyroid muscle activity and the resulting F₀ contour was proposed and applied to speech material varying in accentedness and focal conditions. Examination of the difference between the observed F₀ contours and the F₀ contours estimated from the cricothyroid activity revealed deviation tendencies that are related to the linguistic properties of the speech material: accentedness, location of the phrase in a sentence, and the presence vs. absence of focus. Another finding was the strong suppression of sternohyoid muscle activity under focus. The suppression was stronger in unaccented phrases than in accented ones. An interpretation of the suppression and its relationship to accent was offered, based on the notion of laryngeal state function proposed in Atkinson (1978).

Clustered inter-phrase or word context-dependent models for continuously read Japanese

Kazuhiro Kondo, Yu-Hung Kao, Barbara Wheatley

1995 Volume 16 Issue 5 Pages 299-310
Published: 1995
Released on J-STAGE: February 17, 2011

DOI https://doi.org/10.1250/ast.16.299

JOURNAL FREE ACCESS

This paper investigates methods to model inter-phrase or inter-word context for continuous Japanese speech recognition. It is well known that in continuous speech, coarticulation between words or phrases induces allophonic variation of the beginning and ending phones in words or phrases. It was found that, by compiling a network of context-dependent phonetic models that model this inter-word or inter-phrase context, a recognition error reduction of 32% can be achieved compared to models that do not account for inter-word context, with task-dependent training, i.e., models that were trained with the same vocabulary as the test set. A more dramatic error reduction of up to 43% was possible with task-independent training. However, modeling this context significantly increases the number of phonetic models required to model the vocabulary. With digit models, the number of models increases 4- to 5-fold. To overcome this increase, we clustered the inter-word/phrase context into a few phonetic classes. Using one class for consonant inter-word context and two classes for vowel context, the recognition accuracy on digit string recognition was found to be virtually equal to the accuracy with unclustered models, while the number of phonetic models required was reduced by more than 50%.

Role of prosodic features in the human process of perceiving spoken words and sentences in Japanese

Nobuaki Minematsu, Keikichi Hirose

1995 Volume 16 Issue 5 Pages 311-320
Published: 1995
Released on J-STAGE: February 17, 2011

DOI https://doi.org/10.1250/ast.16.311

JOURNAL FREE ACCESS

Although prosodic features of speech are known to play an important role in the transmission of linguistic information, experiments that quantitatively analyze their role in the speech perception process are rather rare. As a step toward the clarification and formulation of the process, three perceptual experiments were conducted. In the first experiment, synthetic speech for isolated words was generated after accent type manipulation. Results showed that prosodic features are important for word perception, especially in the case of type 1 accent. In the second experiment, the gating paradigm was applied to natural word utterances. Using the gated utterances as stimuli, the minimum period required for correct identification was investigated for words of each accent type. Results showed that, by utilizing prosodic features, the perception of words with type 1 accent is completed earlier than that of words with other accent types. In the last experiment, sentence stimuli were synthesized after manipulating the phrase and accent components of the fundamental frequency contour. Results showed that a phrase component, even with a small command magnitude, can group words into a phrase unit and thus can work as a cue for detecting syntactic structures.

Effect of ultrasound on diffusion separation of electrolyte through vinylon membrane

Hui Li, Masao Ide

1995 Volume 16 Issue 5 Pages 321-323
Published: 1995
Released on J-STAGE: February 17, 2011

DOI https://doi.org/10.1250/ast.16.321

JOURNAL FREE ACCESS
