Journal of the Acoustical Society of Japan (E)
Online ISSN : 2185-3509
Print ISSN : 0388-2861
ISSN-L : 0388-2861
Volume 16, Issue 5
  • Kazuhiko Kakehi
    1995 Volume 16 Issue 5 Pages 257-259
    Published: 1995
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
  • Toshiaki Fukada, Yasuhiro Komori, Takashi Aso, Yasunori Ohora
    1995 Volume 16 Issue 5 Pages 261-271
    Published: 1995
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    This paper proposes a novel statistical method for modeling fundamental frequency (F0) contours, aimed at text-to-speech. In the proposed method, the F0 contour of a sentence is constructed from statistical minor-phrase models. These models integrate local models of normalized pitch patterns with global models of maxima and dynamic ranges. A hidden Markov model (HMM) is introduced to determine the normalized pitch patterns (pitch-HMM). To determine the maximum and the dynamic range, the categorical multiple regression technique (CMRT) is introduced. The HMM is a good statistical model that directly represents F0 contours by several reliable states. Moreover, it makes it easy to take relative changes of the F0 (ΔF0) and phonetic environments into account. CMRT is a good statistical modeling technique that can deal with syntactic structures and acoustic events in a sentence simultaneously. Evaluation of the pitch-HMMs shows an accent type identification rate of 91% and an RMS error of 9.2 Hz. Evaluation of the maximum and dynamic range models gives multiple correlation coefficients of 0.901 and 0.835, respectively. Finally, the subjective evaluation indicates that the proposed modeling is superior to conventional modeling.
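    The RMS error quoted in the abstract is a standard way to compare a predicted F0 contour with a reference contour. The following is a minimal sketch of that metric only, assuming both contours are sampled at the same frames and given in Hz; the function name and the sample values are illustrative and are not taken from the paper.

```python
import math
from typing import Sequence


def rms_error_hz(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root-mean-square error between two F0 contours, in Hz.

    Both contours must have the same number of frames (an assumption of
    this sketch; real evaluations usually restrict to voiced frames).
    """
    if len(predicted) != len(reference):
        raise ValueError("contours must have the same number of frames")
    if not predicted:
        raise ValueError("contours must not be empty")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Made-up two-frame contours, for illustration only.
    print(rms_error_hz([100.0, 110.0], [103.0, 106.0]))
```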
  • Keizaburo Takagi, Hiroaki Hattori, Takao Watanabe
    1995 Volume 16 Issue 5 Pages 273-281
    Published: 1995
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    This paper proposes a rapid environment adaptation algorithm based on spectrum equalization (REALISE). In practical speech recognition applications, differences between training and testing environments often seriously diminish recognition accuracy. These environmental differences can be classified into two types: differences in additive noise and differences in multiplicative noise in the spectral domain. The proposed method computes a time alignment between a testing utterance and the reference pattern closest to it, and then calculates the noise differences between the two according to that time alignment. All reference patterns are then adapted to the testing environment using these differences. Finally, the testing utterance is recognized using the adapted reference patterns. In a 250-word Japanese recognition task in which the training and testing microphones were of two different types, REALISE improved recognition accuracy from 87% to 96%.
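    The accuracy figures above can also be read as a relative reduction in error rate. The following is a small sketch of that arithmetic, assuming the error rate is simply 100% minus the accuracy; the function name is illustrative and does not come from the paper.

```python
def relative_error_reduction(acc_before_pct: float, acc_after_pct: float) -> float:
    """Fraction of errors removed when accuracy rises from before to after.

    Assumes error rate = 100 - accuracy (both in percent).
    """
    err_before = 100.0 - acc_before_pct
    err_after = 100.0 - acc_after_pct
    if err_before <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (err_before - err_after) / err_before


if __name__ == "__main__":
    # 87% -> 96% accuracy means errors fall from 13% to 4%,
    # i.e. 9/13 of the errors are removed (roughly 69%).
    print(f"{relative_error_reduction(87.0, 96.0):.1%}")
```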
  • Tatsuya Kitamura, Masato Akagi
    1995 Volume 16 Issue 5 Pages 283-289
    Published: 1995
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    The aim of the three psychoacoustic experiments described here was to clarify whether there are speaker individualities in the spectral envelopes, in which frequency bands such individualities exist, and how frequency bands having speaker individualities can be manipulated. The LMA analysis-synthesis system was used to prepare stimuli varied in specific frequency bands, and the frequency bands having speaker individualities were estimated experimentally. The results indicate that (1) speaker individualities exist in spectral envelopes, (2) these individualities lie mainly at frequencies above 22 ERB rate (2212 Hz), while vowel characteristics lie between 12 ERB rate (603 Hz) and 22 ERB rate, and (3) voice quality can be controlled by replacing the higher frequency band of one talker with that of other talkers. The replacement point is the spectral local minimum adjacent to and below the spectral local maximum around 23 ERB rate in the spectral envelopes.
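    The ERB-rate values quoted in the abstract can be checked against their Hz equivalents. The following is a minimal sketch, assuming the Glasberg and Moore (1990) ERB-rate formula, E = 21.4 log10(4.37 f / 1000 + 1) with f in Hz. The abstract does not state which formula the authors used, but this one reproduces the quoted pairs to within 1 Hz. The function names are illustrative.

```python
import math


def hz_to_erb_rate(freq_hz: float) -> float:
    """Convert frequency in Hz to ERB rate (assumed Glasberg-Moore form)."""
    return 21.4 * math.log10(4.37 * freq_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb_rate: float) -> float:
    """Inverse of hz_to_erb_rate: convert ERB rate back to Hz."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 4.37 * 1000.0


if __name__ == "__main__":
    # Band edges named in the abstract: 12, 22 and 23 ERB rate.
    for erb in (12, 22, 23):
        print(f"{erb} ERB rate is about {erb_rate_to_hz(erb):.0f} Hz")
```

    Under this assumption, 12 and 22 ERB rate come out at approximately 603 Hz and 2212 Hz, in line with the values given in the abstract.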
  • Kikuo Maekawa, Shigeru Kiritani, Hajime Hirose
    1995 Volume 16 Issue 5 Pages 291-298
    Published: 1995
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    This paper investigates the physiological mechanism underlying the control of voice fundamental frequency (F0) at the phrasal level. A new method of correlation analysis between cricothyroid muscle activity and the resulting F0 contour is proposed and applied to speech material varying in accentedness and focal conditions. Examination of the difference between the observed F0 contours and the F0 contours estimated from the cricothyroid activity revealed deviation tendencies related to the linguistic properties of the speech material: accentedness, the location of the phrase in the sentence, and the presence or absence of focus. Another finding was the strong suppression of sternohyoid muscle activity under focus. The suppression was stronger in unaccented phrases than in accented ones. An interpretation of the suppression and its relationship to accent is proposed, based on the notion of the laryngeal state function proposed in Atkinson (1978).
  • Kazuhiro Kondo, Yu-Hung Kao, Barbara Wheatley
    1995 Volume 16 Issue 5 Pages 299-310
    Published: 1995
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    This paper investigates methods for modeling inter-phrase or inter-word context in continuous Japanese speech recognition. It is well known that in continuous speech, coarticulation between words or phrases induces allophonic variation of the beginning and ending phones in words or phrases. It was found that compiling a network of context-dependent phonetic models that model this inter-word or inter-phrase context reduces recognition errors by 32% compared to models that do not account for inter-word context, with task-dependent training, i.e. models trained with the same vocabulary as the test set. A more dramatic error reduction of up to 43% was possible with task-independent training. However, modeling this context significantly increases the number of phonetic models required to model the vocabulary. With digit models, the number of models increases 4- to 5-fold. To overcome this increase, we clustered the inter-word/phrase context into a few phonetic classes. Using one class for consonant inter-word context and two classes for vowel context, the accuracy on digit string recognition was found to be virtually equal to the accuracy with unclustered models, while the number of phonetic models required was reduced by more than 50%.
  • Nobuaki Minematsu, Keikichi Hirose
    1995 Volume 16 Issue 5 Pages 311-320
    Published: 1995
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    Although prosodic features of speech are known to play an important role in the transmission of linguistic information, quantitative analyses of their role in the speech perception process are rather rare. As a step toward the clarification and formulation of the process, three perceptual experiments were conducted. In the first experiment, synthetic speech of isolated words was generated after accent type manipulation. Results showed that the prosodic features are important for word perception, especially in the case of type 1 accent. In the second experiment, the gating paradigm was applied to natural word utterances. Using the gated utterances as stimuli, the minimum period required for correct identification was investigated for words of each accent type. Results showed that, by utilizing the prosodic features, the perception of words with type 1 accent is completed earlier than that of words with other accent types. In the last experiment, sentence stimuli were synthesized after manipulating the phrase and accent components of the fundamental frequency contour. Results showed that a phrase component, even with a small command magnitude, can group words into a phrase unit and thus can work as a cue for detecting syntactic structures.
  • Hui Li, Masao Ide
    1995 Volume 16 Issue 5 Pages 321-323
    Published: 1995
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS