Journal of the Acoustical Society of Japan (E)
Online ISSN : 2185-3509
Print ISSN : 0388-2861
ISSN-L : 0388-2861
Volume 13, Issue 6
Displaying 1-13 of 13 articles from this issue
  • Yoh'ichi Tohkura
    1992, Volume 13, Issue 6, Pages 331-332
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    Download PDF (303K)
  • Biing-Hwang Juang, Shigeru Katagiri
    1992, Volume 13, Issue 6, Pages 333-339
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    In this paper we address one of the prominent problems in pattern recognition, namely minimization of the classification/recognition error rate. We propose an unconventional approach and a new formulation of the problem aimed at directly achieving minimum classification error performance. The approach, called discriminative training, differs from the traditional statistical pattern recognition approach in its objective. Unlike the Bayesian framework, the new method does not require estimation of probability distributions, which usually cannot be obtained reliably. The new method has been applied in various experimental studies with good results, some of which are highlighted in the paper to demonstrate its effectiveness. A broad range of problems can benefit from this formulation. (A schematic form of the criterion is sketched after this entry.)
    Download PDF (3667K)
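    The abstract does not reproduce the formulation itself; the following is a minimal sketch of the generic minimum classification error (MCE) criterion that this kind of discriminative training optimizes. The notation (discriminant functions g_k, smoothing constants eta and gamma) is ours, not taken from the paper.

    ```latex
    % Misclassification measure for a token x of class k, with discriminant
    % functions g_j(x; \Lambda) and smoothing constant \eta:
    d_k(x) = -g_k(x;\Lambda)
      + \log\Bigl[\tfrac{1}{M-1}\sum_{j \neq k} e^{\eta\, g_j(x;\Lambda)}\Bigr]^{1/\eta}

    % Smoothed 0-1 loss (a sigmoid of the misclassification measure):
    \ell_k(x) = \frac{1}{1 + e^{-\gamma d_k(x)}}

    % Empirical loss over N training tokens; minimizing L(\Lambda) by gradient
    % descent directly targets a smoothed count of classification errors:
    L(\Lambda) = \frac{1}{N}\sum_{n=1}^{N} \ell_{k_n}(x_n)
    ```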
  • Takashi Komori, Shigeru Katagiri
    1992, Volume 13, Issue 6, Pages 341-349
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    Although many pattern classifiers based on artificial neural networks have been vigorously studied, they remain inadequate for classifying dynamic (variable- and unspecified-duration) speech patterns. To cope with this problem, the generalized probabilistic descent method (GPD) has recently been proposed. GPD not only allows one to train a discriminative system to classify dynamic patterns, but also possesses a remarkable advantage, namely a guarantee of learning optimality (in the sense of a probabilistic descent search). A practical implementation of this theory, however, remains to be evaluated. In this light, we focus on evaluating GPD in the design of a widely used speech recognizer based on dynamic time warping distance measurement. We also show that the design algorithm appraised in this paper can be regarded as a new version of learning vector quantization incorporating dynamic programming. Experimental results on syllable and phoneme classification tasks clearly demonstrate GPD's superiority. (The basic probabilistic descent update is sketched after this entry.)
    Download PDF (3000K)
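    As a minimal sketch of the probabilistic-descent style update underlying GPD, in generic form and in our own notation (the paper applies it to DTW reference templates, which is not detailed here):

    ```latex
    % Adaptive update of the classifier parameters \Lambda on the t-th training
    % token x_t, with loss \ell and a positive-definite matrix U_t:
    \Lambda_{t+1} = \Lambda_t - \epsilon_t\, U_t\, \nabla_{\Lambda}\, \ell(x_t; \Lambda_t)

    % Probabilistic-descent conditions on the step sizes, which give the
    % optimality guarantee mentioned in the abstract (convergence, in
    % probability, to a local minimum of the expected loss):
    \sum_{t=1}^{\infty} \epsilon_t = \infty, \qquad
    \sum_{t=1}^{\infty} \epsilon_t^2 < \infty
    ```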
  • Shozo Makino, Mitsuru Endo, Toshio Sone, Ken'iti Kido
    1992, Volume 13, Issue 6, Pages 351-360
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    This paper proposes a new phoneme recognition method based on the Learning Vector Quantization (LVQ2) algorithm proposed by Kohonen. We propose three versions of a modified training algorithm to overcome a shortcoming of the LVQ2 method. In the modified LVQ2 algorithm, p reference vectors are modified at the same time if the correct class is within the N-th rank, where N is set to some constant. The phoneme recognition scores obtained with the modified LVQ2 algorithm were higher than those obtained with the original LVQ2 algorithm. Furthermore, we propose a segmentation and recognition method for phonemes in continuous speech. First, a likelihood matrix is computed using the reference vectors, where each row indicates the likelihood sequence of a phoneme and each column indicates the likelihoods of all phonemes for each 10-ms unit. The optimum phoneme sequence is computed from the likelihood matrix using dynamic programming with duration constraints. We applied this method to a multi-speaker-dependent phoneme recognition task for continuous speech uttered bunsetsu by bunsetsu. The phoneme recognition score for the continuous-speech samples was 85.5%. (An illustrative rank-N update rule is sketched after this entry.)
    Download PDF (4850K)
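    The abstract gives only a verbal description of the rank-N modification; the sketch below is our illustrative reading of such a rule, not the paper's exact algorithm (the paper proposes three variants whose update equations may differ). All names and the learning-rate/rank defaults are assumptions.

    ```python
    import numpy as np

    def modified_lvq2_step(x, label, refs, ref_labels, alpha=0.05, N=3):
        """One illustrative rank-N LVQ2-style update (not the paper's exact rule).

        x          : (d,) input feature vector with known class 'label'
        refs       : (M, d) reference (codebook) vectors, modified in place
        ref_labels : (M,) array of class labels, one per reference vector
        """
        dist = np.linalg.norm(refs - x, axis=1)        # distances to all references
        order = np.argsort(dist)                       # references ranked by closeness
        ranked_classes = []
        for i in order:                                # class ranking by first appearance
            if ref_labels[i] not in ranked_classes:
                ranked_classes.append(ref_labels[i])
        if ranked_classes[0] == label:                 # already classified correctly
            return refs
        if label in ranked_classes[:N]:                # correct class within rank N
            sorted_labels = list(ref_labels[order])
            stop = sorted_labels.index(label) + 1      # up to the best correct reference
            for i in order[:stop]:
                if ref_labels[i] == label:
                    refs[i] += alpha * (x - refs[i])   # pull the correct reference closer
                else:
                    refs[i] -= alpha * (x - refs[i])   # push closer rival references away
        return refs
    ```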
  • Tatsuya Kawahara, Shuji Doshita
    1992, Volume 13, Issue 6, Pages 361-367
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    To improve the discriminating ability of the Hidden Markov Model (HMM), we have proposed incorporating a classifier into the HMM. In this paper, we make a comparative study of its discrete-distribution version and its continuous counterpart. The classifier in the discrete model discriminates the symbols that are passed to the HMM, whereas the classifier in the continuous model discriminates the HMM states and computes their output probabilities as classification scores. Thus, the output probability in the discrete model indicates the frequency of symbol occurrence, while that in the continuous model indicates the reliability of the classification for a given input. We experimentally evaluated both types of HMM with the same classifier while varying its output characteristics. In phoneme recognition, the discrete model was superior to the continuous one. In word and sentence recognition, however, we found that a truly stochastic distribution of the output probabilities was important regardless of the type of HMM.
    Download PDF (3300K)
  • Toru Imai, Akio Ando
    1992, Volume 13, Issue 6, Pages 369-378
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    Maximum likelihood estimation is commonly used as a training method for HMMs in speech recognition. Corrective training and several other discriminative training methods have been proposed in recent years in order to obtain HMMs with greater discriminative ability than maximum likelihood estimation provides. This paper describes a new discriminative HMM learning algorithm which attempts to minimize an error function over all training data. The error function is defined so as to represent a “degree” of recognition error on the training data. The algorithm searches for optimal HMMs by perturbing the HMM parameters iteratively, and is designated the “A-learning” algorithm in this paper. It is applicable not only to discrete HMMs but also to continuous HMMs. It is shown experimentally that this algorithm yields better recognition results than the corrective training algorithm for 17 Japanese consonants. (An illustrative perturbation-search loop is sketched after this entry.)
    Download PDF (2175K)
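    The abstract describes the search only as iterative perturbation of HMM parameters; the loop below is a generic illustration of that idea, not the paper's A-learning algorithm. The error function, step size, and iteration count are placeholders.

    ```python
    import numpy as np

    def perturbation_search(params, error_fn, delta=0.01, iters=100, rng=None):
        """Generic iterative parameter-perturbation minimizer (illustration only).

        Perturb one parameter at a time and keep the change whenever the
        training-set error function decreases.
        """
        rng = rng or np.random.default_rng(0)
        params = params.copy()
        best = error_fn(params)
        for _ in range(iters):
            i = rng.integers(len(params))              # pick one parameter
            for step in (+delta, -delta):              # try a small move either way
                trial = params.copy()
                trial[i] += step
                e = error_fn(trial)
                if e < best:                           # accept if the error drops
                    params, best = trial, e
                    break
        return params, best
    ```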
  • Implementation details and experimental results
    David Rainton, Shigeki Sagayama
    1992, Volume 13, Issue 6, Pages 379-387
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    This paper describes an implementation of minimum error training for continuous Gaussian mixture density HMMs. Instead of maximising the conditional probability of producing a set of training data, as in the conventional HMM maximum likelihood approach, we train to minimise the number of recognition errors. The most important aspect of this work is the use of a first-order differentiable “loss” function, the minimisation of which is directly related to the minimisation of the recognition error rate. The performance of the resulting minimum error HMMs was compared against that of conventional maximum likelihood HMMs in a continuous speech recognition task using the ATR 5,240-word Japanese database. The results were impressive. For example, for 10-mixture, 5-state Baum-Welch trained HMMs, minimum error training reduced word error rates from 20.6% to 3.0% on the closed training set and from 23.2% to 13.2% on the open test set. Furthermore, 3-mixture minimum error HMMs performed better than 10-mixture maximum likelihood HMMs. In fact, in every performance measure taken, the minimum error HMMs proved superior. (The form of such a differentiable loss is sketched after this entry.)
    Download PDF (3343K)
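    A minimal sketch of the quantities involved, in notation of our own choosing (the paper's exact loss may differ): because the Gaussian mixture state density is differentiable in its parameters, a sigmoid of a log-likelihood difference gives a first-order differentiable surrogate for the error count.

    ```latex
    % Observation density of HMM state j (standard continuous mixture form):
    b_j(\mathbf{o}) = \sum_{m=1}^{M} c_{jm}\,
      \mathcal{N}(\mathbf{o};\,\boldsymbol{\mu}_{jm},\,\boldsymbol{\Sigma}_{jm})

    % Illustrative smoothed loss for an utterance O whose correct word is W,
    % with a soft maximum over competing words W' and constants \gamma, \eta:
    d(O) = \tfrac{1}{\eta}\log\!\sum_{W'\neq W} e^{\eta \log P(O\mid W')}
           - \log P(O\mid W), \qquad
    \ell(O) = \frac{1}{1 + e^{-\gamma\, d(O)}}

    % Since b_j, and hence \log P(O\mid W), is differentiable in
    % (c_{jm}, \mu_{jm}, \Sigma_{jm}), \ell(O) can be reduced by gradient descent.
    ```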
  • Sinobu Mizuta, Kunio Nakajima
    1992, Volume 13, Issue 6, Pages 389-393
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    In this paper, a training method for continuous mixture density HMMs, named optimal discriminative training (ODT), and its implementation for speech recognition in noise are described. ODT is a corrective learning method applied to continuous mixture density HMMs, which are especially useful for speaker-independent speech recognition. Under noisy environments the recognition categories are liable to be confused, so a greater improvement in recognition accuracy can be expected from ODT. Here, we describe the ODT training algorithm and demonstrate, through word recognition experiments in noise, its effect in improving robustness to adverse environments.
    Download PDF (2432K)
  • Kiyoaki Aikawa
    1992, Volume 13, Issue 6, Pages 395-402
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    This paper proposes a novel neural network architecture for phoneme-based speech recognition. The new architecture is composed of five time-warping sub-networks and an output layer which integrates the sub-networks. Each time-warping sub-network has a different time-warping function embedded between the input layer and the first hidden layer, and recognizes the input speech while warping the time axis with its own function. The network is called the Time-Warping Neural Network (TWNN). The purpose of this network is to cope with the temporal variability of acoustic-phonetic features. The TWNN demonstrates higher phoneme recognition accuracy than a baseline recognizer composed of time-delay neural networks with a linear time alignment mechanism. (A schematic of the architecture is sketched after this entry.)
    Download PDF (2090K)
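    The following is a schematic numpy rendering of the architecture described above: fixed warping functions in front of parallel sub-networks, integrated by an output layer. The specific warping functions, layer sizes, and nonlinearities are placeholders of ours, not the paper's.

    ```python
    import numpy as np

    def warp(frames, rate, L=10):
        """Resample a (T, d) frame sequence to fixed length L under a crude
        power-law time warp; this only illustrates embedding a fixed warping
        function in front of a sub-network."""
        T = len(frames)
        pos = np.clip((np.linspace(0, 1, L) ** rate) * (T - 1), 0, T - 1)
        lo = np.floor(pos).astype(int)
        hi = np.minimum(lo + 1, T - 1)
        w = pos - lo
        return (1 - w)[:, None] * frames[lo] + w[:, None] * frames[hi]

    def twnn_forward(frames, subnets, out_w):
        """Each sub-network sees the input under its own warp; the output layer
        integrates the sub-network activations (architecture is schematic)."""
        acts = []
        for rate, (W1, W2) in subnets:                  # one fixed warp per sub-network
            h = np.tanh(warp(frames, rate).ravel() @ W1)
            acts.append(np.tanh(h @ W2))
        return 1 / (1 + np.exp(-(np.concatenate(acts) @ out_w)))  # phoneme scores
    ```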
  • Yoshinaga Kato, Masahide Sugiyama
    1992, Volume 13, Issue 6, Pages 403-409
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    This paper describes the application of Fuzzy Partition Models (FPMs) and their incremental training to continuous speech recognition. FPMs are neural networks with multiple input-output units. Since the outputs are non-negative and sum to one, they can be regarded as the probabilities of the phonemes recognized in the input speech. Automatic incremental training is developed using Viterbi alignment to adapt FPMs to continuous speech: the FPMs are retrained automatically on speech data segmented by the Viterbi alignment. We combined FPMs with an LR parser (FPM-LR) and carried out continuous speech recognition experiments. The recognition rate of the FPM-LR was higher than that of a Time-Delay Neural Network-LR (TDNN-LR), and automatic incremental training was more effective with FPMs than with TDNNs. (The incremental training loop is sketched after this entry.)
    Download PDF (891K)
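    A schematic of the automatic incremental training loop described in the abstract, in our own paraphrase; `align` and `retrain` stand in for the Viterbi segmentation and FPM training procedures and are assumptions, not the paper's API.

    ```python
    def incremental_training(fpm, utterances, transcripts, align, retrain, rounds=3):
        """Illustrative loop: segment continuous speech with Viterbi alignment
        under the current model, then retrain the FPM on those segments."""
        for _ in range(rounds):
            segments = []
            for speech, phonemes in zip(utterances, transcripts):
                # Viterbi alignment assigns a time interval to every phoneme label
                segments += align(fpm, speech, phonemes)
            fpm = retrain(fpm, segments)                # adapt the model to continuous speech
        return fpm
    ```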
  • Jun-ichi Takami, Atsuhiko Kai, Shigeki Sagayama
    1992, Volume 13, Issue 6, Pages 411-418
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    This paper describes a pairwise discrimination approach using artificial neural networks for robust phoneme recognition and its application to continuous speech recognition. It is known that classification-type neural networks show poor robustness against differences in speaking rate between training and testing data. To improve the robustness, we developed Pairwise Discriminant Time-Delay Neural Networks (PD-TDNNs) by applying the principle of pair discrimination to a conventional Time-Delay Neural Network. In this approach, pair discrimination scores for all combinations of two phonemes are calculated by PD-TDNNs, each of which has a less sharp discrimination boundary, and the final phoneme candidates are decided by majority decision over the pair discrimination scores. Through phoneme and continuous speech recognition experiments, it was found that this approach performs better than the conventional TDNN. (The majority-decision step is sketched after this entry.)
    Download PDF (1176K)
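    A minimal sketch of the pairwise majority decision described above. `pair_score` stands in for one trained PD-TDNN per phoneme pair; its interface is an assumption of ours.

    ```python
    from itertools import combinations

    def pairwise_decision(x, classes, pair_score):
        """Decide a phoneme by majority vote over all two-class discriminators.

        pair_score(x, a, b) is assumed to return a value in (0, 1): > 0.5
        favours class a over class b.
        """
        votes = {c: 0 for c in classes}
        for a, b in combinations(classes, 2):          # every pair of phonemes
            s = pair_score(x, a, b)
            votes[a if s > 0.5 else b] += 1            # winner of the pair gets a vote
        return max(votes, key=votes.get)               # phoneme with most pairwise wins
    ```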
  • Kentaro Kurinami, Masahide Sugiyama
    1992, Volume 13, Issue 6, Pages 419-427
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    This paper proposes a new optimization technique for speaker mapping neural network training using the minimal classification error criterion. Recently, neural network modeling has been widely applied to various fields of speech processing. Most neural network applications are classification tasks; however, one of the authors of this paper proposed a speaker mapping neural network as a non-linear continuous mapping application and showed its effectiveness. Meanwhile, the minimal classification error optimization technique has been proposed and applied to several recognition architectures. Since conventional speaker mapping neural networks have been trained under a minimal distortion criterion, the minimal classification error optimization technique is expected to provide better speaker mapping neural networks. This paper describes the speaker mapping neural network and the minimal classification error optimization technique, derives the corresponding training algorithm for the speaker mapping neural network, and investigates the relationship between the derived algorithm and the conventional back-propagation algorithm. Vowel classification experiments are carried out, showing the effectiveness of the proposed algorithm. (The two training criteria are contrasted after this entry.)
    Download PDF (1293K)
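    The two criteria at issue, written out in our own notation as an illustration (the paper's exact definitions may differ): a mapping network f_theta trained to minimize spectral distortion versus one trained to minimize a smoothed classification error of the mapped vectors.

    ```latex
    % Minimal distortion criterion (conventional training of the mapping network):
    E_{\mathrm{dist}}(\theta) = \sum_{t}\bigl\| f_\theta(\mathbf{x}_t) - \mathbf{y}_t \bigr\|^2

    % Minimal classification error criterion: pass the mapped vector to the
    % recognizer's discriminant functions and minimize a smoothed error count,
    % with d_k the usual MCE misclassification measure and \gamma a constant:
    E_{\mathrm{mce}}(\theta) = \sum_{t}
      \frac{1}{1 + \exp\bigl(-\gamma\, d_{k_t}\bigl(f_\theta(\mathbf{x}_t)\bigr)\bigr)}
    ```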
  • AI class-description learning viewpoint
    Yoichi Takebayashi
    1992, Volume 13, Issue 6, Pages 429-439
    Published: 1992
    Released on J-STAGE: February 17, 2011
    JOURNAL FREE ACCESS
    This paper describes the learning mechanism employed in a highly efficient user-adaptive speech recognizer based on the subspace method for large-vocabulary Japanese text input. Comparing the subspace-based learning system with the well-known AI learning system ARCH, the following points are made: (1) Subspace learning using covariance matrix modification and KL-expansion is a kind of class-description learning from examples, as found in ARCH. The subspace learning method focuses on feature extraction, which yields a powerful representation of the pattern characteristics of each class, rather than mere pattern classification as in conventional pattern recognition methods. (2) The concepts of “Near-Miss,” “Require-Link,” and “Forbid-Link” in ARCH can be simulated with the subspace method. Since the subspace method deals with patterns rather than symbols, it does not need pattern-symbol conversion; in other words, the subspace learning method has a more versatile description capability than ARCH. (3) Minsky's concept of the “Uniframe” is implemented in a speech recognizer based on the subspace method: the “Uniframe” obtained with KL-expansion is equivalent to a subspace which represents the meaning of a class. Minsky's “Accumulation” and “Exception Principle” concepts have also been taken into account. (A sketch of subspace construction and classification follows this entry.)
    Download PDF (1712K)
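    A minimal sketch of subspace-method classification as commonly formulated: one subspace per class from the KL-expansion, and classification by projection norm. The user-adaptive covariance matrix modification step mentioned in the abstract is omitted, and all names below are ours.

    ```python
    import numpy as np

    def class_subspaces(training, dim):
        """Build one subspace per class by KL-expansion (eigen-decomposition of
        the class correlation matrix); keep the 'dim' leading eigenvectors."""
        bases = {}
        for label, vectors in training.items():        # vectors: (n, d) array per class
            R = vectors.T @ vectors / len(vectors)      # class correlation matrix
            _, eigvecs = np.linalg.eigh(R)              # eigenvalues in ascending order
            bases[label] = eigvecs[:, -dim:]            # dominant eigenvectors as basis
        return bases

    def classify(x, bases):
        """Assign x to the class whose subspace captures most of its energy,
        i.e. the largest squared projection norm."""
        return max(bases, key=lambda c: np.sum((bases[c].T @ x) ** 2))
    ```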