Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 6, Issue 2
Displaying 1-8 of 8 articles from this issue
  • [in Japanese]
    1999 Volume 6 Issue 2 Pages 1-8
    Published: January 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (814K)
  • NOBUYASU ITOH, MASAFUMI NISHIMURA, SHIHO OGINO, KAZUTAKA YAMASAKI
    1999 Volume 6 Issue 2 Pages 9-27
    Published: January 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    This paper deals with a word-based language model of Japanese. In Japanese, word boundaries are not stable, and grammatical units do not necessarily coincide with human intuition. Accurate segmentation therefore requires a vocabulary set that covers human utterance units. In our word-segmentation method, a word-boundary model is described by morphological parameters (i.e., part of speech), which are learned by comparing the results of human segmentation with those of a Japanese morphological analyzer. Then, using pseudo-random numbers and the model, each morpheme transition is judged to be a word boundary or not. As a result, we automatically obtain a vocabulary set and training data for a Japanese language model. In experiments using articles from three newspapers and texts posted to network-based forums, about 44,000 words covered 94-98% of all words in the test data, and the average number of words per sentence was 12-19% smaller than the number of morphemes. The parameters of the word-segmentation model and the language model differ considerably between newspaper articles and forum texts. However, the difference lies not in the probabilities of common events but in the kinds of events. Therefore, a language model created from both newspaper articles and forum texts gave satisfactory results for both test sets.
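    The stochastic boundary decision described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the POS tags, probabilities, and example sentence are all invented.

```python
import random

# Hypothetical boundary probabilities P(boundary | left POS, right POS); the
# paper learns such parameters by comparing human segmentation with the
# output of a morphological analyzer. These values are invented.
BOUNDARY_PROB = {
    ("noun", "noun"): 0.3,        # compound nouns often stay joined
    ("noun", "particle"): 0.95,   # content word -> function word usually splits
    ("particle", "verb"): 0.9,
}

def segment(morphemes, rng=None):
    """Stochastically merge a morpheme sequence into words.

    `morphemes` is a list of (surface, pos) pairs; each transition is
    declared a word boundary with the learned probability, so repeated
    sampling yields a word vocabulary and training data.
    """
    rng = rng or random.Random(0)
    words, current = [], morphemes[0][0]
    for (_, lpos), (surf, rpos) in zip(morphemes, morphemes[1:]):
        p = BOUNDARY_PROB.get((lpos, rpos), 0.5)  # unseen pair: coin flip
        if rng.random() < p:
            words.append(current)   # boundary: start a new word
            current = surf
        else:
            current += surf         # no boundary: extend the current word
    words.append(current)
    return words

tokens = [("東京", "noun"), ("都", "noun"), ("に", "particle"), ("住む", "verb")]
print(segment(tokens))
```

    Because boundaries are sampled rather than fixed, running the sampler many times over a corpus produces the vocabulary and language-model training data automatically, as the abstract describes.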
    Download PDF (1779K)
  • HIROKI MORI, HIROTOMO ASO, SHOZO MAKINO
    1999 Volume 6 Issue 2 Pages 29-40
    Published: January 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    This paper proposes a novel, knowledge-free language model with a great ability to reduce ambiguity. The model is defined as an n-gram over strings referred to as "superwords," and it belongs to a superclass of the traditional word and string n-gram model classes. The concept of the superword is based on only one principle: repetition in the training text. The probabilistic distribution of the model is learned with the forward-backward algorithm. Experimental results showed that the superword model combined with a character trigram model outperformed both the traditional word model based on morphological analysis and a traditional string-based model.
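    The repetition principle behind superwords can be illustrated with a greedy substring count. Note this is only a sketch of the candidate-collection idea; the paper learns the distribution with the forward-backward algorithm, not with this counting heuristic, and the length cap and count threshold here are invented.

```python
from collections import Counter

def superwords(text, max_len=4, min_count=2):
    """Collect character strings of length 2..max_len that repeat in the
    training text; repetition is the only criterion -- no dictionary and no
    morphological knowledge are used."""
    counts = Counter(
        text[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s for s, c in counts.items() if c >= min_count}

# Repeated strings such as "東京都" qualify; one-off strings such as "天気" do not.
found = superwords("東京都の東京都の天気")
print(sorted(found))
```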
    Download PDF (1152K)
  • HIROKAZU MASATAKI, YOSHINORI SAGISAKA
    1999 Volume 6 Issue 2 Pages 41-57
    Published: January 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In this paper, a Japanese morphological analyzer is proposed that uses an N-gram over composite parts of speech (POS) and morpheme sequences (composite N-gram). The composite N-gram is an N-gram language model whose units are POS classes, morphemes, and morpheme sequences, and it provides excellent prediction ability from a small corpus. To handle unknown words, we improved the composite N-gram by considering the probability that an unknown word is generated from a POS class. Experimental results showed that the morpheme accuracy of the composite N-gram reached a maximum of 99.17%, better than that of a conventional rule-based method. When pronunciation was also included in the evaluation, the accuracy was 98.68%. When applied to sentences containing unknown words, the drop in morpheme accuracy was only about 0.8%.
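    The idea of mixing morphemes and POS classes as model units, with unknown words scored through their POS class, can be sketched as below. The corpus, frequency threshold, smoothing, and flat emission probability are all invented for illustration.

```python
from collections import Counter

# Toy corpus of (morpheme, POS) pairs.
corpus = [
    [("今日", "noun"), ("は", "particle"), ("晴れ", "noun")],
    [("明日", "noun"), ("は", "particle"), ("雨", "noun")],
]

freq = Counter(m for sent in corpus for m, _ in sent)

def unit(morpheme, pos):
    """Composite unit: keep a frequent morpheme as its own unit; back a rare
    or unseen morpheme off to its POS class (invented threshold of 2)."""
    return morpheme if freq[morpheme] >= 2 else pos

# Bigram counts over composite units, with sentence markers.
bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    units = ["<s>"] + [unit(m, p) for m, p in sent] + ["</s>"]
    unigrams.update(units)
    bigrams.update(zip(units, units[1:]))

def prob(prev, cur, pos):
    """Add-one-smoothed P(cur | prev) through the composite unit; an unknown
    morpheme is scored via its POS class times a flat emission probability."""
    u = unit(cur, pos)
    p_unit = (bigrams[(prev, u)] + 1) / (unigrams[prev] + len(unigrams))
    emission = 1.0 if u == cur else 0.1  # invented P(morpheme | POS class)
    return p_unit * emission

print(prob("<s>", "京都", "noun"))  # unseen noun, scored via its POS class
```

    The key point is that "京都" never occurs in training, yet it still receives a nonzero probability through the noun class, which is how the model avoids failing on unknown words.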
    Download PDF (1794K)
  • HISAKO ASANO, KOJI MATSUOKA, SHINICHIRO TAKAGI, HISASHI OHARA
    1999 Volume 6 Issue 2 Pages 59-81
    Published: January 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In order for Japanese text-to-speech synthesis to produce highly natural speech, it is necessary to correctly generate reading-and-prosodic information, that is, information about readings, accents, pauses, and so on. This paper describes a method of generating reading-and-prosodic information that uses morphological analysis based on the multi-level analysis method, which deeply analyzes compound words and heteronyms; the word-dictionary information used in the method is also described. The main characteristics of this generation method are: (1) long-unit word recognition in the morphological analysis to support generating reading-and-prosodic information, (2) accentual phrase assignment using semantic dependency relationships within compound words, and (3) pause insertion based on multi-level assignment using local structures in compound words and the connective strength of accentual phrases instead of the dependency relationships of syntactic phrases. In an evaluation on news texts, this method generated reading-and-prosodic information with 95% accuracy on closed data and 91% accuracy on open data. These results show the effectiveness of the method.
    Download PDF (3639K)
  • TOSHIYUKI TAKEZAWA, TSUYOSHI MORIMOTO
    1999 Volume 6 Issue 2 Pages 83-95
    Published: January 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    The utterance units that serve as input to speech translation and spoken dialogue systems handling spontaneous speech are not always sentences, yet the processing units of language translation are sentences. Since we do not have enough knowledge about the sentences of spoken language, we use the term "meaningful chunks" instead of sentences. First, using conventionally interpreted dialogue data, we show that utterance units sometimes need to be divided into several meaningful chunks and sometimes need to be connected to form a single meaningful chunk. Next, we propose a method of transforming utterance units into meaningful chunks based on pause information and an N-gram of fine-grained part-of-speech subcategories. We have conducted experiments and confirmed that our method yields good results.
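    The combination of pause information with a POS-subcategory model can be sketched as a simple boundary scorer. The tags, probabilities, and the 0.5 threshold are invented for this sketch, not the paper's parameters.

```python
# Illustrative boundary probabilities over fine-grained POS subcategory pairs.
BOUNDARY_GIVEN_POS = {
    ("final-particle", "conjunction"): 0.9,  # "...ne / sorede..." often splits
    ("noun", "particle"): 0.05,              # noun + case particle stays together
}

def chunk(words, pauses):
    """Split an utterance into meaningful chunks.

    `words` is a list of (surface, fine_pos) pairs; `pauses[i]` is True when
    a pause follows word i. A pause raises the boundary score at that point.
    """
    chunks, cur = [], [words[0][0]]
    for i, ((_, lpos), (surf, rpos)) in enumerate(zip(words, words[1:])):
        p = BOUNDARY_GIVEN_POS.get((lpos, rpos), 0.2)  # default for unseen pairs
        if pauses[i]:
            p = min(1.0, p + 0.5)  # acoustic pause as extra boundary evidence
        if p >= 0.5:
            chunks.append(cur)     # declare a chunk boundary here
            cur = [surf]
        else:
            cur.append(surf)
    chunks.append(cur)
    return chunks

utterance = [("はい", "final-particle"), ("それで", "conjunction"),
             ("予約", "noun"), ("を", "particle")]
print(chunk(utterance, [False, False, False]))
```

    With a pause inserted after "それで", the same utterance splits into three chunks instead of two, showing how the two information sources combine.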
    Download PDF (1312K)
  • SEIICHI NAKAGAWA, HIROTAKA AKAMATSU, HIROMITSU NISHIZAKI
    1999 Volume 6 Issue 2 Pages 97-115
    Published: January 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In this paper, we describe a method that constructs language models using a task-adaptation strategy and the idiomatic expressions of news articles. To build an effective N-gram language model, as much training data as possible must be prepared; however, for a given task or topic, it is very difficult to gather a large amount of data. First, we investigated the effect of a task-adaptation method for N-gram language models using a limited amount of target articles. Second, we investigated the effect of adapting the language model using the latest articles. Third, we investigated the effect of using idiomatic expressions as morpheme units, since specific and idiomatic expressions are frequently observed in news articles. We show that the three proposed methods are effective for constructing N-gram language models.
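    A standard way to adapt a general N-gram model with a small amount of in-task text is linear interpolation; the sketch below illustrates that idea. The corpora and the mixing weight are invented, and the abstract does not state that this exact scheme is the paper's method.

```python
from collections import Counter

def bigram_lm(sentences):
    """Add-one-smoothed bigram model estimated from a list of token lists."""
    big, uni, vocab = Counter(), Counter(), set()
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        vocab.update(toks)
        uni.update(toks[:-1])
        big.update(zip(toks, toks[1:]))
    V = len(vocab)
    return lambda prev, w: (big[(prev, w)] + 1) / (uni[prev] + V)

# Invented corpora: a larger general model and a small in-task model.
general = bigram_lm([["the", "market", "rose"], ["rain", "fell"]])
task = bigram_lm([["the", "market", "rose"], ["the", "market", "fell"]])

def adapted(prev, w, lam=0.7):
    # lam is a hypothetical mixing weight; in practice it would be tuned on
    # held-out target-task articles.
    return lam * task(prev, w) + (1 - lam) * general(prev, w)

print(adapted("the", "market"))
```

    The adapted probability leans toward the in-task statistics while the general model still covers events the small task corpus never saw.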
    Download PDF (3630K)
  • NAHOKO SATO, YUICHI KOJIMA, MASAKO MOTINUSHI, MASAYUKI KAMEDA
    1999 Volume 6 Issue 2 Pages 117-132
    Published: January 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    This paper presents a practical method that uses the results of grammatical dependency analysis within a text-to-speech conversion system to insert pauses. For listeners to understand the meaning of text converted into speech, appropriate pauses must be inserted at phrase boundaries. Previous studies proposed several approaches that use the results of simple analyses, such as morphological analysis or analysis of adjoining phrases, to determine the position and length of pauses; in those studies, the speech often had inappropriately placed pauses. In the present method, we introduce a fast Japanese parser into the text analysis, determine the distance and relationship between dependent phrases, and use this information to determine the position and length of pauses. The distance and relationship between dependent phrases are translated into a pause-insertion cost; each phrase boundary receives such a cost, and pauses are inserted according to the costs. To test the validity of the method, we implemented it in a text-to-speech conversion system and compared it with a previous pause-insertion method based on simple analyses such as morphological analysis or analysis of adjoining phrases. The results confirmed that the proposed method fulfilled our expectations.
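    The cost idea can be sketched as follows: a boundary where the left phrase depends on a distant head is a natural pause site, so it gets a low insertion cost. The cost formula and threshold are invented for this sketch.

```python
def pause_costs(heads):
    """`heads[i]` is the index of the phrase that phrase i depends on (each
    phrase depends on a later phrase, as in Japanese dependency structure).
    The farther away the head of the phrase ending at a boundary, the
    cheaper a pause is at that boundary."""
    return [1.0 / (heads[i] - i) for i in range(len(heads) - 1)]

def insert_pauses(phrases, heads, threshold=0.5):
    """Insert "<pause>" at boundaries whose cost falls below the threshold
    (an invented value for this sketch)."""
    out = [phrases[0]]
    for phrase, cost in zip(phrases[1:], pause_costs(heads)):
        if cost < threshold:
            out.append("<pause>")
        out.append(phrase)
    return out

# "昨日" depends on the distant head "読んだ", so a pause follows it; the
# remaining phrases depend on their immediate neighbors, so no pause there.
phrases = ["昨日", "買った", "本を", "読んだ"]
print(insert_pauses(phrases, [3, 2, 3, 3]))
```

    Translating structure into per-boundary costs like this is what lets the system rank all boundaries consistently instead of deciding each one in isolation.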
    Download PDF (2792K)