Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 6, Issue 3
Displaying 1-10 of 10 articles from this issue
  • [in Japanese]
    1999 Volume 6 Issue 3 Pages 1-2
    Published: April 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (233K)
  • Virach Sornlertlamvanich, Kentaro Inui, Hozumi Tanaka, Takenobu Tokuna ...
    1999 Volume 6 Issue 3 Pages 3-22
    Published: April 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    This paper presents empirical results for our probabilistic GLR parser, based on a new probabilistic GLR language model (PGLR), against existing models built on the same GLR parsing framework, namely the model proposed by Briscoe and Carroll (B & C) and the two-level PCFG, or pseudo context-sensitive grammar (PCSG), which is claimed to be a context-sensitive version of PCFG. We evaluate each model on character-based parsing (morphological and syntactic analysis) tasks, in which we have to consider word segmentation and multiple part-of-speech problems. Parsing a sentence from the morphological level makes the task much more complex because of the increase in parse ambiguity stemming from word segmentation ambiguities and multiple corresponding part-of-speech sequences. As a result of the well-founded probabilistic nature of PGLR, the model accurately incorporates probabilities for word prediction by encoding pre-terminal n-gram constraints into LR parsing tables. In experiments on the ATR Japanese corpus, the PGLR model empirically outperforms the other two models on all measures. To examine the appropriateness of PGLR using an LALR table, we test the PGLR model using both an LALR and a CLR table. The results show that parsing with the PGLR model using an LALR table gives the best performance in parse accuracy, parsing time, and memory consumption. (A simplified sketch of action-probability scoring follows this entry.)
    Download PDF (1792K)
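    The entry above scores a parse through probabilities attached to LR parsing actions, with pre-terminal n-gram constraints encoded in the table. Below is a minimal, illustrative Python sketch of the general idea of multiplying action probabilities along a derivation; the states, symbols, and numbers are invented, and the actual PGLR model normalizes shift and reduce actions per state more carefully than shown here.

      from functools import reduce
      from operator import mul

      # Toy action-probability table: (LR state, lookahead pre-terminal) -> {action: probability}.
      # States, symbols, and numbers are invented for illustration only.
      ACTION_PROB = {
          (0, "N"): {"shift 2": 0.7, "shift 3": 0.3},
          (2, "V"): {"reduce NP->N": 1.0},
          (4, "V"): {"shift 5": 0.6, "reduce S->NP V": 0.4},
      }

      def parse_probability(derivation):
          """Score a candidate parse as the product of the probabilities of the
          LR actions taken along its derivation (a simplified, PGLR-like scoring)."""
          probs = [ACTION_PROB[(state, lookahead)][action]
                   for state, lookahead, action in derivation]
          return reduce(mul, probs, 1.0)

      # One hypothetical derivation: at state 0 with lookahead N, shift to state 2, and so on.
      derivation = [(0, "N", "shift 2"), (2, "V", "reduce NP->N"), (4, "V", "shift 5")]
      print(parse_probability(derivation))  # 0.7 * 1.0 * 0.6 = 0.42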
  • TETSUO ARAKI, SATORU IKEHARA, NAOTO MISHINA
    1999 Volume 6 Issue 3 Pages 23-41
    Published: April 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    This paper proposes a method for detecting self-repair strings in spontaneous speech, using n-gram models and Markov models of syllable chains. The method comprises two steps: the first detects candidate self-repair strings using n-gram models; the second evaluates the Markov chain probability of the sentence in which the candidate self-repair strings appear. The method is applied to detecting self-repair strings in the ATR dialogue corpus. The experiment confirms that the method is effective in detecting self-repair strings inserted at arbitrary positions in sentences. (A simplified sketch of the Markov-chain step follows this entry.)
    Download PDF (6395K)
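    The second step described above compares the Markov-chain probability of a sentence with and without a candidate self-repair string. A minimal sketch, assuming a pre-estimated syllable bigram model; the romanized syllables, probabilities, and acceptance criterion below are invented for illustration and are not the paper's exact procedure.

      import math

      # Invented syllable-bigram probabilities P(next | previous); a real model would be
      # estimated from a spoken-dialogue corpus such as the ATR corpus.
      BIGRAM = {
          ("<s>", "ko"): 0.2, ("ko", "re"): 0.5, ("re", "wa"): 0.6,
          ("re", "ko"): 0.01, ("wa", "ho"): 0.3, ("ho", "n"): 0.7, ("n", "</s>"): 0.4,
      }
      FLOOR = 1e-6  # back-off probability for unseen syllable bigrams

      def log_markov_prob(syllables):
          """Log probability of a syllable chain under a first-order Markov model."""
          chain = ["<s>"] + syllables + ["</s>"]
          return sum(math.log(BIGRAM.get(pair, FLOOR)) for pair in zip(chain, chain[1:]))

      # Candidate self-repair string found by the n-gram step: the repeated "ko re".
      with_candidate    = ["ko", "re", "ko", "re", "wa", "ho", "n"]
      without_candidate = ["ko", "re", "wa", "ho", "n"]

      # If removing the candidate raises the Markov probability, treat it as a self-repair.
      if log_markov_prob(without_candidate) > log_markov_prob(with_candidate):
          print("candidate accepted as a self-repair string")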
  • HAJIME MOCHIZUKI, TAKEO HONDA, MANABU OKUMURA
    1999 Volume 6 Issue 3 Pages 43-58
    Published: April 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In general, a text consists of multiple sentences, and there are semantic relations among them. A certain range of sentences in a text is widely assumed to form a coherent unit, usually called a discourse segment. While sentences in a segment have semantic relations with each other, segments in a discourse also have relations with each other. The global discourse structure of a text can be constructed by relating the segments to each other. Therefore, identifying segment boundaries is a first step toward recognizing the structure of a text. There are many surface linguistic cues that help identify segment boundaries in a text. In this paper, we describe a method for identifying segment boundaries of a Japanese text with the aid of multiple surface linguistic cues, though our experiments are relatively small-scale. We calculate a weighted sum of the scores for all cues that reflects their contribution to identifying the correct segment boundaries. We also present a method of training the weights for the multiple linguistic cues automatically while avoiding overfitting. (A simplified sketch of the weighted-sum scoring follows this entry.)
    Download PDF (1581K)
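    The boundary scoring described above is a weighted sum of surface-cue scores. A minimal sketch, assuming hand-picked cue names, scores, and weights (in the paper the cues are Japanese surface cues and the weights are trained automatically):

      # Invented cue scores at each candidate boundary (index between sentences).
      CUE_SCORES = {
          1: {"conjunction": 1.0, "topic_marker": 0.0, "cohesion_drop": 0.2},
          2: {"conjunction": 0.0, "topic_marker": 1.0, "cohesion_drop": 0.8},
          3: {"conjunction": 0.0, "topic_marker": 0.0, "cohesion_drop": 0.1},
      }
      # Invented weights; the paper trains these automatically while avoiding overfitting.
      WEIGHTS = {"conjunction": 0.2, "topic_marker": 0.5, "cohesion_drop": 0.3}

      def boundary_score(cues):
          """Weighted sum of the surface-cue scores at one candidate boundary."""
          return sum(WEIGHTS[name] * value for name, value in cues.items())

      # Rank candidate boundaries; the highest-scoring ones are taken as segment boundaries.
      ranked = sorted(CUE_SCORES, key=lambda b: boundary_score(CUE_SCORES[b]), reverse=True)
      print(ranked)  # -> [2, 1, 3]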
  • SATOSHI SEKINE, KIYOTAKA UCHIMOTO, HITOSHI ISAHARA
    1999 Volume 6 Issue 3 Pages 59-73
    Published: April 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Dependency analysis is regarded as a standard method of Japanese syntactic analysis. As dependencies normally go from left to right, it is effective to parse from right to left, since we can then analyze predicates first. There have been several proposals for such methods using rule-based parsing. In this paper, we propose a Japanese dependency analysis that combines right-to-left parsing with a statistical method. It performs a beam search, an effective way of limiting the search space in right-to-left parsing. We obtain a dependency accuracy of 87.2% and a sentence accuracy of 40.8% on the Kyoto University corpus. Varying the beam width, we observed that the best performance was achieved when the width is small: 95% of the sentence analyses obtained with beam width 1 were the same as the best analyses with beam width 20. The N-best sentence accuracy for N=20 was 78.5%. The analysis speed was proportional to the square of the sentence length (number of segments), as predicted for the algorithm. The average analysis time was 0.03 seconds (average sentence length: 10.0 segments), and the longest sentence, with 41 segments, took 0.29 seconds. (A simplified sketch of the right-to-left beam search follows this entry.)
    Download PDF (1254K)
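    The parser above assigns each segment a head to its right, processing segments from right to left and pruning with a beam search. A minimal sketch, with a stand-in scoring function in place of the paper's statistical model and with no projectivity (non-crossing) constraint:

      import math

      def dep_prob(modifier, head):
          """Stand-in for the statistical dependency model; it simply prefers nearer heads.
          The paper's model is trained on the Kyoto University corpus."""
          return 1.0 / (head - modifier)

      def parse(segments, beam_width=5):
          """Right-to-left dependency analysis with a beam search over partial analyses."""
          n = len(segments)
          beam = [(0.0, {})]  # (log probability, modifier index -> head index)
          # The rightmost segment is the root; every other segment depends on some segment
          # to its right, so modifiers are processed from right to left.
          for mod in range(n - 2, -1, -1):
              candidates = []
              for logp, heads in beam:
                  for head in range(mod + 1, n):
                      candidates.append((logp + math.log(dep_prob(mod, head)),
                                         {**heads, mod: head}))
              # Keep only the beam_width best partial analyses.
              beam = sorted(candidates, key=lambda h: h[0], reverse=True)[:beam_width]
          return beam[0]

      print(parse(["kare-wa", "hon-o", "katta"], beam_width=1))  # -> (0.0, {1: 2, 0: 1})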
  • Kozo KIKUCHI, YUKIHIRO ITOH
    1999 Volume 6 Issue 3 Pages 75-99
    Published: April 10, 1999
    Released on J-STAGE: June 07, 2011
    JOURNAL FREE ACCESS
    Improving the accuracy of dependency analysis for long Japanese sentences is a major problem in natural language analysis. Recently, there has been a tendency to determine dependencies on the basis of statistical probabilities obtained by analyzing large corpora. In this paper we investigate the dependency behavior of adnominal phrases containing an i-adjective or na-adjective, using about 4,400 sentences extracted from a large collection of technical documents and newspaper articles. To determine the dependency, we analyze target sentences containing the adjective in the form "[noun-1] + [adjective] + [noun-2]", considering the relationships among the words. As a result, we find seven effective dependency rules, which are classified into the following three patterns:
    rules determined only from the preceding or following noun;
    rules determined from the relation between the preceding or following noun and the adjective;
    rules determined from the characteristics of the adjective itself.
    Download PDF (2213K)
  • HAJIME MOCHIZUKI, MAKOTO IWAYAMA, MANABU OKUMURA
    1999 Volume 6 Issue 3 Pages 101-126
    Published: April 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    The importance of document retrieval systems that can retrieve documents relevant to users' needs is increasing with the growing availability of full-text documents. In traditional document retrieval, each document is treated as a single unit. However, since long documents tend to contain various topics, passage-level retrieval has received more attention in recent document retrieval research. A passage can be considered a contiguous part of a document whose content is related to the content of the query. In passage retrieval, how to determine the passages is a central problem. Passages that form coherent semantic units are expected to improve accuracy effectively. Furthermore, it may also be effective to vary the size and position of passages flexibly depending on the query and the document. In this paper, we describe an improved method for computing passages that uses lexical chains. We also present a passage-level document retrieval method that improves retrieval accuracy. (A simplified sketch of chain-based passage ranking follows this entry.)
    Download PDF (5049K)
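    Passages above are derived from lexical chains and retrieved at the passage level. A minimal sketch, assuming that a lexical chain is approximated by the span between the first and last occurrences of a group of related words and that passages are ranked by simple query-term overlap; the paper's chaining and scoring are more elaborate.

      def chain_spans(tokens, related_groups):
          """Approximate each lexical chain by the span of text covered by a group of
          related words; `related_groups` stands in for thesaurus-based relatedness."""
          spans = []
          for group in related_groups:
              positions = [i for i, tok in enumerate(tokens) if tok in group]
              if positions:
                  spans.append((min(positions), max(positions) + 1))
          return spans

      def rank_passages(tokens, spans, query_terms):
          """Score each chain-derived passage by query-term overlap and rank the passages."""
          scored = []
          for start, end in spans:
              passage = tokens[start:end]
              scored.append((sum(passage.count(term) for term in query_terms), (start, end)))
          return sorted(scored, reverse=True)

      doc = ("the index stores postings . compression of postings saves space . "
             "unrelated closing remarks follow").split()
      groups = [{"index", "postings"}, {"compression", "space"}]
      print(rank_passages(doc, chain_spans(doc, groups), ["postings", "compression"]))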
  • SHINICHI ANDO, YVES LEPAGE
    1999 Volume 6 Issue 3 Pages 127-143
    Published: April 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    We present a linguistic structure analysis method using a treebank. This method falls under example-based approaches to natural language processing. It, however, analyzes an input sentence by analogy, based on a relationship between the input and similar sentences in the treebank. Because our method directly utilizes examples in a treebank, it is easy to integrate with other parsing methods and may be helpful for disambiguation in parsing. In addition, the method requires no dictionaries, so it works robustly even when input sentences contain unknown words. Experiments using the Penn Treebank show that correct analyses are produced for about 70% of input sentences, and analyses similar to the correct ones are generated for the remaining sentences.
    Download PDF (1556K)
  • Katerina T. Frantzi, Sophia Ananiadou
    1999 Volume 6 Issue 3 Pages 145-179
    Published: April 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In this paper we present a domain-independent method for the automatic extraction of multi-word (technical) terms from machine-readable special language corpora. The method (C-value/NC-value) combines linguistic and statistical information. The first part, C-value, enhances the common statistical measure of frequency of occurrence for term extraction, making it sensitive to a particular type of multi-word term, the nested term. Nested terms are those that also exist as substrings of other terms. The second part, NC-value, provides two things: 1) a method for the extraction of term context words (words that tend to appear with terms), and 2) the incorporation of information from term context words into the extraction of terms. We apply the method to a medical corpus and compare the results with those produced by frequency of occurrence applied to the same corpus. Frequency of occurrence was chosen for the comparison since it is the most commonly used statistical method for automatic term extraction to date. We show that using C-value we improve the extraction of nested multi-word terms, while using context information (NC-value) we improve the extraction of multi-word terms in general. In the evaluation sections, we give directions for further improvement of the method. (A simplified implementation of the C-value measure follows this entry.)
    Download PDF (9325K)
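    The C-value part of the method discounts a candidate's frequency by the frequency it owes to the longer candidate terms that contain it. A small implementation of the C-value formula as given in the literature on this method; the candidate terms and counts below are invented, and the NC-value step (which adds context-word weights) is not shown.

      import math

      # Invented candidate multi-word terms with corpus frequencies.
      freq = {
          ("basal", "cell"): 120,
          ("basal", "cell", "carcinoma"): 90,
          ("adenoid", "basal", "cell", "carcinoma"): 10,
      }

      def c_value(candidate, freq):
          """C-value(a) = log2|a| * f(a) if a is not nested in longer candidates, otherwise
          log2|a| * (f(a) - (1/|T_a|) * sum of f(b) for b in T_a), where T_a is the set of
          longer candidate terms that contain a."""
          length_factor = math.log2(len(candidate))
          containing = [b for b in freq
                        if len(b) > len(candidate)
                        and any(b[i:i + len(candidate)] == candidate
                                for i in range(len(b) - len(candidate) + 1))]
          if not containing:
              return length_factor * freq[candidate]
          return length_factor * (freq[candidate] - sum(freq[b] for b in containing) / len(containing))

      for term in freq:
          print(" ".join(term), round(c_value(term, freq), 2))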
  • MAKOTO IWAYAMA, TAKENOBU TOKUNAGA
    1999 Volume 6 Issue 3 Pages 181-198
    Published: April 10, 1999
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    The difficulty in processing long documents is due to the variety of topics they contain. Long documents such as technical papers and reports include more topics than do short documents such as news articles. Since each topic in a long document tends to be relevant to only a small portion of the document, conventional text categorization, which tries to assign predefined topics to the entire document, has limited effectiveness. In this paper we study probabilistic passage categorization, which assigns predefined topics to each passage contained in a document. We show that the performance of passage categorization is superior to that of conventional text categorization, especially for long documents. We also discuss the possibility of applying passage categorization to topic-dependent text summarization, and show some preliminary experimental results. (A simplified sketch of passage-level categorization follows this entry.)
    Download PDF (1665K)
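    Passage categorization as described above assigns predefined topics to individual passages rather than to the whole document. A minimal sketch, assuming a naive Bayes classifier over passages and a document-level decision taken as the maximum over its passages; the paper's exact probabilistic model and decision rule may differ.

      import math
      from collections import Counter

      # Toy labeled data: (category, tokens); a real classifier is trained on a labeled corpus.
      TRAIN = [
          ("parsing", "grammar parser dependency tree".split()),
          ("retrieval", "query document relevance ranking".split()),
      ]

      def train_naive_bayes(data):
          """Laplace-smoothed per-category word probabilities (uniform class priors assumed)."""
          vocab = {w for _, toks in data for w in toks}
          model = {}
          for cat, toks in data:
              counts = Counter(toks)
              total = sum(counts.values())
              model[cat] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
          return model

      def passage_log_score(model, cat, passage):
          return sum(math.log(model[cat].get(w, 1e-9)) for w in passage)

      def categorize_document(model, passages):
          """Document-level score for each category is its best score over the passages."""
          return {cat: max(passage_log_score(model, cat, p) for p in passages) for cat in model}

      model = train_naive_bayes(TRAIN)
      passages = ["parser builds dependency tree".split(),
                  "report document ranking results".split()]
      scores = categorize_document(model, passages)
      print(max(scores, key=scores.get))  # -> parsing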