Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 7, Issue 2
Displaying 1-9 of 9 articles from this issue
  • [in Japanese]
    2000 Volume 7 Issue 2 Pages 1-2
    Published: April 10, 2000
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (194K)
  • TAKAKO TSUJI, MASAO FUKETA, KAZUHIRO MORITA, JUN-ICHI AOE
    2000 Volume 7 Issue 2 Pages 3-26
    Published: April 10, 2000
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
Although much research on text classification is based on term information from the whole text, humans can recognize the field of a text by finding a small number of specific words in it. In this paper, such terms, which can be directly related to the field of a text, are called field association (FA) terms. Single-word FA terms can be collected exhaustively because their number is finite, but selecting useful compound FA terms from the huge number of combinations of single-word FA terms is difficult. For FA terms, five association levels are defined, and two kinds of ranks, based on stability and inheritance, are presented. Redundant candidates for compound FA terms can be removed remarkably effectively by using these levels and ranks. Simulation results on Japanese text files from 180 fields show that the 88,782 candidates for compound FA terms can be reduced to 8,405, about 9% of the original, with recall and precision above 0.77 and 0.90, respectively. Experimental results on field determination using FA terms for 264 text fragments show that the presented method attains an accuracy of more than 90%, about 30% higher than when only single-word FA terms are used.
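The field-determination idea in this abstract can be illustrated with a minimal sketch: assign a text to the field whose FA terms occur in it most often. The term lists and example text below are invented, and this is not the paper's actual scoring scheme (which uses association levels and ranks).

```python
# Toy sketch of field determination with field association (FA) terms:
# the field whose FA terms appear most often in the text wins.
# FA_TERMS and the example sentence are invented for illustration.
from collections import Counter

FA_TERMS = {
    "baseball": {"pitcher", "home run", "inning"},
    "economy": {"stock price", "inflation", "exchange rate"},
}

def determine_field(text: str) -> str:
    """Return the field whose FA terms occur most often in the text."""
    scores = Counter()
    lowered = text.lower()
    for field, terms in FA_TERMS.items():
        for term in terms:
            scores[field] += lowered.count(term)
    field, score = scores.most_common(1)[0]
    return field if score > 0 else "unknown"

print(determine_field("The pitcher threw a home run ball in the ninth inning."))
# -> baseball
```

The paper's contribution is precisely in making the FA-term dictionary tractable (pruning compound-term candidates), which this sketch takes as given.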
    Download PDF (2458K)
  • TAKEHIKO YOSHIMI, ICHIKO SATA
    2000 Volume 7 Issue 2 Pages 27-43
    Published: April 10, 2000
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
Since the headlines of English news articles have a characteristic style, different from the styles that prevail in ordinary sentences, it is difficult for MT systems to generate high-quality translations of headlines. We try to solve this problem by adding to an existing system a pre-editing module that rewrites headlines into ordinary expressions. Rewriting headlines makes it possible to generate better translations, which would not otherwise be generated, with little or no change to the existing parts of the system. While most MT systems would probably not accept, for example, the headline “Sales up sharply in June”, they would be able to generate a satisfactory translation of the expression “Sales were up sharply in June”, where the verb “were” has been inserted. Focusing on a conspicuous phenomenon, the absence of a form of the verb “be”, we describe rewriting rules for properly inserting the verb “be” into headlines, based on information obtained by morpholexical and rough syntactic analysis. We have incorporated the proposed method into our English-to-Japanese MT system Power E/J and carried out an experiment with 312 headlines as unknown data. Our method achieved a satisfactory 81.2% recall and 92.0% precision.
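A single toy rewriting rule, far cruder than the morpholexically informed rules the paper describes, conveys the flavor of the pre-editing step: insert a form of "be" between a plural subject and a predicative "up"/"down".

```python
# Toy pre-editing rule (NOT the authors' actual rules): insert "were"
# between an apparent plural subject and "up"/"down" in a headline.
import re

def insert_be(headline: str) -> str:
    """Insert 'were' between a word ending in 's' and 'up'/'down'."""
    return re.sub(r"\b(\w+s)\s+(up|down)\b", r"\1 were \2", headline)

print(insert_be("Sales up sharply in June"))  # Sales were up sharply in June
```

The real module must also decide tense and number and avoid false matches, which is why the paper grounds its rules in morpholexical and rough syntactic analysis rather than surface patterns.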
    Download PDF (1812K)
  • MASAKI HISANO
    2000 Volume 7 Issue 2 Pages 45-61
    Published: April 10, 2000
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
Using the 1991-1997 Mainichi Shimbun newspaper CD-ROMs, containing about 340 million characters, the nature of character use was explored. Significant differences in mean occurrence rates among 16 page types (e.g., editorial, sports, local) were observed for 69.2% of the 5,726 character types that covered all cases in the corpus except the space character. Similarly, 20.3% of those showed significant month-level (seasonal) variations in occurrence rates, and 43.9% showed significant year-level variations (trends). When limited to the 2,732 frequent character types that each accounted for more than 0.001‰ of the corpus, these tendencies became clearer: the rates of character types showing significant variations in occurrence rates by page type, month, and year were 98.4%, 33.5%, and 76.0%, respectively. These results suggest that there may be a vast range of systematic variations in lexical use, which have been overlooked in simple summations over large corpora.
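The abstract does not name the significance test used, but the comparison it describes, whether a character's occurrence rate differs across page types, can be sketched with a Pearson chi-square statistic against expected counts under a uniform rate. The counts below are made up.

```python
# Schematic significance check (the paper's exact test is not stated here):
# Pearson chi-square statistic for one character's occurrence counts
# across page types, against a uniform-rate expectation.

def chi_square(observed, totals):
    """observed[i]: occurrences on page type i; totals[i]: total chars there."""
    overall_rate = sum(observed) / sum(totals)
    stat = 0.0
    for o, n in zip(observed, totals):
        expected = overall_rate * n
        stat += (o - expected) ** 2 / expected
    return stat

# A character far more frequent on one page type yields a large statistic:
print(chi_square([120, 30, 25], [100_000, 100_000, 100_000]))
```

With 16 page types there would be 15 degrees of freedom, so the statistic would be compared against the corresponding chi-square critical value.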
    Download PDF (1858K)
  • KIYOTAKA UCHIMOTO, QING MA, MASAKI MURATA, HIROMI OZAKU, MASAO UTIYAMA ...
    2000 Volume 7 Issue 2 Pages 63-90
    Published: April 10, 2000
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
This paper describes a system for extracting named entities. The system is based on an ME (maximum entropy) model and transformation rules. Eight types of named entities are defined by IREX-NE, and each named entity consists of one or more morphemes or includes a substring of a morpheme. We define 40 named-entity labels, which mark the beginning, middle, or end of a named entity, and extract named entities consisting of one or more morphemes by estimating these labels with the ME model. The trained ME model detects the relationship between features and the named-entity labels assigned to morphemes. The features are the clues used for estimating labels. We use information about the lexical item and part-of-speech of the target morpheme as features. We also use, as features, information about the lexical items and parts-of-speech of four surrounding morphemes, two to the left and two to the right of the target morpheme. After estimating the named-entity labels with the ME model, we extract named entities that include a substring of a morpheme by using transformation rules. These rules are automatically acquired by investigating the differences between the named-entity labels in a tagged corpus and those extracted by our system from the same corpus without tags. This paper also evaluates the relationships between transformation rules and accuracy, between features and accuracy, and between the amount of training data and accuracy through several comparative experiments.
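The five-morpheme feature window the abstract describes (the target morpheme plus two on each side, each contributing its lexical item and part-of-speech) can be sketched as follows. The morphemes, tag names, and feature-string format are invented for illustration.

```python
# Sketch of the feature window described in the abstract: word and POS
# features at offsets -2..+2 around the target morpheme, padded at the
# sentence boundaries. Tag names and feature format are illustrative.

def window_features(morphemes, pos_tags, i, width=2):
    """Features for morpheme i: word/POS at offsets -width..+width."""
    feats = []
    n = len(morphemes)
    for off in range(-width, width + 1):
        j = i + off
        word = morphemes[j] if 0 <= j < n else ("BOS" if j < 0 else "EOS")
        pos = pos_tags[j] if 0 <= j < n else ("BOS" if j < 0 else "EOS")
        feats.append(f"w[{off}]={word}")
        feats.append(f"p[{off}]={pos}")
    return feats

print(window_features(["Taro", "wa", "Kyoto", "ni", "iku"],
                      ["NOUN", "PART", "NOUN", "PART", "VERB"], 2))
```

In the paper these features feed an ME model that scores each of the 40 begin/middle/end labels for the target morpheme; this sketch covers only the feature-extraction step.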
    Download PDF (2744K)
  • MASAO UTIYAMA, MASAKI MURATA, QING MA, KIYOTAKA UCHIMOTO, HITOSHI ISAH ...
    2000 Volume 7 Issue 2 Pages 91-116
    Published: April 10, 2000
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
This paper describes a statistical approach to the interpretation of metonymy. In metonymy, the name of one thing (the source) is substituted for that of another thing related to it (the target). For example, in “Souseki wo yomu” (read a Souseki), the source “a Souseki” is substituted for the target “a novel written by Souseki”. In this case, they are related by an Artist-for-Artform relation. The method in this paper follows the procedure below in interpreting a metonymy.
(1) Given a metonymy (noun A, case-marker R, predicate V), nouns related to the source A are collected from a corpus.
    (2) From the collected nouns, a candidate for the target that satisfies the constraints imposed by R and V is selected by applying a statistical criterion.
The method was tested experimentally. The experiments showed that a corpus is valuable for extracting nouns related to a given source, and that the proposed statistical criterion can select a good target from the extracted nouns. The precision was 0.47 under a rigorous judgment and 0.65 under a less rigorous judgment. The effectiveness of the proposed method has thus been demonstrated.
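Step (2) above can be sketched with a simple stand-in criterion: rank candidate targets by how often each co-occurs with the predicate under the given case marker, and pick the most frequent. The co-occurrence table is a toy, and the paper's actual statistical criterion is more elaborate.

```python
# Toy stand-in for the target-selection step: choose the candidate noun
# that most often fills the (case_marker, predicate) slot in a corpus.
# COOC counts are invented; the paper's criterion is more sophisticated.

COOC = {
    ("novel", "wo", "read"): 120,
    ("photo", "wo", "read"): 2,
    ("novel", "wo", "eat"): 0,
}

def select_target(candidates, case_marker, predicate):
    """Pick the candidate most frequent with (case_marker, predicate)."""
    return max(candidates,
               key=lambda n: COOC.get((n, case_marker, predicate), 0))

# For 'read a Souseki', with candidates gathered in step (1):
print(select_target(["novel", "photo"], "wo", "read"))  # novel
```

The intuition is the same as in the paper: the constraints that R and V impose on their argument are estimated from corpus evidence rather than hand-coded selectional restrictions.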
    Download PDF (2688K)
  • Rila Mandala, Takenobu Tokunaga, Hozumi Tanaka
    2000 Volume 7 Issue 2 Pages 117-140
    Published: April 10, 2000
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
This paper proposes the use of multiple thesaurus types for query expansion in information retrieval. A hand-crafted thesaurus, a corpus-based co-occurrence thesaurus, and a syntactic-relation-based thesaurus are combined and used as a tool for query expansion. Simple word sense disambiguation is performed to avoid misleading expansion terms. Experiments using the TREC-7 collection showed that this method can improve information retrieval performance significantly. Failure analysis was done on the cases in which the proposed method failed to improve retrieval effectiveness. We found that queries containing negative statements or multiple aspects may cause problems for the proposed method.
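One simple way to combine expansion candidates from several thesaurus types, used here purely as an illustration and not as the paper's weighting scheme, is to rank each candidate by how many thesauri propose it. The thesaurus entries below are invented.

```python
# Illustrative combination of expansion terms from multiple thesaurus
# types: a term proposed by more thesauri is ranked higher. The three
# toy thesauri and their entries are invented for this sketch.

def expand_query(query_terms, thesauri, top_n=3):
    """Score each candidate by how many thesauri propose it."""
    scores = {}
    for thesaurus in thesauri:
        for term in query_terms:
            for related in thesaurus.get(term, ()):
                scores[related] = scores.get(related, 0) + 1
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]

handcrafted = {"car": ["automobile", "vehicle"]}
cooccurrence = {"car": ["driver", "vehicle"]}
syntactic = {"car": ["vehicle", "engine"]}
print(expand_query(["car"], [handcrafted, cooccurrence, syntactic]))
# -> ['vehicle', 'automobile', 'driver']
```

Agreement across heterogeneous thesaurus types acts as a crude filter against the misleading expansion terms the abstract mentions, alongside the word sense disambiguation step.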
    Download PDF (2161K)
  • MASAKI MURATA, QING MA, KIYOTAKA UCHIMOTO, HIROMI OZAKU, MASAO UTIYAMA ...
    2000 Volume 7 Issue 2 Pages 141-160
    Published: April 10, 2000
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
Robertson's 2-Poisson model for information retrieval uses neither location information nor category information. We constructed a framework that incorporates location and category information into a 2-Poisson model and submitted two systems based on this framework to the IREX contest. Their precisions under the A-judgement measure were 0.4926 and 0.4827, the highest values among the 15 teams and 22 systems that participated in the contest. This paper describes our systems and comparative experiments in which various parameters are varied. These experiments confirmed the effectiveness of location and category information.
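A rough sense of how location information can enter a 2-Poisson-style weight: compute a BM25-like term weight (the standard approximation derived from the 2-Poisson model) and boost it when the term also appears in a privileged location such as the title. The formula and boost below are illustrative, not the authors' exact weighting.

```python
# Illustrative 2-Poisson-style term weight with a location factor
# (BM25-like form; the boost and parameters are NOT the paper's values).
import math

def term_weight(tf, doc_len, avg_doc_len, df, num_docs,
                in_title=False, k1=1.2, b=0.75, title_boost=1.5):
    """BM25-like weight, multiplied by a boost for title occurrences."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return norm_tf * idf * (title_boost if in_title else 1.0)

print(term_weight(tf=3, doc_len=100, avg_doc_len=120, df=50, num_docs=10_000))
```

Category information could enter the same way, e.g. as a factor conditioned on the news category of the document, which is the kind of parameter the comparative experiments vary.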
    Download PDF (2008K)
  • 2000 Volume 7 Issue 2 Pages 161
    Published: 2000
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (20K)