Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 10, Issue 1
Displaying 1-7 of 7 articles from this issue
  • [in Japanese]
    2003 Volume 10 Issue 1 Pages 1-2
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (240K)
  • CHIKASHI NOBATA, SATOSHI SEKINE, JUN'ICHI TSUJII
    2003 Volume 10 Issue 1 Pages 3-26
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    We propose indices that measure the difficulty of the named entity (NE) task from test corpora alone, based on expressions inside and outside the NEs. These indices are intended to estimate the difficulty of each task without actually running an NE system, and to be unbiased towards any specific system. We compare the values of the indices with systems' performance on Japanese documents. We also use the indices to discuss differences between NE classes and show useful clues that make NEs easier to recognize. (A schematic sketch of such an index follows this entry.)
    Download PDF (2107K)
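The concrete indices are not given in the abstract above, so the following is only a loosely related illustration: a hypothetical per-class "ambiguity" index measuring how often the words inside NEs of a class also occur outside any NE (a higher value suggesting the class is harder to delimit). The index definition, function name, and data format are all assumptions, not the paper's formulas.

```python
from collections import defaultdict

def ambiguity_index(tagged_tokens):
    """Hypothetical difficulty index per NE class.

    tagged_tokens: list of (word, ne_class) pairs, with ne_class None
    for tokens outside any named entity.
    """
    outside = {w for w, c in tagged_tokens if c is None}
    inside = defaultdict(list)
    for w, c in tagged_tokens:
        if c is not None:
            inside[c].append(w)
    # Fraction of NE-internal tokens that also appear outside NEs.
    return {c: sum(w in outside for w in ws) / len(ws)
            for c, ws in inside.items()}

corpus = [("Tokyo", "LOCATION"), ("is", None), ("large", None),
          ("Tokyo", None), ("Bank", "ORGANIZATION")]
print(ambiguity_index(corpus))  # {'LOCATION': 1.0, 'ORGANIZATION': 0.0}
```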
  • HIROSHI NAKAGAWA, HIROAKI YUMOTO, TATSUNORI MORI
    2003 Volume 10 Issue 1 Pages 27-45
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In this paper, we propose a new method for automatically recognizing domain-specific terms in a monolingual corpus. The majority of domain-specific terms are compound nouns, which are what we aim to extract. Our idea is based on single-noun statistics calculated over single-noun bigrams; namely, we focus on how many nouns adjoin the noun in question to form compound nouns. In addition, we combine this measure with the frequency of each compound noun and single noun, which we call the FLR method. We evaluate these methods experimentally on the NTCIR1 TMREC test collection. The results show that the FLR method performs best when fewer than 1,400 or more than 12,000 of the highest-ranked term candidates are taken into account. (A sketch of an FLR-style score follows this entry.)
    Download PDF (6325K)
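A minimal sketch of an FLR-style score, assuming the commonly cited formulation FLR(CN) = f(CN) * LR(CN), where LR(CN) = (prod over the nouns Ni of (FL(Ni)+1)(FR(Ni)+1))^(1/(2L)) and FL/FR count the distinct nouns that directly precede or follow Ni in single-noun bigrams. Variants differ on whether distinct-noun counts or bigram frequencies are used, so treat this as an illustration rather than the paper's exact method.

```python
from collections import defaultdict

def flr_scores(compounds, freq):
    """compounds: compound nouns, each a tuple of single nouns;
    freq: maps a compound to its standalone frequency f(CN)."""
    left, right = defaultdict(set), defaultdict(set)
    for cn in compounds:
        for a, b in zip(cn, cn[1:]):   # single-noun bigrams inside CN
            left[b].add(a)             # a adjoins b on the left
            right[a].add(b)            # b adjoins a on the right

    def lr(cn):
        prod = 1.0
        for n in cn:
            prod *= (len(left[n]) + 1) * (len(right[n]) + 1)
        return prod ** (1.0 / (2 * len(cn)))

    return {cn: freq.get(cn, 0) * lr(cn) for cn in set(compounds)}

comps = [("information", "retrieval"), ("information", "system"),
         ("retrieval", "system")]
print(flr_scores(comps, {("information", "retrieval"): 5}))
```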
  • HIROYUKI NAKAJIMA
    2003 Volume 10 Issue 1 Pages 47-61
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Relevance feedback is a method for achieving accurate information retrieval by having the user judge the relevance of sample documents that the system selects based on an initial query. The retrieval accuracy achieved by relevance feedback changes with the method used to select the samples for user judgement. Relevance sampling, a method often used to choose samples, asks users to label the documents classified as most likely to be relevant. Uncertainty sampling, which instead selects the samples whose classification is least certain, has been reported to be more effective than relevance sampling. However, both sampling methods may select several similar documents, so in both cases there is room to improve retrieval accuracy. In this paper, we propose ‘unfamiliar sampling’, a method that evaluates the distance between each pair of documents and removes a candidate from sampling if it is the nearest neighbor of any sample that has already been selected. This procedure keeps multiple similar documents out of the sample set and thereby improves retrieval accuracy; a sketch of the selection rule follows this entry. In applying relevance feedback to document retrieval, it is important to achieve high retrieval accuracy with a small number of sample documents. We therefore also propose ‘Rocchio-Boost’, which applies Rocchio feedback as a weak learner within AdaBoost, and show that it achieves high retrieval accuracy. Empirical results on the NPL test collection show that the proposed methods improve the average precision of retrieval by 6% over the original relevance sampling and Rocchio feedback.
    Download PDF (1485K)
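A minimal sketch of the ‘unfamiliar sampling’ selection rule, assuming documents come as vectors already ranked best-first (e.g. by relevance or uncertainty) and using Euclidean distance; both are assumptions, since the abstract fixes only the nearest-neighbor removal rule.

```python
import numpy as np

def unfamiliar_sampling(ranked_docs, k):
    """ranked_docs: document vectors in rank order (best first);
    k: number of samples to present for user judgement."""
    candidates = list(range(len(ranked_docs)))
    selected = []
    while candidates and len(selected) < k:
        s = candidates.pop(0)          # take the best remaining candidate
        selected.append(s)
        if candidates:
            # Remove the candidate that is the nearest neighbor of the
            # sample just selected, so near-duplicates are never shown.
            dists = [np.linalg.norm(ranked_docs[s] - ranked_docs[c])
                     for c in candidates]
            candidates.pop(int(np.argmin(dists)))
    return selected

docs = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.5, 0.5]])
print(unfamiliar_sampling(docs, 2))  # [0, 2]; doc 1 is dropped as a near-duplicate
```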
  • EIKO YAMAMOTO, YOSHIYUKI TAKEDA, KYOJI UMEMURA
    2003 Volume 10 Issue 1 Pages 63-80
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In this paper, we propose a similarity measure for information retrieval (IR) that is tolerant of morphological variation. When different people write about the same topic, their notation may vary even though the content is the same. This variation prevents a system from retrieving all the documents relevant to an input query. Humans can handle such variation, but computers usually cannot. Edit distance is a well-known measure that can cope with this variation; however, when we applied it to information retrieval, we found its precision poor. We therefore propose modifications that make this similarity measure suitable for information retrieval, and we show that the extension improves performance. We also compare the proposed measure with the similarity measures popular in many information retrieval systems. (A sketch of the baseline edit-distance similarity follows this entry.)
    Download PDF (2023K)
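A minimal sketch of the baseline the entry above starts from: Levenshtein edit distance turned into a length-normalized similarity. The normalization shown is one common choice, not necessarily the modification the paper proposes.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Map distance to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Japanese spelling variants of "server" still score high:
print(similarity("サーバ", "サーバー"))  # 0.75
```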
  • TSUTOMU HIRAO, HIDETO KAZAWA, HIDEKI ISOZAKI, EISAKU MAEDA, YUJI MATSUMOTO
    2003 Volume 10 Issue 1 Pages 81-108
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Due to the rapid growth of the Internet and the emergence of low-priced, large-capacity storage devices, the number of online documents is exploding. Automatic summarization is the key to handling this situation: the cost of manual work demands that we be able to summarize a document set related to a certain event. This paper proposes a method for extracting important sentences from document sets, based on Support Vector Machines, a technology attracting attention in the field of natural language processing. We conducted experiments using three document sets formed from twelve events reported in the MAINICHI newspaper of 1999 and manually processed by newspaper editors. Tests on this corpus show that our method outperforms both the Lead-based method and the TF-IDF method. Moreover, we clarify that reducing redundancy is not always effective for extracting important sentences from a set of multiple documents taken from a single source. (A sketch of the extraction step follows this entry.)
    Download PDF (2787K)
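A minimal sketch of the extraction step, assuming the common setup: train a binary SVM on sentence feature vectors (important vs. not) and rank unseen sentences by their decision value. The features and data below are toy placeholders, not the paper's feature set.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy training data: each row is a sentence feature vector (e.g. sentence
# position, length, tf-idf weight); labels mark editor-chosen sentences.
X_train = np.array([[0.0, 0.9, 0.8], [0.1, 0.7, 0.7],
                    [0.9, 0.2, 0.1], [0.8, 0.3, 0.2]])
y_train = np.array([1, 1, 0, 0])

clf = LinearSVC().fit(X_train, y_train)

# Rank new sentences by signed distance from the separating hyperplane
# and extract the top two as the summary.
X_test = np.array([[0.05, 0.8, 0.75], [0.7, 0.25, 0.2], [0.2, 0.6, 0.5]])
scores = clf.decision_function(X_test)
print(np.argsort(-scores)[:2])
```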
  • TATSUMI YOSHIDA, KIYONORI OHTAKE, KAZUHIDE YAMAMOTO
    2003 Volume 10 Issue 1 Pages 109-131
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    We report the performance of currently and publicly available Chinese analyzers and resources. We use YamCha, a tool based on Support Vector Machines, with the Penn Chinese Treebank as the language resource. Combining the two, we measure the performance of Chinese analysis, i.e., word segmentation, part-of-speech tagging, and base-phrase chunking. In the word segmentation and part-of-speech tagging experiments, we also report the performance of MOZ, a publicly available statistical morphological analyzer. We found that morphological analysis with YamCha reaches an accuracy of around 88%, over 4% higher than that of MOZ, although it is computationally very expensive. We also found that the accuracy of base-phrase chunking is approximately 93%; a sketch of how such chunk scores are computed follows this entry.
    Download PDF (2118K)
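A minimal sketch of how base-phrase chunking is usually scored: spans are decoded from IOB2 tag sequences and compared as sets. The IOB2 convention is an assumption here; YamCha supports several chunk encodings, and the paper may report a different metric than F-measure.

```python
def chunks(tags):
    """Extract (start, end, type) spans from an IOB2 tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last chunk
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def chunk_f1(gold, pred):
    g, p = chunks(gold), chunks(pred)
    correct = len(g & p)
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(chunk_f1(["B-NP", "I-NP", "O", "B-VP"],
               ["B-NP", "I-NP", "O", "O"]))  # 0.666...
```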