Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 10, Issue 1
Displaying 1-7 of 7 articles from this issue
  • [in Japanese]
    2003 Volume 10 Issue 1 Pages 1-2
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (240K)
  • CHIKASHI NOBATA, SATOSHI SEKINE, JUN'ICHI TSUJII
    2003 Volume 10 Issue 1 Pages 3-26
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    We propose indices that measure the difficulty of the named entity (NE) task from test corpora alone, based on expressions inside and outside the NEs. These indices are intended to estimate the difficulty of each task without actually running an NE system, and to be unbiased towards any specific system. We compare the values of the indices with systems' performance on Japanese documents. We also use the indices to discuss differences between NE classes and show useful clues that make NEs easier to recognize. (A schematic sketch of such an index follows this entry.)
    Download PDF (2107K)
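The concrete indices are not given in the abstract above, so the following is only a loosely related illustration: a hypothetical per-class "ambiguity" index measuring how often the words inside NEs of a class also occur outside any NE (a higher value suggesting the class is harder to delimit). The index definition, function name, and data format are all assumptions, not the paper's formulas.

```python
from collections import defaultdict

def ambiguity_index(tagged_tokens):
    """Hypothetical difficulty index per NE class.

    tagged_tokens: list of (word, ne_class) pairs, with ne_class None
    for tokens outside any named entity.
    """
    outside = {w for w, c in tagged_tokens if c is None}
    inside = defaultdict(list)
    for w, c in tagged_tokens:
        if c is not None:
            inside[c].append(w)
    # Fraction of NE-internal tokens that also appear outside NEs.
    return {c: sum(w in outside for w in ws) / len(ws)
            for c, ws in inside.items()}

corpus = [("Tokyo", "LOCATION"), ("is", None), ("large", None),
          ("Tokyo", None), ("Bank", "ORGANIZATION")]
print(ambiguity_index(corpus))  # {'LOCATION': 1.0, 'ORGANIZATION': 0.0}
```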
  • HIROSHI NAKAGAWA, HIROAKI YUMOTO, TATSUNORI MORI
    2003 Volume 10 Issue 1 Pages 27-45
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In this paper, we propose a new method for automatically recognizing domain-specific terms in a monolingual corpus. The majority of domain-specific terms are compound nouns, which are what we aim to extract. Our idea is based on single-noun statistics calculated over single-noun bigrams; namely, we focus on how many nouns adjoin the noun in question to form compound nouns. In addition, we combine this measure with the frequency of each compound noun and single noun, which we call the FLR method. We evaluate these methods experimentally on the NTCIR1 TMREC test collection. The results show that the FLR method performs best when fewer than 1,400 or more than 12,000 of the highest-ranked term candidates are taken into account. (A sketch of an FLR-style score follows this entry.)
    Download PDF (6325K)
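A minimal sketch of an FLR-style score, assuming the commonly cited formulation FLR(CN) = f(CN) * LR(CN), where LR(CN) = (prod over the nouns Ni of (FL(Ni)+1)(FR(Ni)+1))^(1/(2L)) and FL/FR count the distinct nouns that directly precede or follow Ni in single-noun bigrams. Variants differ on whether distinct-noun counts or bigram frequencies are used, so treat this as an illustration rather than the paper's exact method.

```python
from collections import defaultdict

def flr_scores(compounds, freq):
    """compounds: compound nouns, each a tuple of single nouns;
    freq: maps a compound to its standalone frequency f(CN)."""
    left, right = defaultdict(set), defaultdict(set)
    for cn in compounds:
        for a, b in zip(cn, cn[1:]):   # single-noun bigrams inside CN
            left[b].add(a)             # a adjoins b on the left
            right[a].add(b)            # b adjoins a on the right

    def lr(cn):
        prod = 1.0
        for n in cn:
            prod *= (len(left[n]) + 1) * (len(right[n]) + 1)
        return prod ** (1.0 / (2 * len(cn)))

    return {cn: freq.get(cn, 0) * lr(cn) for cn in set(compounds)}

comps = [("information", "retrieval"), ("information", "system"),
         ("retrieval", "system")]
print(flr_scores(comps, {("information", "retrieval"): 5}))
```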
  • HIROYUKI NAKAJIMA
    2003 Volume 10 Issue 1 Pages 47-61
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Relevance feedback is a method for achieving accurate information retrieval by having the user judge the relevance of sample documents that the system selects based on an initial query. The retrieval accuracy achieved by relevance feedback changes with the method used to select the samples for user judgement. Relevance sampling, a method often used to choose samples, asks users to label the documents classified as most likely to be relevant. Uncertainty sampling, which instead selects the samples whose classification is least certain, has been reported to be more effective than relevance sampling. However, both sampling methods may select several similar documents, so in both cases there is room to improve retrieval accuracy. In this paper, we propose ‘unfamiliar sampling’, a method that evaluates the distance between each pair of documents and removes a candidate from sampling if it is the nearest neighbor of any sample that has already been selected. This procedure keeps multiple similar documents out of the sample set and thereby improves retrieval accuracy; a sketch of the selection rule follows this entry. In applying relevance feedback to document retrieval, it is important to achieve high retrieval accuracy with a small number of sample documents. We therefore also propose ‘Rocchio-Boost’, which applies Rocchio feedback as a weak learner within AdaBoost, and show that it achieves high retrieval accuracy. Empirical results on the NPL test collection show that the proposed methods improve the average precision of retrieval by 6% over the original relevance sampling and Rocchio feedback.
    Download PDF (1485K)
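A minimal sketch of the ‘unfamiliar sampling’ selection rule, assuming documents come as vectors already ranked best-first (e.g. by relevance or uncertainty) and using Euclidean distance; both are assumptions, since the abstract fixes only the nearest-neighbor removal rule.

```python
import numpy as np

def unfamiliar_sampling(ranked_docs, k):
    """ranked_docs: document vectors in rank order (best first);
    k: number of samples to present for user judgement."""
    candidates = list(range(len(ranked_docs)))
    selected = []
    while candidates and len(selected) < k:
        s = candidates.pop(0)          # take the best remaining candidate
        selected.append(s)
        if candidates:
            # Remove the candidate that is the nearest neighbor of the
            # sample just selected, so near-duplicates are never shown.
            dists = [np.linalg.norm(ranked_docs[s] - ranked_docs[c])
                     for c in candidates]
            candidates.pop(int(np.argmin(dists)))
    return selected

docs = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.5, 0.5]])
print(unfamiliar_sampling(docs, 2))  # [0, 2]; doc 1 is dropped as a near-duplicate
```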
  • EIKO YAMAMOTO, YOSHIYUKI TAKEDA, KYOJI UMEMURA
    2003 Volume 10 Issue 1 Pages 63-80
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In this paper, we propose a similarity measure for information retrieval (IR) that is tolerant of morphological variation. When different people write about the same topic, their notation may vary even though the content is the same. This variation prevents a system from retrieving all the documents relevant to an input query. Humans can handle such variation, but computers usually cannot. Edit distance is a well-known measure that can cope with this variation; however, when we applied it to information retrieval, we found its precision poor. We therefore propose modifications that make this similarity measure suitable for information retrieval, and we show that the extension improves performance. We also compare the proposed measure with the similarity measures popular in many information retrieval systems. (A sketch of the baseline edit-distance similarity follows this entry.)
    Download PDF (2023K)
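A minimal sketch of the baseline the entry above starts from: Levenshtein edit distance turned into a length-normalized similarity. The normalization shown is one common choice, not necessarily the modification the paper proposes.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Map distance to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Japanese spelling variants of "server" still score high:
print(similarity("サーバ", "サーバー"))  # 0.75
```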
  • TSUTOMU HIRAO, HIDETO KAZAWA, HIDEKI ISOZAKI, EISAKU MAEDA, YUJI MATSUMOTO
    2003 Volume 10 Issue 1 Pages 81-108
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Due to the rapid growth of the Internet and the emergence of low-priced, large-capacity storage devices, the number of online documents is exploding. Automatic summarization is the key to handling this situation: the cost of manual work demands that we be able to summarize a document set related to a certain event. This paper proposes a method for extracting important sentences from document sets, based on Support Vector Machines, a technology attracting attention in the field of natural language processing. We conducted experiments using three document sets formed from twelve events reported in the MAINICHI newspaper of 1999 and manually processed by newspaper editors. Tests on this corpus show that our method outperforms both the Lead-based method and the TF-IDF method. Moreover, we clarify that reducing redundancy is not always effective for extracting important sentences from a set of multiple documents taken from a single source. (A sketch of the extraction step follows this entry.)
    Download PDF (2787K)
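A minimal sketch of the extraction step, assuming the common setup: train a binary SVM on sentence feature vectors (important vs. not) and rank unseen sentences by their decision value. The features and data below are toy placeholders, not the paper's feature set.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy training data: each row is a sentence feature vector (e.g. sentence
# position, length, tf-idf weight); labels mark editor-chosen sentences.
X_train = np.array([[0.0, 0.9, 0.8], [0.1, 0.7, 0.7],
                    [0.9, 0.2, 0.1], [0.8, 0.3, 0.2]])
y_train = np.array([1, 1, 0, 0])

clf = LinearSVC().fit(X_train, y_train)

# Rank new sentences by signed distance from the separating hyperplane
# and extract the top two as the summary.
X_test = np.array([[0.05, 0.8, 0.75], [0.7, 0.25, 0.2], [0.2, 0.6, 0.5]])
scores = clf.decision_function(X_test)
print(np.argsort(-scores)[:2])
```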
  • TATSUMI YOSHIDA, KIYONORI OHTAKE, KAZUHIDE YAMAMOTO
    2003 Volume 10 Issue 1 Pages 109-131
    Published: January 10, 2003
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    We report the performance of currently and publicly available Chinese analyzers and resources. We use YamCha, a tool based on Support Vector Machines, with the Penn Chinese Treebank as the language resource. Combining the two, we measure the performance of Chinese analysis, i.e., word segmentation, part-of-speech tagging, and base-phrase chunking. In the word segmentation and part-of-speech tagging experiments, we also report the performance of MOZ, a publicly available statistical morphological analyzer. We found that morphological analysis with YamCha reaches an accuracy of around 88%, over 4% higher than that of MOZ, although it is computationally very expensive. We also found that the accuracy of base-phrase chunking is approximately 93%; a sketch of how such chunk scores are computed follows this entry.
    Download PDF (2118K)
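A minimal sketch of how base-phrase chunking is usually scored: spans are decoded from IOB2 tag sequences and compared as sets. The IOB2 convention is an assumption here; YamCha supports several chunk encodings, and the paper may report a different metric than F-measure.

```python
def chunks(tags):
    """Extract (start, end, type) spans from an IOB2 tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last chunk
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def chunk_f1(gold, pred):
    g, p = chunks(gold), chunks(pred)
    correct = len(g & p)
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(chunk_f1(["B-NP", "I-NP", "O", "B-VP"],
               ["B-NP", "I-NP", "O", "O"]))  # 0.666...
```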