Journal of Natural Language Processing

Preface

[title in Japanese]

[in Japanese]

2009 Volume 16 Issue 2 Pages 2_1-2_2
Published: 2009
Released on J-STAGE: September 01, 2011

DOI https://doi.org/10.5715/jnlp.16.2_1

JOURNAL FREE ACCESS

Paper

Establishment of Corpus-based Cancer Specific Term Set and its Characteristics

Shin-ichi Nakagawa, Masao Utiyama, Makoto Misumi, Akira Shimazu, Yoshi ...

2009 Volume 16 Issue 2 Pages 2_3-2_44
Published: 2009
Released on J-STAGE: September 01, 2011

DOI https://doi.org/10.5715/jnlp.16.2_3

JOURNAL FREE ACCESS

To provide appropriate cancer information to patients, we built the Corpus-based Cancer Term Set as basic linguistic infrastructure for analyzing cancer-related content. Qualified medical doctors extracted cancer-specific terms word by word from the entire web content of the National Cancer Center, which served as the authoritative corpus. Of the more than 26,000 words extracted, 10,199 terms were finally collected as the Cancer Term Candidates (Cc). This term set covers 96.5–99.5% of 10 different kinds of cancer content, which is sufficient for analysis. Cc was then examined against selection standards by contrasting it with other word sets, such as general words, general medical words, and proper nouns. As a result, 93.7% of the terms in Cc were selected into the new word set “C.” Secondly, based on the relationship between general terms and cancer/medical terminology, as well as on the consistency of the glossary, selection criteria were proposed: T1 (cancer itself), T2 (terms directly related to cancer), T3 (terms related to both T1 and T2), and T4 (terms with unclear relations to cancer). When these criteria were applied to Cc, 93.7% of the terms met them; 690 words were removed and 9,509 were selected as the cancer word set C. Terms selected according to the criteria were used to build a test word set for doctors, so the selection criteria were evaluated indirectly. In the two cases where the word set was split into T1 and (T2, T3, T4), and into (T1, T2) and (T3, T4), the coefficient of contingency, κ, was 0.6. When the word set was split into T1, T2, and (T3, T4), κ was 0.5. These κ values were higher than in a separate test that posed the simple question “Cancer word or not?” Thus, the selection and classification of T1 and T2 terminology is plausible.
Furthermore, the words detected in several original cancer corpora were compared using HN (an automatic specific-word selection algorithm, Gen-Sen-Web) and C. The recall of HN against C was around 80%, but its precision was only around 60%. Thus, such automatic word selection methods are useful for evaluating the consistency of C, although these systems need to reduce their selection of words that should be ignored. It was therefore suggested that this method enables the creation of a low-cost, feasible cancer-specific term set.
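The recall and precision figures quoted in this abstract compare an automatically extracted word list with the reference set C. As a minimal sketch of how such figures are usually computed, assuming both lists are treated as plain sets of strings, the following uses invented toy terms; the function name `precision_recall` and the sample words are hypothetical and are not taken from the paper.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) of an extracted term set against a reference set."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)  # terms found in both sets
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Toy data invented for illustration only (not from the paper).
reference_c = {"carcinoma", "metastasis", "chemotherapy", "biopsy", "tumor"}
extracted_hn = {"carcinoma", "metastasis", "chemotherapy", "biopsy",
                "hospital", "patient", "page"}

precision, recall = precision_recall(extracted_hn, reference_c)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the extractor finds four of the five reference terms but also returns three unrelated words, which is the same kind of high-recall, lower-precision pattern the abstract describes.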

Language Processing Technology and Educational Material Development—Generating English Educational Material using a Database Software—

Kenichi Kamiya, Shosaku Tanaka, Kenji Kitao

2009 Volume 16 Issue 2 Pages 2_45-2_58
Published: 2009
Released on J-STAGE: September 01, 2011

DOI https://doi.org/10.5715/jnlp.16.2_45

JOURNAL FREE ACCESS

This article presents an example of developing educational material with database software linked to language processing technology. Teachers can download our software for free and create worksheets for phrase reading practice and e-learning materials based on cloze exercises. The software makes creating such learning materials very efficient and provides integrated functions that are almost impossible to carry out manually. Because all operations are performed through a graphical user interface (GUI), even computer novices can use the software easily.
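For readers unfamiliar with cloze exercises, the idea can be sketched in a few lines. The following is only an illustration that blanks out every n-th word of a passage; it is not the authors' database software, and the function name `make_cloze` and its parameters are hypothetical.

```python
import re


def make_cloze(text, every=5, blank="____"):
    """Replace every `every`-th word with a blank; return (cloze_text, answers)."""
    # Split into alternating word and non-word chunks so spacing
    # and punctuation are preserved exactly.
    tokens = re.findall(r"\w+|\W+", text)
    answers, out, word_index = [], [], 0
    for tok in tokens:
        if re.fullmatch(r"\w+", tok):
            word_index += 1
            if word_index % every == 0:
                answers.append(tok)  # keep the removed word as the answer key
                out.append(blank)
                continue
        out.append(tok)
    return "".join(out), answers


passage = "The quick brown fox jumps over the lazy dog today"
cloze_text, answers = make_cloze(passage, every=5)
print(cloze_text)
print(answers)
```

A fixed interval is the simplest deletion rule; a real tool might instead choose blanks by part of speech or word frequency.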

Supervised Synonym Acquisition Using Distributional Features and Syntactic Patterns

Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama

2009 Volume 16 Issue 2 Pages 2_59-2_83
Published: 2009
Released on J-STAGE: September 01, 2011

DOI https://doi.org/10.5715/jnlp.16.2_59

JOURNAL FREE ACCESS

Distributional similarity has been widely used to capture the semantic relatedness of words in many NLP tasks. However, parameters such as similarity measures must be manually tuned to make distributional similarity work effectively. To address this problem, we propose a novel approach to synonym identification based on supervised learning and distributional features, which correspond to the commonality of individual context types shared by word pairs. This approach also enables integration with pattern-based features. In our experiment, we built and compared eight synonym classifiers and showed a drastic performance increase of over 60% in F-1 measure compared to conventional similarity-based classification. The distributional features we propose classify synonyms better than the conventional common features, while the pattern-based features appear almost redundant.
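To make the notion of pair-level distributional features concrete, here is a minimal sketch that builds one feature per context type for a word pair. It assumes, purely for illustration, that the commonality of a context type is the smaller of the two words' weights for that context; the paper's actual feature definition may differ, and the names `pair_features`, `contexts`, and `context_vocab` as well as the toy weights are hypothetical.

```python
def pair_features(word_a, word_b, contexts, context_vocab):
    """Build one feature per context type for a word pair.

    Assumption (not from the paper): the commonality of a context type is
    the smaller of the two words' weights for that context, so a context
    seen with only one of the words contributes 0.
    """
    ctx_a = contexts.get(word_a, {})
    ctx_b = contexts.get(word_b, {})
    return [min(ctx_a.get(c, 0.0), ctx_b.get(c, 0.0)) for c in context_vocab]


# Toy context weights invented for illustration only.
contexts = {
    "car": {"obj:drive": 3.0, "mod:fast": 2.0, "obj:park": 1.0},
    "automobile": {"obj:drive": 2.0, "obj:park": 2.0},
    "banana": {"obj:eat": 4.0, "mod:ripe": 1.0},
}
context_vocab = sorted({c for ctx in contexts.values() for c in ctx})

print(context_vocab)
print(pair_features("car", "automobile", contexts, context_vocab))
print(pair_features("car", "banana", contexts, context_vocab))
```

Vectors like these, labeled as synonym or non-synonym pairs, could then be passed to any supervised classifier. This lets the learner weight each context type separately, whereas a single similarity score collapses them into one number.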
