Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 16, Issue 2
Preface
Paper
  • Shin-ichi Nakagawa, Masao Utiyama, Makoto Misumi, Akira Shimazu, Yoshi ...
    2009 Volume 16 Issue 2 Pages 2_3-2_44
    Published: 2009
    Released on J-STAGE: September 01, 2011
    JOURNAL FREE ACCESS
    To provide appropriate cancer information to patients, we built a corpus-based cancer term set as a basic linguistic infrastructure for analyzing cancer-related content. Cancer-specific terms were extracted word by word by qualified medical doctors from the entire web content of the National Cancer Center, which served as the authorized corpus. Of the more than 26,000 words extracted, 10,199 terms were collected as the Cancer Term Candidates (Cc). This term set covers 96.5–99.5% of ten different kinds of cancer content, which is sufficient for analysis. Cc was then examined against selection standards by contrasting it with other word sets, such as general words, general medical words, and proper nouns. Next, based on the relationship between general terms and cancer/medical terminology, as well as on the consistency of the glossary, selection criteria were proposed: T1 (cancer itself), T2 (terms directly related to cancer), T3 (terms related to both T1 and T2), and T4 (terms whose relation to cancer is unclear). When these criteria were applied to Cc, 93.7% of the terms met them; 690 words were removed, and the remaining 9,509 were selected as the cancer term set C. To evaluate the selection criteria indirectly, doctors were asked to classify the terms according to them. When the word set was split into T1 vs. (T2, T3, T4), and into (T1, T2) vs. (T3, T4), the contingency coefficient κ was 0.6; when it was split into T1, T2, and (T3, T4), κ was 0.5. These κ values were higher than in a different test that asked only the simple question "cancer word or not." Thus, the selection and classification of the T1 and T2 terminology is plausible.
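The inter-annotator agreement described above can be sketched with Cohen's κ, assuming the "coefficient of contingency κ" refers to Cohen's kappa; the toy doctor labels below are hypothetical, not the paper's data:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' labels (e.g., 1 = T1, 0 = T2/T3/T4)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(counts_a[l] * counts_b[l]
                   for l in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical annotations for a T1 vs. (T2, T3, T4) split.
doctor_1 = [1, 1, 0, 0, 1, 0, 1, 0]
doctor_2 = [1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohen_kappa(doctor_1, doctor_2), 2))  # 0.5 for these toy labels
```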
Furthermore, a comparative analysis of extracted terms was performed on several original cancer corpora using HN (an automatic domain-specific term extraction algorithm, Gen-Sen-Web) and C. The recall of HN against C was around 80%, while its precision against C was around 60%. These automatic term extraction methods are therefore useful for evaluating the consistency of C, although such systems still need to reduce the number of irrelevant words they select. Overall, the results suggest that this method enables the creation of a low-cost, feasible cancer-specific term set.
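The recall/precision comparison against C reduces to simple set operations; the toy term sets below are invented for illustration and are not from the paper:

```python
def recall_precision(extracted, gold):
    """Recall and precision of an automatically extracted term set
    against a manually curated gold set (here, the cancer term set C)."""
    extracted, gold = set(extracted), set(gold)
    hits = extracted & gold          # terms the extractor got right
    return len(hits) / len(gold), len(hits) / len(extracted)

# Hypothetical toy sets; the real C has 9,509 terms.
gold_c = {"leukemia", "metastasis", "chemotherapy", "biopsy", "remission"}
auto_hn = {"leukemia", "metastasis", "chemotherapy", "biopsy", "hospital", "doctor"}
r, p = recall_precision(auto_hn, gold_c)
print(r, round(p, 2))  # 0.8 0.67: high recall, lower precision, as in the paper
```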
  • Kenichi Kamiya, Shosaku Tanaka, Kenji Kitao
    2009 Volume 16 Issue 2 Pages 2_45-2_58
    Published: 2009
    Released on J-STAGE: September 01, 2011
    JOURNAL FREE ACCESS
    This article provides an example of developing educational material using database software linked with language processing technology. Teachers can download our software for free and create worksheets for studying phrase reading, as well as e-learning materials based on cloze exercises. The software makes creating such learning materials very efficient and provides integrated functions that would be almost impossible to reproduce manually. Since all operations are performed through a graphical user interface (GUI), even computer novices can use the software easily.
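A cloze-exercise generator of the kind described can be sketched as follows; `make_cloze` is a hypothetical helper for illustration, not the authors' software:

```python
def make_cloze(sentence, targets, blank="____"):
    """Replace each target word with a blank and return the worksheet
    text plus the answer key, as a cloze-exercise generator might."""
    words = sentence.split()
    answers = []
    for i, w in enumerate(words):
        core = w.strip(".,")         # ignore trailing punctuation
        if core in targets:
            answers.append(core)
            words[i] = w.replace(core, blank)
    return " ".join(words), answers

text, key = make_cloze("The software creates worksheets for phrase reading.",
                       {"software", "worksheets"})
print(text)  # The ____ creates ____ for phrase reading.
print(key)   # ['software', 'worksheets']
```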
  • Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama
    2009 Volume 16 Issue 2 Pages 2_59-2_83
    Published: 2009
    Released on J-STAGE: September 01, 2011
    JOURNAL FREE ACCESS
    Distributional similarity has been widely used to capture the semantic relatedness of words in many NLP tasks. However, parameters such as the similarity measure must be tuned manually for distributional similarity to work effectively. To address this problem, we propose a novel approach to synonym identification based on supervised learning and distributional features, which correspond to the commonality of the individual context types shared by word pairs. This approach also enables integration with pattern-based features. In our experiments, we built and compared eight synonym classifiers and showed a drastic performance increase of over 60% in F-1 measure compared to conventional similarity-based classification. The distributional features we propose classify synonyms better than conventional common features, while the pattern-based features turned out to be almost redundant.
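The distributional features described, one feature per context type shared by a word pair, can be sketched as follows; the windowed counting and the min-count commonality are illustrative assumptions, not the paper's exact definitions:

```python
from collections import Counter

def context_vector(word, corpus, window=1):
    """Count context words within +/-window of each occurrence of `word`."""
    ctx = Counter()
    for sent in corpus:
        toks = sent.split()
        for i, t in enumerate(toks):
            if t == word:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        ctx[toks[j]] += 1
    return ctx

def pair_features(w1, w2, corpus):
    """Distributional features for a word pair: one feature per shared
    context type (here, the smaller of the two counts), so a supervised
    classifier can weight individual contexts instead of relying on a
    single pooled similarity score."""
    v1, v2 = context_vector(w1, corpus), context_vector(w2, corpus)
    return {c: min(v1[c], v2[c]) for c in v1.keys() & v2.keys()}

corpus = ["the big dog barked", "the large dog barked", "a big truck passed"]
print(pair_features("big", "large", corpus))  # shared contexts: 'the', 'dog'
```

A feature dictionary like this can be fed to any sparse-feature classifier; the point of the approach is that each shared context becomes a separately weighted feature rather than being collapsed into one similarity number.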