To provide patients with appropriate cancer information, we built a Corpus-based Cancer Term Set as a basic linguistic infrastructure for analyzing cancer content. Qualified medical doctors extracted cancer-specific terms word by word, using the entire web content of the National Cancer Center as the authoritative corpus. Of the more than 26,000 words extracted, 10,199 terms were collected as the Cancer Terms Candidate (Cc). This term set covers 96.5–99.5% of 10 different kinds of cancer content, which is sufficient for analysis. Cc was then examined against selection standards that contrast cancer words with other word sets, such as general words, general medical words, and proper nouns. Based on the relationship between general terms and cancer/medical terminology, as well as on the consistency of the glossary, we proposed four selection criteria: T1 (cancer itself), T2 (terms directly related to cancer), T3 (terms related to both T1 and T2), and T4 (terms with an unclear relation to cancer). When these criteria were applied to Cc, 93.7% of the terms met them; 690 words were removed, and 9,509 were selected as the new cancer word set “C.” The terms selected according to the criteria were used to create a word set for doctors to test, so that the selection criteria were evaluated indirectly. In the two cases where the word set was split into T1 and (T2, T3, T4), or into (T1, T2) and (T3, T4), the coefficient of contingency, “κ,” was 0.6. Where the word set was split into T1, T2, and (T3, T4), κ was 0.5. These κ values were higher than those obtained in a separate test that asked only the simple question “Cancer word or not?” Thus, the selection and classification of T1 and T2 terminology is plausible.
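The effect of merging the T1 to T4 categories on κ can be illustrated with a minimal sketch. It assumes κ is Cohen's kappa between two raters, which the abstract does not state; the function names and the toy labels are hypothetical and are not the study's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items (assumed statistic)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def merge_categories(labels, grouping):
    """Map fine-grained labels (T1..T4) onto merged groups before scoring."""
    return [grouping[label] for label in labels]


# Hypothetical toy ratings, not taken from the study.
rater_a = ["T1", "T2", "T3", "T4", "T1", "T2", "T3", "T4"]
rater_b = ["T1", "T2", "T4", "T3", "T1", "T1", "T3", "T4"]

# Split "T1 versus (T2, T3, T4)", one of the splits described above.
t1_vs_rest = {"T1": "T1", "T2": "rest", "T3": "rest", "T4": "rest"}
kappa_t1_vs_rest = cohen_kappa(
    merge_categories(rater_a, t1_vs_rest),
    merge_categories(rater_b, t1_vs_rest),
)
```

Passing a different `grouping` dictionary, for example one that maps T1 and T2 to one group and T3 and T4 to another, scores the other splits in the same way.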
Furthermore, the words detected in several original cancer corpora by HN, an automatic specific-word selection algorithm (Gen-Sen-Web), were compared with C. The recall of HN against C was around 80%, whereas its precision was around 60%. Such automatic word-selection methods are therefore useful for evaluating the consistency of C, although these systems need to reduce their selection of words that should be ignored. These results suggest that this method makes it feasible to create a cancer-specific term set at low cost.
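The recall and precision figures above compare an automatically extracted word list against C. A minimal sketch of that comparison follows, assuming exact string matching between terms; the function name and the toy word lists are hypothetical and do not reproduce the study's data or the Gen-Sen-Web interface.

```python
def precision_recall(extracted_terms, reference_terms):
    """Precision and recall of an extracted term list against a reference set."""
    extracted = set(extracted_terms)
    reference = set(reference_terms)
    hits = extracted & reference  # terms found by the extractor that are in the reference
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy data standing in for C and for an extractor's output.
reference_c = {"tumor", "metastasis", "chemotherapy", "biopsy", "carcinoma"}
auto_extracted = {"tumor", "metastasis", "chemotherapy", "biopsy", "hospital", "patient"}

precision, recall = precision_recall(auto_extracted, reference_c)
```

In this toy case the extractor finds four of the five reference terms but also returns two words outside the reference set, which is the pattern of high recall and lower precision described above.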