As more medical documents are written in electronic form, technologies for automatically extracting terms from unstructured text have become increasingly important. In particular, extracting medical terms such as complaints and diagnoses from medical records is crucial because these terms serve as the basis for more application-oriented tasks, including medical case retrieval. For machine-learning-based term extraction, language resources such as lexica and corpora are effective for recognizing expressions that occur rarely or not at all in the training data. However, using lexica through simple word matching has limited effect because medical records contain compound words formed from various combinations of constituent terms. Therefore, this study presents term extraction systems that exploit language resources by acquiring and using beneficial terms and constituents from those resources. Our experimental results on the NTCIR-10 MedNLP test collection, which comprises medical history summaries, show increased precision and recall, indicating the effectiveness of the proposed system. Moreover, compared with existing systems developed for the NTCIR-10 MedNLP task, the proposed system achieved the best performance for complaint and diagnosis recognition, including the classification of extracted terms into modality attributes.