Natural language processing for medical applications (medical NLP) requires high-quality annotated corpora. In this study, we designed a versatile annotation scheme for clinical-medical text, together with a set of associated guidelines, that addresses two common subtasks in medical NLP: named entity recognition (NER) and relation extraction (RE). The annotation scheme integrates similar existing schemes and defines clinical-medical entities and relations that encode information useful for many medical NLP applications. The guidelines aim to make annotation more feasible by reducing the need for judgments that depend on medical knowledge, so that non-medical professionals can annotate the text. We adopted a recursive discussion procedure involving NLP researchers, medical professionals, and annotators to develop the scheme and guidelines from real annotation examples while increasing the corpus size. As a result, we obtained annotated corpora comprising 3,769 medical records and radiology reports of patients with serious lung diseases. To improve efficiency, preliminary NER and RE models were created after the first half was annotated; they were then applied to the second half, and their output was corrected manually. This two-step annotation also increased the inter-coder agreement. Finally, a joint NER + RE model trained on our corpora performed well enough to suggest that it could be applied in practice.