Purposes: The purposes of this study were to automatically extract the full forms of abbreviations by using Word2vec for terminology expansion and to determine the parameters that yield the highest accuracy. Methods: Approximately 300,000 English abstracts on “image diagnosis”, covering the period from January 1994 to December 2018, were collected from PubMed. As preprocessing, all uppercase letters in the collected data were converted to lowercase and symbols were deleted. Compound words were then recognized using RadLex, published by the Radiological Society of North America, and the abbreviation collection published by the Japanese Society of Radiological Technology. Next, distributed representations were generated with two algorithms, continuous bag-of-words (CBOW) and Skip-gram, while varying the number of iterations (3–85) and the dimensionality of the word vectors (50–1000). Abbreviations were input to the generated distributed representations, and the full forms with the highest cosine similarity to each abbreviation were identified. The correct answer rate was then calculated by comparing the predicted full forms with 214 gold-standard full forms extracted from the abbreviation collection. Results: The highest correct answer rate, 74.3%, was obtained with Skip-gram at 200 dimensions and 10 iterations. The rate was higher with Skip-gram than with CBOW under all tested conditions. Conclusion: Under the best settings, Word2vec extracted full forms with an accuracy of 74.3%, a result that can contribute to the consistency of terminology and the efficiency of terminology expansion.
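The evaluation step described in the Methods (look up the nearest term by cosine similarity, then compare it with a gold standard) can be illustrated with a minimal Python sketch. It does not train Word2vec and is not the authors' code: the function names, the three-dimensional vectors, and the three abbreviation pairs are hypothetical values invented for illustration, and the assumption that every other vocabulary entry is a candidate is ours.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_term(query, vectors):
    """Return the vocabulary entry, other than the query, closest to the query."""
    candidates = (term for term in vectors if term != query)
    return max(candidates, key=lambda term: cosine(vectors[query], vectors[term]))

def correct_answer_rate(vectors, gold):
    """Fraction of abbreviations whose nearest term equals the gold full form."""
    hits = sum(1 for abbr, full in gold.items()
               if nearest_term(abbr, vectors) == full)
    return hits / len(gold)

# Hypothetical toy vectors standing in for trained distributed representations.
# Compound words are joined with underscores, as compound recognition might do.
toy_vectors = {
    "ct": [0.9, 0.1, 0.0],
    "computed_tomography": [0.8, 0.2, 0.1],
    "mri": [0.1, 0.9, 0.1],
    "magnetic_resonance_imaging": [0.2, 0.8, 0.0],
    "us": [0.1, 0.1, 0.9],
    "ultrasonography": [0.7, 0.3, 0.0],  # deliberately misplaced to force a miss
}

# Hypothetical gold standard: abbreviation -> expected full form.
toy_gold = {
    "ct": "computed_tomography",
    "mri": "magnetic_resonance_imaging",
    "us": "ultrasonography",
}

rate = correct_answer_rate(toy_vectors, toy_gold)
print(f"correct answer rate: {rate:.1%}")
```

In this toy setup two of the three abbreviations resolve to their gold full forms. In a setting like the one in the abstract, the same comparison would be repeated for each combination of algorithm, iteration count, and vector dimensionality.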