Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Report
nwjc2vec: Word Embedding Data Constructed from NINJAL Web Japanese Corpus
Hiroyuki Shinnou, Masayuki Asahara, Kanako Komiya, Minoru Sasaki

2017 Volume 24 Issue 5 Pages 705-720

Abstract

We constructed word embedding data, named ‘nwjc2vec’, from the NINJAL Web Japanese Corpus using the word2vec software, and released it publicly. This report introduces nwjc2vec and presents the results of two types of experiments conducted to evaluate its quality. The first experiment is an evaluation based on word similarity: using a word similarity dataset, we calculate Spearman’s rank correlation coefficient. The second experiment is a task-based evaluation, in which the tasks are word sense disambiguation (WSD) and language model construction using a recurrent neural network (RNN). The results obtained with nwjc2vec were compared with those obtained with word embeddings constructed from seven years of newspaper article data. The results show nwjc2vec to be of high quality.
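
The word-similarity evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how model scores are computed, so the use of cosine similarity between word vectors is an assumption here. The toy vectors, the toy human ratings, and all function names are hypothetical and do not come from nwjc2vec or from the dataset used in the report.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, gold):
    """Correlate model similarities with human ratings, skipping unknown words."""
    model_scores, human_scores = [], []
    for w1, w2, rating in gold:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)


# Hypothetical 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "bus": [0.0, 0.8, 0.3],
}

# Hypothetical human similarity ratings: (word1, word2, rating).
toy_gold = [
    ("cat", "dog", 9.0),
    ("car", "bus", 8.5),
    ("cat", "car", 2.0),
    ("dog", "bus", 1.5),
]

rho = evaluate(toy_vectors, toy_gold)
print("Spearman rank correlation:", rho)
```

A higher coefficient means the embedding ranks word pairs more like the human annotators do. To compare two embedding sets in this way, `evaluate` would be run on each with the same gold list.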

© 2017 The Association for Natural Language Processing