Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 24, Issue 5
Displaying 1-5 of 5 articles from this issue
Preface
Paper
  • Suzushi Tomori, Takashi Ninomiya, Shinsuke Mori
    2017 Volume 24 Issue 5 Pages 655-668
    Published: December 15, 2017
    Released on J-STAGE: March 15, 2018
    JOURNAL FREE ACCESS

In this paper, we propose a method that utilizes real-world data to improve named entity recognition (NER) for a particular domain. Our method integrates a stacked auto-encoder (SAE) and a text-based deep neural network for NER. We first train the SAE on real-world data, and then train the entire deep neural network on sentences that are annotated with named entities (NEs) and accompanied by real-world information. In our experiments, we chose Japanese chess as our subject. The dataset consists of pairs of a game state and commentary sentences about it, annotated with game-specific NE tags. We conducted NER experiments and verified that referring to real-world data improves NER accuracy.
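    The two-stage training the abstract describes can be sketched as follows. This is a toy illustration, not the authors' architecture: the dimensions, the single-layer auto-encoder, and the random "game state" vectors are all assumptions made for the sketch.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical dimensions: a game state as a 64-dim feature vector,
    # compressed by the auto-encoder to a 16-dim code.
    D_IN, D_CODE = 64, 16
    W_enc = rng.normal(0, 0.1, (D_IN, D_CODE))
    W_dec = rng.normal(0, 0.1, (D_CODE, D_IN))

    def encode(x):
        return np.tanh(x @ W_enc)

    # Stage 1: pre-train the auto-encoder on unannotated game states
    # (plain gradient descent on the squared reconstruction error).
    states = rng.normal(size=(256, D_IN))
    lr = 0.01
    for _ in range(50):
        h = encode(states)
        recon = h @ W_dec
        err = recon - states                   # d(loss)/d(recon)
        grad_dec = h.T @ err / len(states)
        grad_h = err @ W_dec.T * (1 - h ** 2)  # back through tanh
        grad_enc = states.T @ grad_h / len(states)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc

    # Stage 2: for NER, concatenate the state code with the text
    # features of a token and feed the result to the tagging network.
    word_vec = rng.normal(size=(100,))         # stand-in word embedding
    features = np.concatenate([word_vec, encode(states[0])])
    ```

    In the paper, the pretrained encoder and the text network are then trained jointly on the NE-annotated commentary; the point of the sketch is only the pretrain-then-concatenate flow.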

    Download PDF (598K)
  • Fei Cheng, Kevin Duh, Yuji Matsumoto
    2017 Volume 24 Issue 5 Pages 669-686
    Published: December 15, 2017
    Released on J-STAGE: March 15, 2018
    JOURNAL FREE ACCESS

    One of the crucial problems facing current Chinese natural language processing (NLP) is the ambiguity of word boundaries, which raises many further issues, such as differing word segmentation standards and the prevalence of out-of-vocabulary (OOV) words. We assume that such issues can be better handled if a consistent segmentation level is created across multiple corpora. In this paper, we propose a simple strategy to transform two different Chinese word segmentation (CWS) corpora into a new, consistent segmentation level, which enables easy extension of the training data size. The extended data is verified to be highly consistent by 10-fold cross-validation. In addition, we use a synthetic word parser to analyze the internal structure of the words in the extended training data and convert the data into a more fine-grained standard. We then use two-stage Conditional Random Fields (CRFs) to perform fine-grained segmentation and chunk the segments back to the original Peking University (PKU) or Microsoft Research (MSR) standard. Owing to the extension of the training data and the reduction of the OOV rate at the new fine-grained level, the proposed system achieves state-of-the-art segmentation recall and F-scores on the PKU and MSR corpora.
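    The "chunk the segments back" step can be illustrated with a minimal sketch: given a fine-grained segmentation and B/I chunk tags (as a second-stage tagger might output), merge the pieces back into the coarser standard. The function, tokens, and tags below are hypothetical, not the paper's actual pipeline.

    ```python
    def chunk_back(fine_tokens, tags):
        """Merge fine-grained segments into coarser words using chunk
        tags: "B" starts a coarse word, "I" continues the current one."""
        words = []
        for tok, tag in zip(fine_tokens, tags):
            if tag == "B" or not words:
                words.append(tok)
            else:  # "I": attach to the current coarse word
                words[-1] += tok
        return words

    # Hypothetical example: 北京/大学 is fine-grained, while the
    # PKU-style standard treats 北京大学 as one word.
    coarse = chunk_back(["我", "爱", "北京", "大学"], ["B", "B", "B", "I"])
    ```

    The first-stage CRF segments at the fine-grained level (where the OOV rate is lower); the second stage only has to decide which adjacent segments belong to one word in the target standard.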

    Download PDF (587K)
  • Ryohei Sasano, Manabu Okumura
    2017 Volume 24 Issue 5 Pages 687-703
    Published: December 15, 2017
    Released on J-STAGE: March 15, 2018
    JOURNAL FREE ACCESS

    Several studies have investigated the canonical word order of Japanese double object constructions. However, most of these studies rely either on manual analyses or on measurements of human characteristics, such as brain activity or reading times, for each example. Thus, although these analyses are reliable for the examples they focus on, their findings cannot be generalized to other examples. In contrast, trends in actual usage can be collected automatically from a large corpus. In this study, we therefore assume that there is a relation between the canonical word order and the proportion of each word order in a large corpus, and present a corpus-based analysis of the canonical word order of Japanese double object constructions. Our analysis is based on a very large corpus comprising more than 10 billion unique sentences and suggests that the canonical word order of such constructions varies from verb to verb. Moreover, it suggests that an argument whose grammatical case is infrequently omitted with a given verb tends to be placed near the verb, and that there is little relation between the canonical word order and the verb type (show-type vs. pass-type). The dative-accusative order is preferred more when the semantic role of the dative argument is an animate Possessor than when it is an inanimate Goal. Furthermore, an argument that frequently co-occurs with the verb tends to be placed near the verb.

    Download PDF (582K)
Report
  • Hiroyuki Shinnou, Masayuki Asahara, Kanako Komiya, Minoru Sasaki
    2017 Volume 24 Issue 5 Pages 705-720
    Published: December 15, 2017
    Released on J-STAGE: March 15, 2018
    JOURNAL FREE ACCESS

    We constructed word embedding data (named ‘nwjc2vec’) using the NINJAL Web Japanese Corpus and the word2vec software, and released it publicly. In this report, we introduce nwjc2vec and present the results of two experiments conducted to evaluate its quality. The first experiment is an evaluation based on word similarity: using a word similarity dataset, we calculate Spearman’s rank correlation coefficient between human similarity judgments and the similarities given by nwjc2vec. The second experiment is a task-based evaluation: we consider word sense disambiguation (WSD) and language model construction using a recurrent neural network (RNN). The results obtained using nwjc2vec were compared with those obtained using word embeddings constructed from seven years of newspaper articles. The results show that nwjc2vec is of high quality.
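    The word-similarity evaluation boils down to Spearman's rank correlation between human ratings and the embedding's similarity scores. A minimal, dependency-free sketch (the rating and score lists below are hypothetical, not the dataset used in the report):

    ```python
    def rankdata(xs):
        """Average 1-based ranks, with ties sharing their mean rank."""
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        ranks = [0.0] * len(xs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        return ranks

    def spearman(a, b):
        """Spearman's rho = Pearson correlation of the ranks."""
        ra, rb = rankdata(a), rankdata(b)
        n = len(ra)
        ma, mb = sum(ra) / n, sum(rb) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
        va = sum((x - ma) ** 2 for x in ra) ** 0.5
        vb = sum((y - mb) ** 2 for y in rb) ** 0.5
        return cov / (va * vb)

    # Hypothetical word-pair data: human ratings vs. cosine similarities.
    human = [9.1, 7.4, 6.0, 3.2, 1.5]
    model = [0.82, 0.70, 0.74, 0.30, 0.12]
    rho = spearman(human, model)  # -> 0.9
    ```

    A rho close to 1 means the embedding ranks word pairs in nearly the same order as human judges do; the absolute similarity values are irrelevant.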

    Download PDF (451K)