Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 24, Issue 5
Displaying 1-5 of 5 articles from this issue
Preface
Paper
  • Suzushi Tomori, Takashi Ninomiya, Shinsuke Mori
    2017 Volume 24 Issue 5 Pages 655-668
    Published: December 15, 2017
    Released on J-STAGE: March 15, 2018
    JOURNAL FREE ACCESS

In this paper, we propose a method that utilizes real-world data to improve named entity recognition (NER) for a particular domain. Our method integrates a stacked auto-encoder (SAE) and a text-based deep neural network for NER. We first train the SAE on real-world data, and then train the entire deep neural network on sentences that are annotated with named entities (NEs) and accompanied by real-world information. In our experiments, we chose Japanese chess as our subject. The dataset consists of pairs of a game state and commentary sentences about it, annotated with game-specific NE tags. We conducted NER experiments and verified that referring to real-world data improves NER accuracy.
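    The two-stage training the abstract describes can be sketched as follows. This is a toy illustration, not the authors' architecture: the dimensions, the single-layer auto-encoder, and the random "game state" vectors are all assumptions made for the sketch.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical dimensions: a game state as a 64-dim feature vector,
    # compressed by the auto-encoder to a 16-dim code.
    D_IN, D_CODE = 64, 16
    W_enc = rng.normal(0, 0.1, (D_IN, D_CODE))
    W_dec = rng.normal(0, 0.1, (D_CODE, D_IN))

    def encode(x):
        return np.tanh(x @ W_enc)

    # Stage 1: pre-train the auto-encoder on unannotated game states
    # (plain gradient descent on the squared reconstruction error).
    states = rng.normal(size=(256, D_IN))
    lr = 0.01
    for _ in range(50):
        h = encode(states)
        recon = h @ W_dec
        err = recon - states                   # d(loss)/d(recon)
        grad_dec = h.T @ err / len(states)
        grad_h = err @ W_dec.T * (1 - h ** 2)  # back through tanh
        grad_enc = states.T @ grad_h / len(states)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc

    # Stage 2: for NER, concatenate the state code with the text
    # features of a token and feed the result to the tagging network.
    word_vec = rng.normal(size=(100,))         # stand-in word embedding
    features = np.concatenate([word_vec, encode(states[0])])
    ```

    In the paper, the pretrained encoder and the text network are then trained jointly on the NE-annotated commentary; the point of the sketch is only the pretrain-then-concatenate flow.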

    Download PDF (598K)
  • Fei Cheng, Kevin Duh, Yuji Matsumoto
    2017 Volume 24 Issue 5 Pages 669-686
    Published: December 15, 2017
    Released on J-STAGE: March 15, 2018
    JOURNAL FREE ACCESS

    One of the crucial problems facing current Chinese natural language processing (NLP) is the ambiguity of word boundaries, which raises many further issues, such as differing word segmentation standards and the prevalence of out-of-vocabulary (OOV) words. We assume that such issues can be better handled if a consistent segmentation level is created across multiple corpora. In this paper, we propose a simple strategy to transform two different Chinese word segmentation (CWS) corpora into a new, consistent segmentation level, which enables easy extension of the training data size. The extended data is verified to be highly consistent by 10-fold cross-validation. In addition, we use a synthetic word parser to analyze the internal structure of the words in the extended training data and convert the data into a more fine-grained standard. We then use two-stage Conditional Random Fields (CRFs) to perform fine-grained segmentation and chunk the segments back to the original Peking University (PKU) or Microsoft Research (MSR) standard. Owing to the extension of the training data and the reduction of the OOV rate at the new fine-grained level, the proposed system achieves state-of-the-art segmentation recall and F-scores on the PKU and MSR corpora.
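    The "chunk the segments back" step can be illustrated with a minimal sketch: given a fine-grained segmentation and B/I chunk tags (as a second-stage tagger might output), merge the pieces back into the coarser standard. The function, tokens, and tags below are hypothetical, not the paper's actual pipeline.

    ```python
    def chunk_back(fine_tokens, tags):
        """Merge fine-grained segments into coarser words using chunk
        tags: "B" starts a coarse word, "I" continues the current one."""
        words = []
        for tok, tag in zip(fine_tokens, tags):
            if tag == "B" or not words:
                words.append(tok)
            else:  # "I": attach to the current coarse word
                words[-1] += tok
        return words

    # Hypothetical example: 北京/大学 is fine-grained, while the
    # PKU-style standard treats 北京大学 as one word.
    coarse = chunk_back(["我", "爱", "北京", "大学"], ["B", "B", "B", "I"])
    ```

    The first-stage CRF segments at the fine-grained level (where the OOV rate is lower); the second stage only has to decide which adjacent segments belong to one word in the target standard.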

    Download PDF (587K)
  • Ryohei Sasano, Manabu Okumura
    2017 Volume 24 Issue 5 Pages 687-703
    Published: December 15, 2017
    Released on J-STAGE: March 15, 2018
    JOURNAL FREE ACCESS

    Several studies have investigated the canonical word order of Japanese double object constructions. However, most of these studies rely either on manual analyses or on measurements of human characteristics, such as brain activity or reading times, for each example. Thus, although these analyses are reliable for the examples they focus on, their findings cannot be generalized to other examples. In contrast, trends in actual usage can be collected automatically from a large corpus. In this study, we therefore assume that there is a relation between the canonical word order and the proportion of each word order in a large corpus, and present a corpus-based analysis of the canonical word order of Japanese double object constructions. Our analysis is based on a very large corpus comprising more than 10 billion unique sentences and suggests that the canonical word order of such constructions varies from verb to verb. Moreover, it suggests that an argument whose grammatical case is infrequently omitted with a given verb tends to be placed near the verb, and that there is little relation between the canonical word order and the verb type (show-type vs. pass-type). The dative-accusative order is preferred more when the semantic role of the dative argument is an animate Possessor than when it is an inanimate Goal. Furthermore, an argument that frequently co-occurs with the verb tends to be placed near the verb.

    Download PDF (582K)
Report
  • Hiroyuki Shinnou, Masayuki Asahara, Kanako Komiya, Minoru Sasaki
    2017 Volume 24 Issue 5 Pages 705-720
    Published: December 15, 2017
    Released on J-STAGE: March 15, 2018
    JOURNAL FREE ACCESS

    We constructed word embedding data (named ‘nwjc2vec’) using the NINJAL Web Japanese Corpus and the word2vec software, and released it publicly. In this report, we introduce nwjc2vec and present the results of two experiments conducted to evaluate its quality. The first experiment is an evaluation based on word similarity: using a word similarity dataset, we calculate Spearman’s rank correlation coefficient between human similarity judgments and the similarities given by nwjc2vec. The second experiment is a task-based evaluation: we consider word sense disambiguation (WSD) and language model construction using a recurrent neural network (RNN). The results obtained using nwjc2vec were compared with those obtained using word embeddings constructed from seven years of newspaper articles. The results show that nwjc2vec is of high quality.
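    The word-similarity evaluation boils down to Spearman's rank correlation between human ratings and the embedding's similarity scores. A minimal, dependency-free sketch (the rating and score lists below are hypothetical, not the dataset used in the report):

    ```python
    def rankdata(xs):
        """Average 1-based ranks, with ties sharing their mean rank."""
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        ranks = [0.0] * len(xs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        return ranks

    def spearman(a, b):
        """Spearman's rho = Pearson correlation of the ranks."""
        ra, rb = rankdata(a), rankdata(b)
        n = len(ra)
        ma, mb = sum(ra) / n, sum(rb) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
        va = sum((x - ma) ** 2 for x in ra) ** 0.5
        vb = sum((y - mb) ** 2 for y in rb) ** 0.5
        return cov / (va * vb)

    # Hypothetical word-pair data: human ratings vs. cosine similarities.
    human = [9.1, 7.4, 6.0, 3.2, 1.5]
    model = [0.82, 0.70, 0.74, 0.30, 0.12]
    rho = spearman(human, model)  # -> 0.9
    ```

    A rho close to 1 means the embedding ranks word pairs in nearly the same order as human judges do; the absolute similarity values are irrelevant.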

    Download PDF (451K)