Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 27, Issue 1
Preface
General Paper
  • Takuya Hara, Takuya Matuzaki, Hikaru Yokono, Satoshi Sato
    2020 Volume 27 Issue 1 Pages 3-30
    Published: March 15, 2020
    Released on J-STAGE: June 15, 2020
    JOURNAL FREE ACCESS

    This paper reports on an analysis of the effect of additional training of Japanese dependency parsers across multiple domains from a bird’s-eye view. Parsing errors were collected before and after additional training using target domain data. We conducted cluster analysis of the parsing errors represented as dense real vectors, which were obtained from the internal states of the parser. Through quantitative and qualitative analysis of the clusters, the types and numbers of the parsing errors across multiple target domains were investigated. Several hypotheses concerning the effect of additional training were developed on the basis of the cluster analysis and verified through statistical analysis of the corpus. The results suggest that the main effect of additional training was learning the difference in the distributions of the correct syntactic structures for similar word sequences in different domains.
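    The abstract's core step, clustering parsing errors that have been encoded as dense real vectors taken from the parser's internal states, can be illustrated with a minimal sketch. This is not the authors' code: the synthetic vectors, dimensionality, and number of clusters below are placeholders standing in for the real extracted error representations.

```python
# Minimal sketch (not the authors' code): clustering dense error vectors
# with k-means. In practice the vectors would be extracted from the
# parser's internal states; here synthetic vectors stand in for them.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 64))          # placeholder: (n_errors, dim)

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

# Inspect cluster sizes to see which error types dominate before/after
# additional training; the qualitative inspection itself is manual.
for cluster_id, size in enumerate(np.bincount(labels)):
    print(f"cluster {cluster_id}: {size} errors")
```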

  • Kosuke Oya, Kotaro Sakamoto, Hideyuki Shibuki, Tatsunori Mori
    2020 Volume 27 Issue 1 Pages 31-63
    Published: March 15, 2020
    Released on J-STAGE: June 15, 2020
    JOURNAL FREE ACCESS

    In this paper, we focus on the use of a world history glossary as a knowledge source for the automated generation of answers to essay-type questions. The questions are derived from the University of Tokyo’s entrance examinations on world history, and the answers are generated with a multi-document summarization methodology. In the automated answer generation, the glossary’s descriptions are used as parts of the answers; however, the entry words are often omitted from their own descriptions. To build complete sentences from entry words and their descriptions, we propose a method that finds the zero pronouns referring to the entry word inside a description and estimates their surface cases. This task differs from conventional zero anaphora resolution in two ways. First, the entry word is the only candidate for the antecedent, as opposed to having to select one antecedent from among several candidates. Second, the context information of the antecedent, which can be a useful clue in anaphora resolution, does not exist for entry words, because an entry word appears alone and is not embedded in a sentence. Evaluation results on a world history glossary revealed that the proposed method is more effective than an existing method that uses zero anaphora analysis with the Kurohashi-Nagao Parser. Furthermore, because we observed low accuracy when the entry word filled a low-frequency surface case, we attempted to generate pseudo training data from ordinary sentences in a textbook. The results demonstrated that introducing this data improves the F-measure of estimating the “o”-case and “ni”-case.
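    As a rough illustration of the surface-case estimation step, the sketch below treats the problem as text classification over the glossary description, predicting which case the omitted entry word fills. This is not the paper’s method; the classifier, the training examples, and the case labels are invented placeholders.

```python
# Minimal sketch (not the paper's method): surface-case estimation for the
# omitted entry word framed as text classification over the glossary
# description. The descriptions and case labels below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_descriptions = [
    "1648 年に締結された三十年戦争の講和条約。",   # toy glossary descriptions
    "ローマ帝国の初代皇帝が創始した政治体制。",
]
train_cases = ["ga", "o"]            # placeholder surface cases of the entry word

# Character n-grams avoid the need for a Japanese tokenizer in this toy sketch.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_descriptions, train_cases)

print(model.predict(["アテネの民主政を完成させた政治家。"]))
```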

System Paper
  • Yuiko Tsunomori, Ryuichiro Higashinaka, Takeshi Yoshimura, Yoshinori I ...
    2020 Volume 27 Issue 1 Pages 65-88
    Published: March 15, 2020
    Released on J-STAGE: June 15, 2020
    JOURNAL FREE ACCESS

    In our commercial chat-oriented dialogue system, we have been using an utterance database created from a massive number of predicate-argument structures extracted from the web for generating utterances. However, because the creation of this database involves several automated processes, the database often includes non-sentences (ungrammatical or uninterpretable sentences) and utterances with inappropriate topic information (called off-focus utterances). Additionally, utterances tend to be monotonous and uninformative because they are created from single predicate-argument structures. To resolve these problems, we propose a neural network-based method for filtering non-sentences and a co-occurrence-based method for filtering utterances inappropriate for their associated foci. To reduce monotony, we also propose a method for concatenating automatically generated utterances so that they can be longer and richer in content. Experimental results indicate that the non-sentence filter removes non-sentences with an accuracy of 95% and that the focus filter removes utterances inappropriate for their foci with high recall. We also examine the effectiveness of our filtering and concatenation methods through an experiment involving human participants. The results indicate that our methods significantly outperform a baseline in terms of understandability and that concatenating two utterances leads to higher familiarity and content richness while retaining understandability.
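    The co-occurrence-based focus filter can be illustrated with a small sketch that keeps a candidate utterance only when its focus word is sufficiently associated with the utterance’s content words in a reference corpus. This is not the system’s implementation; the counts, the threshold, and the use of pointwise mutual information (PMI) as the association score are assumptions.

```python
# Minimal sketch (not the system's implementation): a co-occurrence-based
# focus filter scored with pointwise mutual information (PMI).
# The corpus counts below are hypothetical placeholders.
import math

word_count = {"ラーメン": 1200, "スープ": 800, "決算": 950}          # unigram counts
pair_count = {("ラーメン", "スープ"): 300, ("ラーメン", "決算"): 2}  # co-occurrence counts
total = 1_000_000                                                    # corpus size

def pmi(focus, word):
    p_xy = pair_count.get((focus, word), 0) / total
    p_x = word_count.get(focus, 0) / total
    p_y = word_count.get(word, 0) / total
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))

def keep_utterance(focus, content_words, threshold=2.0):
    # Keep the utterance if any content word is sufficiently associated with
    # the focus; otherwise treat it as off-focus and filter it out.
    return any(pmi(focus, w) >= threshold for w in content_words)

print(keep_utterance("ラーメン", ["スープ"]))   # True  (on-focus)
print(keep_utterance("ラーメン", ["決算"]))     # False (off-focus)
```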

  • Arseny Tolmachev, Daisuke Kawahara, Sadao Kurohashi
    2020 Volume 27 Issue 1 Pages 89-132
    Published: March 15, 2020
    Released on J-STAGE: June 15, 2020
    JOURNAL FREE ACCESS

    An NLP tool is practical only when it is fast as well as accurate. We describe the architecture and the methods used to achieve a 250× analysis speed improvement in the Juman++ morphological analyzer, together with slight accuracy improvements. This information should be useful for implementers of high-performance NLP and machine-learning-based software.
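    The abstract does not detail the internal optimizations, so the only illustration offered here is a way to measure analysis throughput externally. The sketch assumes the `jumanpp` command-line binary is installed and on PATH; the repeated sample sentence is a placeholder workload.

```python
# Minimal throughput-measurement sketch, not the optimizations described in
# the paper. Assumes the `jumanpp` binary is installed and on PATH.
import subprocess
import time

sentences = ["すもももももももものうち"] * 1000   # placeholder workload
text = "\n".join(sentences) + "\n"

start = time.perf_counter()
subprocess.run(
    ["jumanpp"],               # reads raw sentences from stdin, one per line
    input=text,
    capture_output=True,
    text=True,
    check=True,
)
elapsed = time.perf_counter() - start
print(f"{len(sentences) / elapsed:.1f} sentences/sec")
```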

Report
  • Masayuki Asahara
    2020 Volume 27 Issue 1 Pages 133-150
    Published: March 15, 2020
    Released on J-STAGE: June 15, 2020
    JOURNAL FREE ACCESS

    This paper presents research on word familiarity rate estimation using the ‘Word List by Semantic Principles’. We collected rating information on 96,557 words in the ‘Word List by Semantic Principles’ via Yahoo! crowdsourcing. We asked 3,392 participants to rate, by introspection, the familiarity and register information of the words from the five perspectives of ‘KNOW’, ‘WRITE’, ‘READ’, ‘SPEAK’, and ‘LISTEN’; each word was rated by at least 16 participants. We used Bayesian linear mixed models to estimate the word familiarity rates. We also explored the ratings in relation to the semantic labels used in the ‘Word List by Semantic Principles’.
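    A Bayesian linear mixed model of the kind named in the abstract can be sketched with random effects for words and raters. This is not the paper’s exact model specification; the priors, the toy data, and the use of PyMC are assumptions.

```python
# Minimal sketch (not the paper's exact model): a Bayesian linear mixed model
# for word familiarity with random effects for words and raters, in PyMC.
# The tiny arrays below are illustrative placeholders.
import numpy as np
import pymc as pm

word_idx  = np.array([0, 0, 1, 1, 2, 2])    # which word each rating refers to
rater_idx = np.array([0, 1, 0, 1, 0, 1])    # which participant gave the rating
ratings   = np.array([4.0, 5.0, 2.0, 3.0, 1.0, 2.0])

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 5.0)                      # grand mean
    sigma_word = pm.HalfNormal("sigma_word", 2.0)
    sigma_rater = pm.HalfNormal("sigma_rater", 2.0)
    word_effect = pm.Normal("word_effect", 0.0, sigma_word, shape=3)
    rater_effect = pm.Normal("rater_effect", 0.0, sigma_rater, shape=2)
    sigma = pm.HalfNormal("sigma", 2.0)

    pm.Normal("obs",
              mu=mu + word_effect[word_idx] + rater_effect[rater_idx],
              sigma=sigma,
              observed=ratings)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Posterior means of the word effects (plus mu) give the estimated
# per-word familiarity rates.
print(idata.posterior["word_effect"].mean(dim=("chain", "draw")).values)
```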
