Journal of Natural Language Processing (自然言語処理)

Preface

Special Issue: “Collection of Best Annual Papers” Organized for the 20th Anniversary of the Association for Natural Language Processing

Makoto Nagao

2014, Volume 21, Issue 4, pp. 617-618
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.617

Free access


Paper

Using WFSTs for Efficient EM Learning of Probabilistic CFGs and Their Extensions

Yoshitaka Kameya, Takashi Mori, Taisuke Sato

2014, Volume 21, Issue 4, pp. 619-658
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.619

Free access

Probabilistic context-free grammars (PCFGs) are a widely known class of probabilistic language models. The Inside-Outside (I-O) algorithm is well known as an efficient EM algorithm tailored for PCFGs. Although the algorithm requires only inexpensive linguistic resources, its efficiency remains a problem. This paper presents an efficient method for training PCFG parameters in which the parser is separated from the EM algorithm, assuming that the underlying CFG is given. The new EM algorithm exploits the compactness of the well-formed substring tables (WFSTs) generated by the parser. Our proposal is general in that the input grammar need not be in Chomsky normal form (CNF), while it is equivalent to the I-O algorithm in the CNF case. In addition, we propose a polynomial-time EM algorithm for CFGs with context-sensitive probabilities, and report experimental results with the ATR dialogue corpus and a hand-crafted Japanese grammar.

Construction of Practical Japanese Parsing System Based on Lexical Functional Grammar

Hiroshi Masuichi, Tomoko Ohkuma

2014, Volume 21, Issue 4, pp. 659-677
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.659

Free access

This paper describes a Japanese parsing system with a linguistically fine-grained grammar based on Lexical-Functional Grammar (LFG). The system is the first Japanese LFG parser with over 97% coverage of real-world text. We evaluated the accuracy of the system by comparing it with standard Japanese dependency parsers. The LFG parser achieves dependency accuracy roughly equivalent to that of standard parsers. It also provides reasonably accurate case detection results.

Gradual Fertilization of Case Frames

Daisuke Kawahara, Sadao Kurohashi

2014, Volume 21, Issue 4, pp. 679-706
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.679

Free access

This article proposes an automatic method of gradually constructing case frames. First, a large raw corpus is parsed, and base case frames are constructed from reliable predicate-argument examples in the parsing results. Second, case analysis based on the base case frames is applied to the large corpus, and the case frames are upgraded by incorporating newly acquired information. Case frames are gradually fertilized in this way. We constructed case frames from 26 years of newspaper articles consisting of approximately 26 million sentences. The case frames are evaluated manually as well as through syntactic and case analyses. The results demonstrate the effectiveness of the constructed case frames.

Improvements of Katz K Mixture Model

Yinghui Xu, Kyoji Umemura

2014, Volume 21, Issue 4, pp. 707-732
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.707

Free access

The Katz K mixture is a simpler distribution that fits empirical word distributions about as well as a negative binomial. The basic assumption of the K mixture model is that the conditional probability of a repeat for a given word is determined by a constant decay factor that is independent of the number of occurrences that have already taken place. However, for content-bearing words with few occurrences so far, the probabilities of repeat occurrences are generally lower than the constant decay factor. To address this deficiency of the K mixture model, we conducted an in-depth exploration of the characteristics of the conditional probabilities of repetitions, of decay factors, and of their influence on modeling term distributions. The results of this study suggest that both ends of the distribution can be used to fit models. That is, not only can document frequencies be used when the instances of a word are few, but tail probabilities (the accumulation of document frequencies) can also be used. Both document frequencies for few instances of a word and tail probabilities for many instances are often relatively easy to estimate empirically. Therefore, we propose an effective approach for improving the K mixture model, in which the decay factor is a combination of two possible decay factors interpolated by a function of the number of instances of a word in a document. Results show that the proposed model yields statistically significantly better frequency estimates, especially for a word with two instances in a document. In addition, the advantages of this approach become more evident in two cases: modeling the term distribution of a frequently used content-bearing word, and modeling the term distribution of a corpus with a wide range of document lengths.

Preference Dependency Grammar and its Packed Shared Data Structure “Dependency Forest”

Hideki Hirakawa

2014, Volume 21, Issue 4, pp. 733-797
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.733

Free access

Preference dependency grammar (PDG) is a framework for integrating morphological, syntactic, and semantic analyses. PDG provides packed shared data structures that can efficiently encompass all possible interpretations at each level of sentence analysis, together with preference scores. Using these structures, PDG can calculate a globally optimized interpretation for the target sentence. This paper first gives an overview of the PDG framework by describing the base model of PDG, a sentence analysis model called the “multi-level packed shared data connection model.” It then describes the packed shared data structures adopted in PDG, e.g., headed parse forests and dependency forests. Finally, the completeness and soundness of the mapping between the parse forest and the dependency forest are shown.

A Fully-Lexicalized Probabilistic Model for Japanese Syntactic and Case Structure Analysis

Daisuke Kawahara, Sadao Kurohashi

2014, Volume 21, Issue 4, pp. 799-815
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.799

Free access

We present an integrated probabilistic model for Japanese syntactic and case structure analysis. Syntactic and case structures are simultaneously analyzed on the basis of wide-coverage case frames that are constructed from a huge raw corpus in an unsupervised manner. This model selects the syntactic and case structures that have the highest generative probability. We evaluate both syntactic structure and case structure. In particular, the experimental results for syntactic analysis on web sentences show that the proposed model significantly outperforms the known syntactic analyzers.

Construction of a Domain Dictionary for Fundamental Vocabulary and its Application to Automatic Blog Categorization Using Dynamically Estimated Domains of Unknown Words

Chikara Hashimoto, Sadao Kurohashi

2014, Volume 21, Issue 4, pp. 817-840
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.817

Free access

The semantic relations between words are essential for natural language understanding. Toward deeper natural language understanding, we semi-automatically constructed a domain dictionary that represents the domain relations between fundamental Japanese words. Our method does not require a document collection. As a task-based evaluation of the domain dictionary, we categorized blogs by assigning a domain to each word in a blog article and categorizing the article under its most dominant domain. Thus, we dynamically estimated the domains of unknown words (i.e., those not listed in the domain dictionary), and our blog categorization achieved an accuracy of 94.0% (564/600). Moreover, the domain estimation technique for unknown words achieved an accuracy of 76.6% (383/500).

Generalization of Semantic Roles in Automatic Semantic Role Labeling

Yuichiroh Matsubayashi, Naoaki Okazaki, Jun’ichi Tsujii

2014, Volume 21, Issue 4, pp. 841-875
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.841

Free access

Numerous studies have applied machine-learning approaches to semantic role labeling with the availability of corpora such as FrameNet and PropBank. These corpora define frame-specific semantic roles for each frame, which is problematic for a machine-learning approach because the corpora contain a number of infrequent roles that hinder efficient learning. This paper focuses on the generalization problem of semantic roles in a semantic role labeling task. We compare existing generalization criteria with our novel criteria, and clarify the characteristics of each criterion. We also show that using multiple generalization criteria in a single model improves the performance of semantic role classification. In experiments on FrameNet, we achieved a 19.16% error reduction in terms of total accuracy, and 7.42% in macro-averaged F1. On PropBank, we reduced errors by 24.07% in total accuracy, and by 26.39% in the evaluation for unseen verbs.

Study on Constants of Natural Language Texts

Daisuke Kimura, Kumiko Tanaka-Ishii

2014, Volume 21, Issue 4, pp. 877-895
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.877

Free access

This paper considers different measures that might become constants for any length of a given natural language text. Such measures indicate a potential for studying the complexity of natural language but have previously only been studied using relatively small English texts. In this study, we consider measures for texts in languages other than English, and for large-scale texts. Among the candidate measures, we consider Yule's K, Orlov's Z, and Golcher's VM, each of whose convergence has been previously argued empirically. Furthermore, we introduce entropy H, and a measure, r, related to the scale-free property of language. Our experiments show that both K and VM are convergent for texts in various languages, whereas the other measures are not.

Splitting Katakana Noun Compounds by Paraphrasing and Back-transliteration

Nobuhiro Kaji, Masaru Kitsuregawa

2014, Volume 21, Issue 4, pp. 897-920
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.897

Free access

Word boundaries within noun compounds in a number of languages, including Japanese, are not marked by white spaces. Thus, it is beneficial for various NLP applications to split such noun compounds. In the case of Japanese, noun compounds composed of katakana words are particularly difficult to split because katakana words are highly productive and are often out of vocabulary. Therefore, we propose using paraphrasing and back-transliteration of katakana noun compounds to split them. Experiments in which paraphrases and back-transliterations from unlabeled textual data were extracted and used to construct splitting models improved splitting accuracy with statistical significance.

Relevance Feedback using Surface and Latent Information in Texts

Jun Harashima, Sadao Kurohashi

2014, Volume 21, Issue 4, pp. 921-940
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.921

Free access

Most relevance feedback methods re-rank search results using only the information of surface words in texts. We present a method that uses not only the information of surface words but also that of latent words that are inferred from texts. We infer the latent word distribution in each document in the search results using latent Dirichlet allocation (LDA). When feedback is given, we also infer the latent word distribution in the feedback using LDA. We calculate the similarities between the user feedback and each document in the search results using both the surface and latent word distributions, and re-rank the search results on the basis of these similarities. Evaluation results show that when user feedback consisting of two documents (3,589 words) is given, the proposed method improves the initial search results by 27.6% in precision at 10 (P@10). Additionally, the results show that the proposed method can perform well even when only a small amount of user feedback is available. For example, an improvement of 5.3% in P@10 was achieved when the user feedback consisted of only 57 words.

Particle Error Correction from Small Error Data for Japanese Learners

Kenji Imamura, Kuniko Saito, Kugatsu Sadamitsu, Hitoshi Nishikawa

2014, Volume 21, Issue 4, pp. 941-963
Published: 2014/09/01
Released online: 2014/12/01

DOI: https://doi.org/10.5715/jnlp.21.941

Free access

This paper shows how to correct the grammatical errors of Japanese particles made by Japanese learners. Our method is based on discriminative sequence conversion, which converts one sequence of words into another and corrects particle errors by substitution, insertion, or deletion. However, it is difficult to collect large learners’ corpora. We solve this problem with a discriminative learning framework that uses the following two methods. First, language model probabilities obtained from large, raw text corpora are combined with n-gram binary features obtained from learners’ corpora. This method is applied to measure the accuracy of Japanese sentences. Second, automatically generated pseudo-error sentences are added to learners’ corpora to enrich the corpora directly. Furthermore, we apply domain adaptation, in which the pseudo-error sentences (the source domain) are adapted to the real error sentences (the target domain). Experiments show that the recall rate is improved using both language model probabilities and n-gram binary features. Stable improvement is achieved using pseudo-error sentences with domain adaptation.
