Probabilistic context-free grammars (PCFGs) are a widely known class of probabilistic language models. The Inside-Outside (I-O) algorithm is well known as an efficient EM algorithm tailored for PCFGs. Although the algorithm requires only inexpensive linguistic resources, its efficiency remains a problem. This paper presents an efficient method for training PCFG parameters in which the parser is separated from the EM algorithm, assuming that the underlying CFG is given. A new EM algorithm exploits the compactness of well-formed substring tables (WFSTs) generated by the parser. Our proposal is general in that the input grammar need not be in Chomsky normal form (CNF), while it is equivalent to the I-O algorithm in the CNF case. In addition, we propose a polynomial-time EM algorithm for CFGs with context-sensitive probabilities, and report experimental results with the ATR dialogue corpus and a hand-crafted Japanese grammar.
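For orientation, the inside pass of the classical I-O algorithm for a CNF grammar can be sketched as follows. This is a minimal illustration of the baseline only, not the WFST-based method proposed in the paper; the function name, the dictionary-based grammar encoding, and the toy grammar in the usage note are all assumptions made for this example.

```python
from collections import defaultdict


def inside_probability(words, lexical, binary, start="S"):
    """Inside probability of a sentence under a CNF PCFG.

    lexical: {(A, word): prob} for rules A -> word   (hypothetical encoding)
    binary:  {(A, B, C): prob} for rules A -> B C    (hypothetical encoding)
    """
    n = len(words)
    # beta[(i, j, A)] = probability that A derives words[i:j]
    beta = defaultdict(float)

    # Spans of length 1 come from lexical rules.
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                beta[(i, i + 1, a)] += p

    # Longer spans: sum over every split point and every binary rule.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (a, b, c), p in binary.items():
                    left = beta.get((i, k, b), 0.0)
                    right = beta.get((k, j, c), 0.0)
                    if left and right:
                        beta[(i, j, a)] += p * left * right

    return beta.get((0, n, start), 0.0)
```

With the toy grammar S -> S S (0.4) and S -> a (0.6), the call `inside_probability(["a", "a", "a"], {("S", "a"): 0.6}, {("S", "S", "S"): 0.4})` sums the probabilities of the two possible bracketings. The loop over all rules at every split point is the cost that a parser-produced table of only the useful constituents would avoid.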
This paper describes a Japanese parsing system with a linguistically fine-grained grammar based on Lexical-Functional Grammar (LFG). The system is the first Japanese LFG parser with over 97% coverage of real-world text. We evaluated the accuracy of the system by comparing it with standard Japanese dependency parsers. The LFG parser achieves dependency accuracy roughly equivalent to that of standard parsers. It also provides reasonably accurate results for case detection.
This article proposes an automatic method of gradually constructing case frames. First, a large raw corpus is parsed, and base case frames are constructed from reliable predicate-argument examples in the parsing results. Second, case analysis based on the base case frames is applied to the large corpus, and the case frames are upgraded by incorporating newly acquired information. Case frames are gradually enriched in this way. We constructed case frames from 26 years of newspaper articles consisting of approximately 26 million sentences. The case frames are evaluated manually as well as through syntactic and case analyses. The results demonstrate the effectiveness of the constructed case frames.
A simpler distribution that fits empirical word distributions about as well as a negative binomial is the Katz K mixture. In the K mixture model, the basic assumption is that the conditional probabilities of repeats for a given word are determined by a constant decay factor that is independent of the number of occurrences that have already taken place. However, for content-bearing words that have occurred only a few times, the probabilities of repeat occurrences are generally lower than the constant decay factor. To address this deficiency of the K mixture model, we conducted an in-depth exploration of the characteristics of the conditional probabilities of repetitions, decay factors, and their influence on modeling term distributions. Based on the results of this study, it appears that both ends of the distribution can be used to fit models. That is, document frequencies can be used when a word has few instances, and tail probabilities (accumulated document frequencies) can be used when it has many. Both document frequencies for few instances of a word and tail probabilities for large numbers of instances are often relatively easy to estimate empirically. Therefore, we propose an effective approach for improving the K mixture model, in which the decay factor is a combination of two possible decay factors, interpolated by a function that depends on the number of instances of a word in a document. Results show that the proposed model yields statistically significantly better frequency estimates, especially for words with two instances in a document. In addition, the advantages of this approach become more evident in two cases: modeling the term distribution of frequently used content-bearing words, and modeling the term distribution of a corpus with a wide range of document lengths.
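As a point of reference, the baseline K mixture (not the improved model proposed here) can be sketched as below. The sketch assumes the standard textbook parameterization from collection frequency cf, document frequency df, and number of documents N; the function and argument names are illustrative only.

```python
def k_mixture_pmf(k, cf, df, n_docs):
    """Probability that a word occurs exactly k times in a document
    under the baseline Katz K mixture.

    cf:     total occurrences of the word in the collection
    df:     number of documents containing the word
    n_docs: number of documents in the collection
    """
    if cf <= df:
        # beta would be 0, so the decay factor is undefined.
        raise ValueError("the K mixture needs cf > df")
    lam = cf / n_docs          # average occurrences per document
    beta = (cf - df) / df      # extra occurrences per containing document
    alpha = lam / beta
    decay = beta / (beta + 1)  # constant ratio P(k+1) / P(k) for k >= 1
    p = alpha / (beta + 1) * decay ** k
    if k == 0:
        p += 1 - alpha
    return p
```

For example, with cf = 300, df = 100, and N = 1000, the decay factor is 2/3 for every k >= 1. This constancy is exactly the assumption that the interpolated decay factor is meant to relax.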
Preference dependency grammar (PDG) is a framework for integrating morphological, syntactic, and semantic analyses. PDG provides packed shared data structures that can efficiently encompass all possible interpretations, with preference scores, at each level of sentence analysis. Using these structures, PDG can calculate a globally optimized interpretation for the target sentence. This paper first gives an overview of the PDG framework by describing the base model of PDG, a sentence analysis model called the “multi-level packed shared data connection model.” It then describes the packed shared data structures adopted in PDG, e.g., headed parse forests and dependency forests. Finally, the completeness and soundness of the mapping between the parse forest and the dependency forest are shown.
We present an integrated probabilistic model for Japanese syntactic and case structure analysis. Syntactic and case structures are simultaneously analyzed on the basis of wide-coverage case frames that are constructed from a huge raw corpus in an unsupervised manner. This model selects the syntactic and case structures that have the highest generative probability. We evaluate both syntactic structure and case structure. In particular, the experimental results for syntactic analysis on web sentences show that the proposed model significantly outperforms existing syntactic analyzers.
The semantic relations between words are essential for natural language understanding. Toward deeper natural language understanding, we semi-automatically constructed a domain dictionary that represents the domain relations between fundamental Japanese words. Our method does not require a document collection. As a task-based evaluation of the domain dictionary, we categorized blogs by assigning a domain to each word in a blog article and categorizing the article under its most dominant domain. In doing so, we dynamically estimated the domains of unknown words (i.e., those not listed in the domain dictionary). Our blog categorization achieved an accuracy of 94.0% (564/600). Moreover, the domain estimation technique for unknown words achieved an accuracy of 76.6% (383/500).
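The categorization step described above (one domain per word, then the most dominant domain per article) can be sketched as a simple vote. This is a minimal sketch under stated assumptions: the dictionary format, the function name, and the example domains are hypothetical, and unknown words are simply skipped here rather than having their domains estimated dynamically as in the paper.

```python
from collections import Counter


def dominant_domain(words, domain_of, default=None):
    """Return the domain assigned to the most words of an article.

    words:     list of word tokens from one article
    domain_of: {word: domain} lookup (hypothetical dictionary format)
    default:   value returned when no word has a known domain
    """
    # Words missing from the dictionary are ignored in this sketch.
    counts = Counter(domain_of[w] for w in words if w in domain_of)
    if not counts:
        return default
    return counts.most_common(1)[0][0]
```

A call such as `dominant_domain(["pitcher", "inning", "stock"], {"pitcher": "SPORTS", "inning": "SPORTS", "stock": "BUSINESS"})` picks the domain with the most votes. Weighting words or breaking ties differently would be a natural variation.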
Numerous studies have applied machine-learning approaches to semantic role labeling, made possible by the availability of corpora such as FrameNet and PropBank. These corpora define frame-specific semantic roles for each frame, which is problematic for a machine-learning approach because the corpora contain a number of infrequent roles that hinder efficient learning. This paper focuses on the generalization problem of semantic roles in a semantic role labeling task. We compare existing generalization criteria with our novel criteria, and clarify the characteristics of each criterion. We also show that using multiple generalization criteria in a single model improves the performance of semantic role classification. In experiments on FrameNet, we achieved a 19.16% error reduction in terms of total accuracy, and 7.42% in macro-averaged F1. On PropBank, we reduced errors by 24.07% in total accuracy, and by 26.39% in the evaluation for unseen verbs.
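The error reduction figures are presumably relative reductions of the error rate, i.e., the share of the baseline's errors that the new model removes. Under that assumption, the computation can be sketched as follows; the function name and the accuracies in the usage note are illustrative and are not taken from the paper.

```python
def relative_error_reduction(baseline_score, new_score):
    """Fraction of the baseline's errors removed by the new system.

    Both scores are assumed to lie in [0, 1], with 1 meaning no errors
    (for example accuracy or F1).
    """
    baseline_error = 1.0 - baseline_score
    if baseline_error <= 0:
        raise ValueError("the baseline makes no errors to reduce")
    new_error = 1.0 - new_score
    return (baseline_error - new_error) / baseline_error
```

For instance, moving from a hypothetical accuracy of 0.80 to 0.85 removes a quarter of the errors, so the relative error reduction is about 0.25.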
This paper considers different measures that might become constants for any length of a given natural language text. Such measures indicate a potential for studying the complexity of natural language but have previously only been studied using relatively small English texts. In this study, we consider measures for texts in languages other than English, and for large-scale texts. Among the candidate measures, we consider Yule's K, Orlov's Z, and Golcher's VM, each of whose convergence has been previously argued empirically. Furthermore, we introduce entropy H, and a measure, r, related to the scale-free property of language. Our experiments show that both K and VM are convergent for texts in various languages, whereas the other measures are not.
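Of the measures listed, Yule's K is the simplest to state: with N tokens and V(m, N) word types occurring exactly m times, K = 10^4 (sum over m of m^2 V(m, N) - N) / N^2. A minimal sketch of this standard definition is given below; whitespace tokenization and the function name are assumptions of the example.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K for a sequence of tokens, with the usual 10^4 scaling."""
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    freq = Counter(tokens)             # word type -> frequency m
    spectrum = Counter(freq.values())  # m -> number of types with frequency m
    second_moment = sum(m * m * v for m, v in spectrum.items())
    return 1e4 * (second_moment - n) / (n * n)
```

Calling `yules_k(text.split())` on growing prefixes of a text gives a simple way to inspect whether the value stabilizes as the text grows. A text in which every token is distinct yields K = 0.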
Word boundaries within noun compounds in a number of languages, including Japanese, are not marked by white spaces. Thus, it is beneficial for various NLP applications to split such noun compounds. In the case of Japanese, noun compounds composed of katakana words are particularly difficult to split because katakana words are highly productive and are often out of vocabulary. Therefore, we propose using paraphrasing and back-transliteration of katakana noun compounds to split them. In experiments, splitting models constructed from paraphrases and back-transliterations extracted from unlabeled textual data improved splitting accuracy with statistical significance.
Most relevance feedback methods re-rank search results using only the information of surface words in texts. We present a method that uses not only the information of surface words but also that of latent words that are inferred from texts. We infer the latent word distribution of each document in the search results using latent Dirichlet allocation (LDA). When feedback is given, we also infer the latent word distribution of the feedback using LDA. We calculate the similarities between the user feedback and each document in the search results using both the surface and latent word distributions, and re-rank the search results on the basis of these similarities. Evaluation results show that when user feedback consisting of two documents (3,589 words) is given, the proposed method improves the initial search results by 27.6% in precision at 10 (P@10). The results also show that the proposed method performs well even when only a small amount of user feedback is available. For example, an improvement of 5.3% in P@10 was achieved when the user feedback consisted of only 57 words.
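The re-ranking and evaluation steps can be sketched as follows. This is an illustration under stated assumptions rather than the method itself: the distributions are taken as given (LDA inference is not shown), and the cosine similarity, the linear mixing weight `mix`, and all function names are choices made for this example.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse distributions {word: weight}."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def rerank(docs, feedback, mix=0.5):
    """Re-rank documents by similarity to the user feedback.

    docs:     list of (doc_id, surface_dist, latent_dist)
    feedback: (surface_dist, latent_dist) for the feedback text
    mix:      hypothetical weight on the surface-word similarity
    """
    fb_surface, fb_latent = feedback
    scored = [
        (mix * cosine(s, fb_surface) + (1 - mix) * cosine(l, fb_latent), doc_id)
        for doc_id, s, l in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


def precision_at_k(ranked_ids, relevant, k=10):
    """Fraction of the top k results that are relevant (P@k)."""
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k
```

Comparing `precision_at_k` on the initial ranking and on the output of `rerank` gives the kind of P@10 improvement reported above. Setting `mix=1.0` reduces the sketch to a surface-word-only baseline.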
This paper shows how to correct grammatical errors in Japanese particles made by learners of Japanese. Our method is based on discriminative sequence conversion, which converts one sequence of words into another and corrects particle errors by substitution, insertion, or deletion. However, it is difficult to collect large learners’ corpora. We solve this problem with a discriminative learning framework that uses the following two methods. First, language model probabilities obtained from large, raw text corpora are combined with n-gram binary features obtained from learners’ corpora. This method is applied to measure the accuracy of Japanese sentences. Second, automatically generated pseudo-error sentences are added to learners’ corpora to enrich the corpora directly. Furthermore, we apply domain adaptation, in which the pseudo-error sentences (the source domain) are adapted to the real error sentences (the target domain). Experiments show that the recall rate is improved by using both language model probabilities and n-gram binary features. Stable improvement is achieved by using pseudo-error sentences with domain adaptation.