In this paper, we solve the problem of extending various thesauri using a single method. A thesaurus should be extended when unregistered terms are identified. Various thesauri are available, each constructed according to its own design principle. We formalise the extension of one thesaurus as a single classification problem in machine learning, so that extending several thesauri means solving several different classification problems. Applying existing classification methods to each thesaurus separately is time consuming, particularly if many thesauri must be extended. Thus, we propose a method to reduce the time required to extend multiple thesauri. In the proposed method, we first generate clusters of terms that are candidates for synonym sets, without using the thesauri, by applying formal concept analysis to the syntactic information of terms in a corpus. Reliable syntactic parsers are easy to use; thus, syntactic information is available for more terms than semantic information. Using this syntactic information, for each thesaurus and each unregistered term, we can quickly search the candidate clusters for a correct synonym set, which makes classification fast. Experimental results demonstrate that the proposed method is faster than existing methods while achieving comparable classification accuracy.
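To make the clustering step more concrete, the following is a minimal sketch of formal concept analysis over a toy term-by-context table. It is an illustration under stated assumptions, not the implementation described above: the function name `formal_concepts`, the attribute strings such as `obj_of:drive`, and the toy data are all hypothetical, and the brute-force enumeration is exponential in the number of terms, so it is only suitable for tiny examples.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate all formal concepts of a small context.

    context maps each term (object) to the set of syntactic
    attributes it occurs with. A formal concept is a pair
    (extent, intent): the extent is exactly the set of terms that
    share every attribute in the intent, and the intent is exactly
    the set of attributes shared by every term in the extent.
    """
    terms = list(context)
    all_attrs = set().union(*context.values())
    concepts = set()
    for r in range(len(terms) + 1):
        for subset in combinations(terms, r):
            # Intent: attributes common to every term in the subset.
            intent = set(all_attrs)
            for term in subset:
                intent &= context[term]
            # Extent: every term that has all attributes in the intent.
            extent = {t for t in terms if intent <= context[t]}
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# Hypothetical syntactic contexts (verb-object relations) for four terms.
context = {
    "car": {"obj_of:drive", "obj_of:park"},
    "automobile": {"obj_of:drive", "obj_of:park"},
    "truck": {"obj_of:drive", "obj_of:load"},
    "apple": {"obj_of:eat"},
}

concepts = formal_concepts(context)

# Treat extents with at least two terms and a non-empty intent as
# candidate synonym clusters (an assumption made for this sketch).
candidates = sorted(
    sorted(extent)
    for extent, intent in concepts
    if len(extent) >= 2 and intent
)
print(candidates)
```

In this sketch, terms that share many syntactic contexts end up in the same extent, which is the sense in which such clusters can serve as candidate synonym sets before any thesaurus is consulted.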
Nonlocal dependencies represent syntactic phenomena such as wh-movement, A-movement in passives, topicalization, raising, control, and right node raising. Nonlocal dependencies play an important role in semantic interpretation. This paper proposes a left-corner parser that identifies nonlocal dependencies. Our parser integrates nonlocal dependency identification into a transition-based system. We adopt a left-corner strategy in order to use the syntactic relation c-command, which plays an important role in nonlocal dependency identification. To utilize the global features captured by nonlocal dependencies, our parser uses a structured perceptron. In experimental evaluations, our parser achieved a good balance between constituent parsing and nonlocal dependency identification.
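As background for the c-command relation mentioned above, here is a small sketch that checks c-command on a constituency tree. It uses one common textbook definition (A c-commands B if neither dominates the other and the lowest branching node properly dominating A also dominates B). The tree encoding, the node addresses, and the function names are assumptions made for illustration; this is not the parser described in the abstract.

```python
def subtree(tree, addr):
    """Return the node reached by following child indices in addr."""
    for i in addr:
        tree = tree[1][i]
    return tree


def dominates(a, b):
    """True if the node at address a properly dominates the node at b."""
    return len(a) < len(b) and b[:len(a)] == a


def c_commands(tree, a, b):
    """True if the node at address a c-commands the node at address b."""
    if a == b or dominates(a, b) or dominates(b, a):
        return False
    # Climb from a's parent to the lowest branching ancestor of a.
    anc = a[:-1]
    while anc and len(subtree(tree, anc)[1]) < 2:
        anc = anc[:-1]
    return dominates(anc, b)


# Hypothetical tree for "who (did) John see *T*"; nodes are
# (label, children) pairs and leaves are plain strings.
tree = ("SBAR", [
    ("WHNP", ["who"]),
    ("S", [
        ("NP", ["John"]),
        ("VP", [("V", ["see"]), ("NP", ["*T*"])]),
    ]),
])

wh_phrase = (0,)       # address of WHNP
trace = (1, 1, 1)      # address of the object NP holding the trace

print(c_commands(tree, wh_phrase, trace))
print(c_commands(tree, trace, wh_phrase))
```

In this toy tree the wh-phrase c-commands the trace but not the other way round, which is the kind of asymmetric configuration typically associated with a filler and its gap.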
Language modeling is a fundamental research problem with wide application in many NLP tasks. To estimate the probabilities of natural language sentences, most research on language modeling uses n-gram based approaches to factor sentence probabilities. However, the assumption underlying n-gram models is not robust enough to cope with the data sparseness problem, which affects the final performance of language models. In this paper, we propose a generalized hierarchical word sequence framework, in which different word association scores can be adopted to rearrange word sequences in a totally unsupervised fashion. Unlike the n-gram model, which factors sentence probability from left to right, our model factors it using a more flexible strategy. For evaluation, we compare our rearranged word sequences to normal n-gram word sequences. Both intrinsic and extrinsic experiments show that our language model achieves better performance, indicating that our method can be considered a better alternative to n-gram language models.
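For reference, the left-to-right factorization used by n-gram models can be sketched with a toy bigram model using unsmoothed maximum-likelihood estimates. This illustrates only the baseline factorization and its sparseness problem, not the hierarchical word sequence framework itself; the function names and the two-sentence corpus are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over sentences padded with <s> and </s>."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return histories, bigrams


def sentence_logprob(sentence, histories, bigrams):
    """Log-probability under P(w_1..w_m) = prod_i P(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        count = bigrams[(prev, cur)]
        if count == 0:
            # Unseen bigram: without smoothing the whole sentence gets
            # probability zero, which is the data sparseness problem.
            return float("-inf")
        logp += math.log(count / histories[prev])
    return logp


# Hypothetical toy corpus.
corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
histories, bigrams = train_bigram(corpus)

seen = sentence_logprob(["the", "cat", "sleeps"], histories, bigrams)
unseen = sentence_logprob(["the", "cat", "barks"], histories, bigrams)
print(math.exp(seen), unseen)
```

Each word is conditioned only on the word immediately to its left, so any word pair that never occurred adjacently in training receives no probability mass unless smoothing is added.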
Learner English often contains grammatical errors with structural characteristics such as omissions, insertions, substitutions, and word order errors. These errors are not covered by the existing context-free grammar (CFG) rules. Therefore, it is far from straightforward to annotate learner English with phrase structures. Because of this limitation, there has been almost no work on phrase structure annotation for learner corpora despite its importance and usefulness. To address this issue, we propose a phrase structure annotation scheme for learner English that consists of five principles. We apply the annotation scheme to two different learner corpora and report (i) its effectiveness at consistently annotating learner English with phrase structure (i.e., high inter-annotator agreement); (ii) the structural characteristics (CFG rules) of learner English obtained from the annotated corpora; and (iii) for the first time, phrase structure parsing performance on learner English. We also release the annotation guidelines, the annotated data, and the parser model to the public.
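As an illustration of how CFG rules can be read off phrase-structure annotations, the sketch below parses Penn-style bracketed trees and counts the non-lexical productions. The two example trees, their labels, and the function names are hypothetical and do not follow the released annotation guidelines; they only show the general technique.

```python
from collections import Counter


def parse_tree(text):
    """Parse a bracketed tree string into nested (label, children) pairs."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def count_rules(tree, counts):
    """Add every non-lexical production LHS -> RHS in tree to counts."""
    label, children = tree
    if all(isinstance(child, str) for child in children):
        return  # preterminal over a word: skip lexical rules
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            count_rules(child, counts)


# Hypothetical learner sentences with missing determiners.
treebank = [
    "(S (NP (PRP I)) (VP (VBP like) (NP (NN apple))))",
    "(S (NP (PRP He)) (VP (VBP go) (PP (TO to) (NP (NN school)))))",
]

rule_counts = Counter()
for line in treebank:
    count_rules(parse_tree(line), rule_counts)

for (lhs, rhs), n in rule_counts.most_common():
    print(f"{lhs} -> {' '.join(rhs)}\t{n}")
```

Comparing such rule counts between a learner corpus and a native-speaker treebank is one simple way to surface productions that are characteristic of learner English.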