To improve readability, email text is often segmented into smaller paragraphs than necessary. This oversegmentation is a problem for email text processing: it can negatively affect discourse analysis, information extraction, information retrieval, and other tasks. To address this problem, we propose methods for estimating the connectivity between paragraphs in an email. In this paper, we compare paragraph connectivity estimation based on machine learning methods (SVM and Maximum Entropy) with a rule-based method and show that the machine learning methods outperform the rule-based one.
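The connectivity task above can be framed as binary classification over adjacent paragraph pairs. The sketch below illustrates that framing only: a simple perceptron stands in for the SVM/ME learners, and the surface features are illustrative, not the paper's actual feature set.

```python
# Minimal sketch: paragraph-connectivity estimation as binary classification.
# A perceptron stands in for SVM/ME; features are hypothetical examples.

def pair_features(prev_par, next_par):
    """Features for deciding whether two adjacent paragraphs connect."""
    return [
        1.0 if prev_par.rstrip().endswith((".", "!", "?")) else 0.0,  # prev ends a sentence
        1.0 if next_par[:1].islower() else 0.0,                       # next starts mid-sentence
        len(prev_par.split()) / 50.0,                                 # normalized prev length
    ]

def train(data, epochs=20):
    """Train a perceptron on (feature_vector, label) pairs."""
    w, b = [0.0] * 3, 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:  # update weights only on a mistake
                w = [wi + (y - pred) * xi for wi, xi in zip(w, x)]
                b += y - pred
    return w, b

# Toy examples: label 1 = the two paragraphs should be joined.
pairs = [
    (("Thanks for your mail.", "I will reply soon."), 0),
    (("We propose a new", "method for parsing."), 1),
    (("See the attached file.", "Best regards,"), 0),
    (("The results were", "better than expected."), 1),
]
w, b = train([(pair_features(p, n), y) for (p, n), y in pairs])
score = sum(wi * xi for wi, xi in zip(w, pair_features("This sentence is cut", "off in the middle.")))
connected = 1 if score + b > 0 else 0  # 1: merge the two paragraphs
```

In practice the learned decision would be applied to every adjacent paragraph pair in a mail, merging those predicted as connected.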
In this paper, we propose a method for exploring the Japanese construction N1-Adj-N2, which often establishes a relationship between an object (N2), an attribute (N1), and an evaluation of that attribute (Adj). Because this construction connects two nouns, our method constructs a graph of the noun relations, which can be regarded as representing selectional restrictions on the arguments of a target adjective. The exploration of N1-Adj-N2 constructions is useful for opinion mining, the lexicographical analysis of adjectives, and writing aids, among other applications.
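The noun-relation graph can be sketched as follows: for each adjective, collect the attribute nouns (N1) and object nouns (N2) that co-occur with it in N1-Adj-N2 matches. The triples below are illustrative examples, not data from the paper.

```python
# Minimal sketch of the noun-relation graph for N1-Adj-N2 matches.
from collections import defaultdict

# (N1 attribute, Adj, N2 object); e.g. 値段の高い店 "a shop whose price is high".
# These triples are illustrative only.
triples = [
    ("値段", "高い", "店"),
    ("背", "高い", "人"),
    ("値段", "安い", "店"),
]

graph = defaultdict(lambda: {"N1": set(), "N2": set()})
for n1, adj, n2 in triples:
    graph[adj]["N1"].add(n1)  # attributes the adjective evaluates
    graph[adj]["N2"].add(n2)  # objects the adjective can describe
```

The N1 and N2 sets collected for a given adjective approximate its selectional restrictions in the sense described above.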
Word segmentation and POS tagging are two important problems underlying many NLP tasks. They have, however, not drawn much attention from researchers working on Vietnamese. In this paper, we focus on integrating the advantages of several resources to improve the accuracy of Vietnamese word segmentation and POS tagging. For word segmentation, we propose a solution that draws on multiple knowledge resources, including a dictionary-based model, an N-gram model, and a named entity recognition model, and integrates them into a Maximum Entropy model. Experiments on a public corpus show its effectiveness in comparison with the best current models: we obtained an F1 score of 95.30%. For POS tagging, motivated by research on Chinese and by the characteristics of Vietnamese, we present a new kind of feature based on the idea of word composition, which we call morpheme-based features. Our experiments on two POS-tagged corpora show that morpheme-based features consistently give promising results; in the best case, we obtained 89.64% precision on a Vietnamese POS-tagged corpus using a Maximum Entropy model.
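The integration of knowledge resources into a Maximum Entropy model can be pictured as feature extraction over candidate word boundaries. In the sketch below, the dictionary, bigram counts, and named-entity gazetteer are toy stand-ins for the paper's actual resources, and only the feature-extraction step is shown, not the MaxEnt training itself.

```python
# Illustrative sketch: combining knowledge resources as MaxEnt features
# for Vietnamese word segmentation. All resources here are toy examples.
DICTIONARY = {"học sinh", "sinh viên", "Việt Nam"}        # dictionary-based model
BIGRAM_COUNTS = {("học", "sinh"): 12, ("sinh", "viên"): 9}  # N-gram model
NE_GAZETTEER = {"Việt Nam"}                                # NER model

def boundary_features(syllables, i):
    """Features describing whether syllables i and i+1 form one word."""
    feats = {"dict_match": 0, "bigram_freq": 0, "ne_match": 0}
    if i + 1 < len(syllables):
        pair = " ".join(syllables[i:i + 2])
        feats["dict_match"] = 1 if pair in DICTIONARY else 0
        feats["bigram_freq"] = BIGRAM_COUNTS.get((syllables[i], syllables[i + 1]), 0)
        feats["ne_match"] = 1 if pair in NE_GAZETTEER else 0
    return feats

feats = boundary_features(["học", "sinh", "giỏi"], 0)
```

Feature vectors of this shape would then be fed to a Maximum Entropy classifier, which learns how much weight to give each knowledge source.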
Compared with the traditional approach of manually developing a grammar based on linguistic theory, corpus-oriented grammar development is more promising. To develop an HPSG grammar in a corpus-oriented way, a treebank is indispensable. This paper first compares existing Chinese treebanks and chooses one of them as the basic resource for HPSG grammar development. It then proposes a new design of part-of-speech tags that is simple enough to reduce the ambiguity of morphological analysis as much as possible, yet rich enough for HPSG grammar development. Finally, it introduces ongoing work on utilizing a Chinese scientific paper treebank in HPSG grammar development.
This paper reports how to treat legal sentences containing itemized expressions in three languages. We have previously developed a system for translating legal sentences into logical formulae. Although our system basically converts words and phrases in a target sentence into predicates in a logical formula, it generates some useless predicates for itemized and referential expressions. In a previous study focusing on Japanese law, we built a front-end system that substitutes the corresponding referent phrases for these expressions. In this paper, we apply our approach to Vietnamese law and the United States Code. Our linguistic analysis reveals differences in notation among languages and legal systems, and we extract the conventional expressions denoting itemization in each language. Experimental results show high accuracy in generating independent, plain sentences from law articles containing itemization. The proposed system generates meaningful, highly readable text that can be input into our translation system.
Large amounts of data are essential for training statistical machine translation systems. In this paper we show how training data can be expanded by paraphrasing one side of a parallel corpus. The new data is made by parsing and then generating with an open-source, precise HPSG-based grammar. This gives sentences with the same meaning but minor variations in lexical choice and word order. In experiments paraphrasing the English side of the Tanaka Corpus, a freely available Japanese-English parallel corpus, we show consistent, statistically significant gains on training data sets ranging from 10,000 to 147,000 sentence pairs, as evaluated by the BLEU and METEOR automatic evaluation metrics.
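The one-sided expansion scheme can be sketched as follows. The toy paraphrase() below, a single-word lexical substitution, merely stands in for the paper's HPSG parse-and-generate pipeline, and the substitution list is illustrative.

```python
# Sketch: expanding a parallel corpus by paraphrasing the English side.
# paraphrase() is a toy stand-in for HPSG-based parse-then-generate.
SUBSTITUTIONS = {"big": "large", "begin": "start"}  # illustrative word pairs

def paraphrase(sentence):
    """Return variants of `sentence` with one word substituted."""
    words = sentence.split()
    variants = []
    for i, w in enumerate(words):
        if w in SUBSTITUTIONS:
            variants.append(" ".join(words[:i] + [SUBSTITUTIONS[w]] + words[i + 1:]))
    return variants

def expand_corpus(pairs):
    """Keep every original pair and add copies with paraphrased English."""
    out = list(pairs)
    for ja, en in pairs:
        for en2 in paraphrase(en):
            out.append((ja, en2))  # same Japanese side, new English side
    return out

corpus = [("それは大きい犬だ", "it is a big dog")]
expanded = expand_corpus(corpus)
```

The expanded corpus, containing both the originals and the meaning-preserving variants, is then used as SMT training data in place of the original.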