Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 17, Issue 3
Displaying 1-7 of 7 articles from this issue
Preface
Article
  • Ryo Nishimura, Yasuhiko Watanabe, Masaki Murata, Yasuhito Oota, Yoshih ...
    2010 Volume 17 Issue 3 Pages 3_3-3_23
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    To improve readability, writers often segment a mail text into smaller paragraphs than necessary. This oversegmentation is a problem for mail text processing: it can negatively affect discourse analysis, information extraction, information retrieval, and so on. To address this problem, we propose methods for estimating the connectivity between paragraphs in a mail. In this paper, we compare paragraph-connectivity estimation based on machine learning methods (SVM and Maximum Entropy) with a rule-based method, and show that the machine learning methods outperform the rule-based method.
    Download PDF (508K)
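    The abstract above frames paragraph connectivity as a choice between machine-learned classifiers (SVM, Maximum Entropy) and rules. Below is a minimal sketch, not the authors' implementation, of how adjacent paragraph pairs could be cast as binary classification with an SVM; the features (lexical overlap, paragraph lengths, a connective cue) are illustrative assumptions, not the paper's feature set.

    ```python
    # Sketch: paragraph-pair connectivity as binary classification (assumed features).
    from sklearn.svm import SVC

    def pair_features(para_a: str, para_b: str) -> list:
        """Turn an adjacent paragraph pair into a small numeric feature vector."""
        a, b = set(para_a.split()), set(para_b.split())
        overlap = len(a & b) / max(len(a | b), 1)  # lexical overlap between paragraphs
        starts_with_connective = int(para_b.lstrip().startswith(("However", "Also", "And")))
        return [overlap, len(para_a), len(para_b), starts_with_connective]

    def train(train_pairs, train_labels):
        """train_pairs: [(paragraph_A, paragraph_B)]; labels: 1 = connected, 0 = not."""
        X = [pair_features(a, b) for a, b in train_pairs]
        clf = SVC(kernel="rbf")  # the paper also compares a Maximum Entropy model
        clf.fit(X, train_labels)
        return clf
    ```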
  • Takeshi Abekawa, Manabu Okumura
    2010 Volume 17 Issue 3 Pages 3_25-3_39
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    In this paper, we propose a method for exploring the Japanese construction N1-Adj-N2, which often expresses a relationship between an object (N2), an attribute (N1), and an evaluation of that attribute (Adj). Because this construction connects two nouns, our method builds a graph of the noun relations, which can be regarded as representing selectional restrictions on the arguments of a target adjective. The exploration of N1-Adj-N2 constructions is useful for opinion mining, lexicographical analysis of adjectives, and writing aids, among other applications.
    Download PDF (304K)
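    Since the abstract above describes building a graph over the nouns that a target adjective connects, here is a minimal sketch under my own assumptions (not the paper's method) of collecting N1-Adj-N2 triples for one adjective and linking the N1 and N2 nouns; the example triple is only illustrative.

    ```python
    # Sketch: a noun-relation graph for one target adjective (assumed data format).
    import networkx as nx

    def build_noun_graph(triples, target_adj):
        """triples: iterable of (n1, adj, n2) tuples extracted from a corpus."""
        g = nx.Graph()
        for n1, adj, n2 in triples:
            if adj != target_adj:
                continue
            g.add_edge(n1, n2)  # edge = the two nouns co-occur around target_adj
        return g

    # Illustrative usage: "nedan" (price) as attribute N1, "kuruma" (car) as object N2,
    # for the adjective "takai" (high).
    graph = build_noun_graph([("nedan", "takai", "kuruma")], "takai")
    print(graph.edges())
    ```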
  • Oanh Thi Tran, Cuong Anh Le, Thuy Quang Ha
    2010 Volume 17 Issue 3 Pages 3_41-3_60
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    Word segmentation and POS tagging are two important problems underlying many NLP tasks. They have, however, not drawn much attention from researchers working on Vietnamese. In this paper, we focus on integrating the advantages of several resources to improve the accuracy of Vietnamese word segmentation and POS tagging. For word segmentation, we propose a solution that exploits multiple knowledge resources, including a dictionary-based model, an N-gram model, and a named entity recognition model, and integrates them in a Maximum Entropy model. Experiments on a public corpus show its effectiveness in comparison with the best current models: we obtained an F1 measure of 95.30%. For POS tagging, motivated by research on Chinese and by characteristics of Vietnamese, we present a new kind of feature based on the idea of word composition, which we call morpheme-based features. Our experiments on two POS-tagged corpora show that morpheme-based features consistently give promising results. In the best case, we obtained 89.64% precision on a Vietnamese POS-tagged corpus using a Maximum Entropy model.
    Download PDF (377K)
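    The abstract above combines a dictionary-based model, an N-gram model, and a named entity recognizer inside a Maximum Entropy model. The sketch below is an assumed illustration of that idea, not the authors' code: the outputs of the three knowledge sources become features of a logistic-regression (MaxEnt) classifier that decides whether a candidate position is a word boundary. The feature names and the `ngram_score`/`ne_spans` inputs are hypothetical.

    ```python
    # Sketch: folding several knowledge sources into one MaxEnt boundary classifier.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def boundary_features(position, syllables, dictionary, ngram_score, ne_spans):
        """Hypothetical feature extractor for one candidate word-boundary position."""
        left, right = syllables[position - 1], syllables[position]
        return {
            "in_dictionary": (left + " " + right) in dictionary,  # dictionary-based model
            "ngram_score": ngram_score(left, right),              # n-gram model
            "inside_named_entity": position in ne_spans,          # NER model
            "left_syllable=" + left: 1,
            "right_syllable=" + right: 1,
        }

    vec = DictVectorizer()
    maxent = LogisticRegression(max_iter=1000)  # MaxEnt ~ multinomial logistic regression

    def train(feature_dicts, labels):
        X = vec.fit_transform(feature_dicts)
        maxent.fit(X, labels)
        return maxent
    ```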
  • Kun Yu, Yusuke Miyao, Takuya Matsuzaki, Xiangli Wang, Yaozhong Zhang, ...
    2010 Volume 17 Issue 3 Pages 3_61-3_80
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    Compared with the traditional approach of manually developing a grammar based on linguistic theory, corpus-oriented grammar development is more promising. To develop an HPSG grammar in a corpus-oriented way, a treebank is indispensable. This paper first compares existing Chinese treebanks and chooses one of them as the basic resource for HPSG grammar development. It then proposes a new design of part-of-speech tags, intended to be simple enough to reduce the ambiguity of morphological analysis as much as possible, yet rich enough for HPSG grammar development. Finally, it introduces ongoing work on utilizing a Chinese scientific paper treebank for HPSG grammar development.
    Download PDF (190K)
  • —Towards Translation into Logical Forms—
    Makoto Nakamura, Yusuke Kimura, Minh Quang Nhat Pham, Le Minh Nguyen, ...
    2010 Volume 17 Issue 3 Pages 3_81-3_100
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    This paper reports how to treat legal sentences that include itemized expressions in three languages. We have previously developed a system for translating legal sentences into logical formulae. Although our system basically converts words and phrases in a target sentence into predicates in a logical formula, it generates some useless predicates for itemized and referential expressions. In a previous study focusing on Japanese law, we built a front-end system that substitutes the corresponding referent phrases for these expressions. In this paper, we apply our approach to Vietnamese law and the United States Code. Our linguistic analysis shows differences in notation among languages and nations, and we extract the conventional expressions denoting itemization for each language. The experimental results show high accuracy in generating independent, plain sentences from law articles that include itemization. The proposed system produces meaningful, highly readable text that can be input into our translation system.
    Download PDF (1746K)
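    The abstract above describes turning a law article with an item list into independent, plain sentences by substituting the items for the phrase that refers to them. The following is a minimal sketch under my own assumptions about the notation; the referring phrase, the example sentence, and the items are all hypothetical and not taken from the paper.

    ```python
    # Sketch: expanding an itemized legal provision into independent plain sentences.
    def expand_items(stem: str, items: list, placeholder: str = "any of the following") -> list:
        """Replace the referring phrase in the stem sentence with each item in turn."""
        return [stem.replace(placeholder, item) for item in items]

    # Hypothetical example:
    sentences = expand_items(
        "A person who commits any of the following shall be punished.",
        ["theft", "fraud"],
    )
    # -> ["A person who commits theft shall be punished.",
    #     "A person who commits fraud shall be punished."]
    print(sentences)
    ```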
  • Eric Nichols, Francis Bond, D. Scott Appling, Yuji Matsumoto
    2010 Volume 17 Issue 3 Pages 3_101-3_122
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    Large amounts of data are essential for training statistical machine translation systems. In this paper we show how training data can be expanded by paraphrasing one side of a parallel corpus. The new data are produced by parsing and then generating with an open-source, precise HPSG-based grammar, yielding sentences with the same meaning but with minor variations in lexical choice and word order. In experiments paraphrasing the English side of the Tanaka Corpus, a freely available Japanese-English parallel corpus, we show consistent, statistically significant gains on training data sets ranging from 10,000 to 147,000 sentence pairs, as evaluated by the BLEU and METEOR automatic evaluation metrics.
    Download PDF (261K)
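    The abstract above expands a parallel corpus by paraphrasing its English side and keeping the Japanese side fixed. Below is a minimal sketch of that data-augmentation step under my own assumptions; `paraphrase_english` is a hypothetical stand-in for the HPSG parse-then-generate stage and is not a real API from the paper.

    ```python
    # Sketch: augmenting a Japanese-English parallel corpus with English paraphrases.
    def paraphrase_english(sentence: str) -> list:
        """Hypothetical: return meaning-preserving variants of `sentence`.
        Replace this stub with HPSG-based parsing + generation output."""
        return []

    def expand_corpus(pairs):
        """pairs: iterable of (japanese, english) sentence pairs."""
        expanded = []
        for ja, en in pairs:
            expanded.append((ja, en))              # keep the original pair
            for variant in paraphrase_english(en):
                expanded.append((ja, variant))     # add a pair per English paraphrase
        return expanded
    ```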