Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 17, Issue 1
Preface
Memorial writing
Paper
  • Takehiko Yoshimi, Katsunori Kotani, Takeshi Kutsumi, Ichiko Sata, Hito ...
    2010 Volume 17 Issue 1 Pages 1_7-1_28
    Published: 2010
    Released on J-STAGE: June 30, 2011
    JOURNAL FREE ACCESS
    This paper presents a method for automatically evaluating the fluency of machine-translated sentences. We constructed a classifier that distinguishes machine translations from human translations, using Support Vector Machines as the machine learning algorithm. As a clue to the distinction, we focused on literal (word-for-word) translations. The classifier was built on features derived from word alignment distributions between source sentences and human/machine translations. Our method uses parallel corpora to construct the classifier but requires neither manually labeled training examples nor multiple reference translations to evaluate new sentences. We confirmed that our method can assist evaluation at the system level. We also found that it can identify qualitative characteristics of machine translations, which would greatly help improve translation quality. (A toy sketch of this classification setup follows this entry.)
    Download PDF (413K)
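A minimal sketch of the setup described in the abstract above, offered only as an illustration: scikit-learn's SVC stands in for the SVM, and the three alignment-derived features (monotonicity and source/target coverage, crude proxies for word-for-word literalness) are invented here; the paper's actual feature set differs.

```python
# Toy human-vs-machine translation classifier over word-alignment links.
# Features are illustrative stand-ins, not the paper's feature set.
from sklearn.svm import SVC

def alignment_features(alignment, src_len, tgt_len):
    """Simple features over word-alignment links (src_idx, tgt_idx)."""
    if not alignment:
        return [0.0, 0.0, 0.0]
    # Monotonicity: fraction of consecutive links that preserve source
    # order, a crude proxy for literal, word-for-word translation.
    ordered = sorted(alignment)
    monotone = sum(
        1 for (a, b), (c, d) in zip(ordered, ordered[1:]) if b <= d
    ) / max(len(alignment) - 1, 1)
    coverage_src = len({s for s, _ in alignment}) / src_len
    coverage_tgt = len({t for _, t in alignment}) / tgt_len
    return [monotone, coverage_src, coverage_tgt]

# Training pairs come from a parallel corpus: human translations (label 1)
# versus machine translations (label 0), so no manual labeling is needed.
X = [alignment_features([(0, 0), (1, 1), (2, 2)], 3, 3),   # human-like
     alignment_features([(0, 2), (1, 0), (2, 1)], 3, 3)]   # machine-like
y = [1, 0]

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([alignment_features([(0, 0), (1, 2), (2, 1)], 3, 3)]))
```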
  • Tomohisa Sano, Shiho Hoshi Nobesawa, Hiroyuki Okamoto, Hiroya Susuki, ...
    2010 Volume 17 Issue 1 Pages 1_29-1_54
    Published: 2010
    Released on J-STAGE: June 30, 2011
    JOURNAL FREE ACCESS
    There has been much research on toponym resolution as an approach to the unknown word problem. In this paper we propose an area candidate estimation method that assigns area information to unknown toponyms. Our aim is to extend the target toponyms to unrestricted domains, so we aim for a simple system that avoids gazetteers and context information and relies only on surface information to estimate the candidate areas to which a toponym may belong. Toponym resolution can be difficult for linguistic or geographic reasons. Focusing on surface differences among probable countries, we constructed a system with a reduction phase for a rough examination and a selection phase for a detailed examination of the remaining candidates. By combining these two phases effectively, we achieved a high precision rate while maintaining a high recall rate. (A toy two-phase sketch follows this entry.)
    Download PDF (621K)
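A hedged sketch of the two-phase, surface-only idea above. Character bigrams stand in for the paper's surface features, and the per-country statistics below are invented for illustration.

```python
# Two-phase area estimation from surface form only: a cheap reduction
# phase prunes unlikely countries, then a selection phase ranks survivors.
from collections import Counter

def char_ngrams(word, n=2):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Hypothetical per-country character-bigram counts from known toponyms.
COUNTRY_NGRAMS = {
    "Japan":   Counter({"aw": 5, "wa": 7, "ka": 9, "sa": 6}),
    "Germany": Counter({"be": 8, "er": 9, "rg": 7, "ur": 5}),
    "Italy":   Counter({"na": 6, "no": 7, "to": 5, "ri": 6}),
}

def estimate_areas(toponym, keep=2):
    grams = char_ngrams(toponym.lower())
    scores = {
        country: sum(counts[g] for g in grams)
        for country, counts in COUNTRY_NGRAMS.items()
    }
    # Reduction phase: rough examination, drop clearly unlikely countries.
    survivors = sorted(scores, key=scores.get, reverse=True)[:keep]
    # Selection phase: detailed examination among survivors (here just a
    # re-ranking; the paper uses finer surface evidence).
    return sorted(survivors, key=scores.get, reverse=True)

print(estimate_areas("Kawasaki"))   # -> ['Japan', ...]
```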
  • Yugo Murawaki, Sadao Kurohashi
    2010 Volume 17 Issue 1 Pages 1_55-1_75
    Published: 2010
    Released on J-STAGE: June 30, 2011
    JOURNAL FREE ACCESS
    To solve the unknown morpheme problem in Japanese morphological analysis, we propose a novel framework of online unknown morpheme acquisition and its implementation. In online unknown morpheme acquisition, an unknown morpheme acquirer, which works in concert with the morphological analyzer, detects unknown morphemes in each segmented and POS-tagged sentence, enumerates their possible interpretations, and selects the best candidate. Enumeration exploits morphological constraints of the Japanese language, and selection is done by comparing multiple examples kept in storage. When the number of examples being compared is large enough for disambiguation, the acquirer directly updates the analyzer's dictionary, and the acquired morpheme is used in subsequent analysis. Experiments show that unknown morphemes are acquired with high accuracy from relatively small numbers of examples and improve the quality of morphological analysis. (A schematic sketch of the acquisition loop follows this entry.)
    Download PDF (1007K)
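A schematic sketch of the acquisition loop above: evidence for each detected unknown morpheme accumulates across sentences, and once enough examples are stored, the best candidate is written back into the analyzer's dictionary. All interfaces, the candidate enumeration rule, and the threshold are hypothetical placeholders.

```python
# Online unknown morpheme acquisition, reduced to its control flow.
from collections import defaultdict

MIN_EXAMPLES = 5  # evidence threshold before committing to the dictionary

class UnknownMorphemeAcquirer:
    def __init__(self, dictionary):
        self.dictionary = dictionary          # surface -> POS
        self.examples = defaultdict(list)     # surface -> candidate POS list

    def enumerate_candidates(self, surface):
        # Stand-in for Japanese morphological constraints: here, a surface
        # ending in る is treated as a verb candidate, otherwise a noun.
        return ["verb"] if surface.endswith("る") else ["noun"]

    def observe(self, unknown_surface):
        self.examples[unknown_surface].extend(
            self.enumerate_candidates(unknown_surface))
        stored = self.examples[unknown_surface]
        if len(stored) >= MIN_EXAMPLES:
            # Select the best-supported candidate and update the dictionary
            # so that subsequent analyses use the acquired morpheme.
            best = max(set(stored), key=stored.count)
            self.dictionary[unknown_surface] = best

dic = {}
acq = UnknownMorphemeAcquirer(dic)
for _ in range(5):
    acq.observe("ググる")   # a verb coined from "Google"
print(dic)                  # {'ググる': 'verb'}
```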
  • Toru Onoe, Katsuhiro Hirata, Masayuki Okabe, Kyoji Umemura
    2010 Volume 17 Issue 1 Pages 1_77-1_97
    Published: 2010
    Released on J-STAGE: June 30, 2011
    JOURNAL FREE ACCESS
    Feature selection for text classification is a procedure that selects words or strings as features so as to improve classification performance. This operation is especially important when substrings are used as features, because the number of substrings in a given data set is usually quite large.
    In this paper, we focus on substring feature selection and describe a method that uses a statistical score called “adaptation” as the selection measure. Adaptation rests on the assumption that strings that recur within a document have a high probability of being keywords; we expect this property to make it an effective tool for text classification. We compared our method with a state-of-the-art method proposed by Zhang et al. that identifies a substring feature set by removing redundant substrings with similar statistical distributions. We evaluate the classification results by F-measure, the harmonic mean of precision and recall. An experiment on news classification demonstrated that our method outperformed Zhang's by 3.74% on average (improving Zhang's result from 79.65% to 83.39%). In addition, an experiment on spam classification demonstrated that our method outperformed Zhang's by 2.93% (improving Zhang's result from 90.23% to 93.15%). We verified that the difference between the results is significant in each experiment.
    An experiment on news classification shows that our method performs 0.49% worse on average than a method using words as features, although the difference is not significant. In addition, an experiment on spam classification demonstrated that our method outperformed the word-based method by 1.04% (improving its result from 92.11% to 93.15%); we verified that this difference is significant.
    Zhang's method tends to extract substrings so short that it is difficult to recover the original phrases from which they were extracted. This degrades classification performance, because such a substring can be part of many different words, some or most of which are unrelated to the original phrase. Our method, in contrast, avoids this pitfall because it selects substrings that span a limited number of original words; selecting substrings in this manner is the key advantage of our method. (A minimal sketch of the adaptation score follows this entry.)
    Download PDF (404K)
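A minimal sketch of the adaptation idea above: among documents that contain a substring at all, how often does it recur? The corpus is a toy stand-in, and this is one plausible reading of the score, not the paper's exact formulation.

```python
# Score substrings by "adaptation": substrings that tend to reappear
# within the same document are treated as likely keywords.
def adaptation(substring, documents):
    containing = [d for d in documents if substring in d]
    if not containing:
        return 0.0
    recurring = [d for d in containing if d.count(substring) >= 2]
    return len(recurring) / len(containing)

docs = [
    "the spam filter flags spam messages aggressively",
    "read the message and reply",
    "spam spam spam",
]
for s in ["spam", "the", "essag"]:
    print(s, round(adaptation(s, docs), 2))
# High-adaptation substrings (e.g. "spam") are kept as features;
# low-adaptation substrings are discarded before classification.
```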
  • Toshiaki Nakazawa, Sadao Kurohashi
    2010 Volume 17 Issue 1 Pages 1_99-1_120
    Published: 2010
    Released on J-STAGE: June 30, 2011
    JOURNAL FREE ACCESS
    When aligning very different language pairs, the most important requirements are the use of structural information and the ability to generate one-to-many and many-to-many correspondences. In this paper, we propose a novel phrase alignment method that models word and phrase dependency relations in the dependency tree structures of the source and target languages. The dependency relation model is a kind of tree-based reordering model and can handle non-local reorderings that sequential word-based models often cannot handle properly. The model can also estimate phrase correspondences automatically without any heuristic rules. Alignment experiments show that our model achieves an F-measure 8.5 points higher than the conventional word alignment model with symmetrization algorithms. (A sketch of the alignment F-measure follows this entry.)
    Download PDF (1020K)
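For reference, a short sketch of the alignment F-measure used in evaluations like the one above: precision and recall of predicted alignment links against gold links, combined harmonically. The link sets below are invented.

```python
# Alignment F-measure over sets of (source_index, target_index) links.
def alignment_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    correct = len(predicted & gold)
    precision = correct / len(predicted)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 1), (1, 0), (2, 2), (2, 3)}    # one-to-many link for word 2
pred = {(0, 1), (1, 0), (2, 2)}
print(round(alignment_f1(pred, gold), 3))  # 0.857
```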
  • Shuya Abe, Kentaro Inui, Yuji Matsumoto
    2010 Volume 17 Issue 1 Pages 1_121-1_139
    Published: 2010
    Released on J-STAGE: June 30, 2011
    JOURNAL FREE ACCESS
    Addressing the task of acquiring semantic relations between events from a large corpus, we first argue for the complementarity between the pattern-based relation-oriented approach and the anchor-based argument-oriented approach. We then propose a two-phase approach, which first uses lexico-syntactic patterns to acquire predicate pairs and then uses two types of anchors to identify shared arguments. Empirical evaluation on a large-scale Japanese Web corpus shows that (a) anchor-based filtering substantially improves the precision of predicate pair acquisition, (b) the two types of anchors contribute almost equally, and combining them improves recall without losing precision, and (c) the anchor-based method also achieves high precision in shared argument identification. (A schematic two-phase sketch follows this entry.)
    Download PDF (552K)
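A schematic rendering of the two phases above. The pattern, the corpus, the argument index, and the threshold are toy placeholders; the paper's patterns and anchors are richer.

```python
# Phase 1: lexico-syntactic patterns propose predicate pairs.
# Phase 2: anchors (shared arguments) filter out spurious pairs.
import re

PATTERN = re.compile(r"(\w+) then (\w+)")   # toy "X then Y" pattern

def acquire_pairs(corpus):
    return [m.groups() for line in corpus for m in PATTERN.finditer(line)]

def anchor_filter(pairs, argument_index, min_shared=1):
    # Keep a pair only if some anchor noun occurs as an argument of both
    # predicates somewhere in the corpus (shared-argument evidence).
    kept = []
    for p, q in pairs:
        shared = argument_index.get(p, set()) & argument_index.get(q, set())
        if len(shared) >= min_shared:
            kept.append((p, q, shared))
    return kept

corpus = ["wash then dry", "boil then eat"]
args = {"wash": {"dish", "car"}, "dry": {"dish"},
        "boil": {"egg"}, "eat": {"cake"}}
print(anchor_filter(acquire_pairs(corpus), args))
# [('wash', 'dry', {'dish'})] survives; ('boil', 'eat') is filtered out.
```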
  • Mamoru Komachi, Ryu Iida, Kentaro Inui, Yuji Matsumoto
    2010 Volume 17 Issue 1 Pages 1_141-1_159
    Published: 2010
    Released on J-STAGE: June 30, 2011
    JOURNAL FREE ACCESS
    As fundamental natural language processing techniques like morphological analysis and parsing have become widely used, semantic and discourse analysis has gained increasing attention. In particular, it is essential to identify fundamental elements, or arguments, such as “who” did “what” to “whom.” Predicate argument structure analysis deals with the argument structure of verbs and adjectives; however, not only verbs and adjectives but also nouns are known to have event-hood. We thus propose a machine-learning-based method for automatic argument structure analysis of Japanese event-nouns. Since some event-nouns are ambiguous in terms of event-hood, we decompose argument structure analysis of event-nouns into two subtasks: event-hood determination and argument identification. We propose to use lexico-syntactic patterns mined from large corpora for the first subtask and to exploit the argument-sharing phenomenon between predicates and event-nouns for the second. (A toy two-stage sketch follows this entry.)
    Download PDF (445K)
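A toy two-stage pipeline mirroring the decomposition above: event-hood determination, then argument identification via argument sharing with a governing predicate. The cue list, pattern, and case labels are invented stand-ins for the resources mined in the paper.

```python
# Stage 1: event-hood determination; Stage 2: argument identification.
EVENTIVE_CUES = {"決定", "発表", "攻撃"}   # nouns frequently used as events

def is_eventive(noun, context):
    # Stand-in for lexico-syntactic patterns mined from large corpora,
    # e.g. the noun appearing in a "NOUNを行う" (carry out NOUN) pattern.
    return noun in EVENTIVE_CUES or (noun + "を行う") in context

def identify_arguments(noun, context, predicate_args):
    # Stand-in for exploiting argument sharing: an eventive noun inherits
    # the arguments it shares with a neighbouring predicate.
    return dict(predicate_args) if is_eventive(noun, context) else {}

print(is_eventive("決定", ""))                        # True: event-hood
print(identify_arguments("発表", "", {"ガ": "首相"}))  # {'ガ': '首相'}
```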
  • Hikaru Yokono, Manabu Okumura
    2010 Volume 17 Issue 1 Pages 1_161-1_182
    Published: 2010
    Released on J-STAGE: June 30, 2011
    JOURNAL FREE ACCESS
    This paper describes improvements to the entity grid local coherence model for Japanese text. We investigate the effectiveness of taking into account cohesive devices such as conjunction, explicit reference relations, and lexical cohesion, and of refining the syntactic roles for the Japanese topic marker. To capture lexical cohesion, we use lexical chaining. Through experiments on a discrimination task, where the system must select the more coherent sentence ordering, and on ranking automatically created summaries against human judgments based on quality questions, we show that these factors improve the performance of the entity grid model. (A minimal entity grid sketch follows this entry.)
    Download PDF (452K)
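A minimal entity grid in the spirit of the base model discussed above: rows are sentences, columns are entities, cells are syntactic roles, and features are probabilities of role transitions between adjacent sentences. The sentence annotations are hand-made toys, and the paper's refinements (cohesive devices, topic marker) are not shown.

```python
# Entity grid: role S=subject, O=object, X=other, '-'=absent.
from collections import Counter

sentences = [               # each sentence: {entity: role}
    {"Taro": "S", "book": "O"},
    {"Taro": "S", "library": "X"},
    {"book": "S"},
]

entities = sorted({e for s in sentences for e in s})
grid = [[s.get(e, "-") for e in entities] for s in sentences]

# Count each entity's role transition between adjacent sentences.
transitions = Counter()
for row, nxt in zip(grid, grid[1:]):
    transitions.update(zip(row, nxt))

total = sum(transitions.values())
for pair in sorted(transitions):
    print(pair, transitions[pair] / total)   # e.g. ('S', 'S') for "Taro"
```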
  • Atsushi Fujita, Satoshi Sato
    2010 Volume 17 Issue 1 Pages 1_183-1_219
    Published: 2010
    Released on J-STAGE: June 30, 2011
    JOURNAL FREE ACCESS
    The most critical issue in generating and recognizing paraphrases is the development of a wide-coverage paraphrase knowledge base. To cover paraphrases that are not necessarily representable at the surface level, researchers have attempted to represent them with general transformation patterns. However, this approach does not prevent spurious paraphrases, because there has been no practical method for assessing whether each instance of those patterns properly represents a pair of paraphrases. This paper addresses the measurement of the appropriateness of such automatically generated paraphrases, targeting morpho-syntactic paraphrases of predicate phrases. We first specify the criteria that a pair of expressions must satisfy to be regarded as paraphrases. On the basis of these criteria, we then examine several measures for quantifying the appropriateness of a given pair of expressions as paraphrases of each other. In addition to existing measures, we examine a probabilistic model consisting of two distinct components: a structured N-gram language model that quantifies the grammaticality of automatically generated expressions, and a component that approximates the semantic equivalence and substitutability of the given pair of expressions on the basis of the distributional hypothesis. Through an empirical experiment, we found (i) that contextual similarity is effective in combination with the constituent similarity of morpho-syntactic paraphrases and (ii) that the Web is a versatile resource for representing the characteristics of predicate phrases. (A toy two-component scorer follows this entry.)
    Download PDF (393K)
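A toy rendering of the two-component score above: a language-model term for grammaticality multiplied by a distributional term for substitutability. Both components here are crude stand-ins (an add-one-smoothed bigram LM instead of the paper's structured N-gram model, and cosine over bag-of-context vectors), and all counts are invented.

```python
import math
from collections import Counter

def bigram_logprob(tokens, bigram_counts, unigram_counts, vocab_size):
    # Add-one smoothed bigram LM as a crude grammaticality proxy.
    lp = 0.0
    for a, b in zip(tokens, tokens[1:]):
        lp += math.log((bigram_counts[(a, b)] + 1)
                       / (unigram_counts[a] + vocab_size))
    return lp

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def paraphrase_score(phrase, contexts_src, contexts_tgt, lm):
    grammaticality = math.exp(bigram_logprob(phrase, *lm))
    substitutability = cosine(Counter(contexts_src), Counter(contexts_tgt))
    return grammaticality * substitutability

lm = (Counter({("give", "assistance"): 3}), Counter({"give": 5}), 1000)
score = paraphrase_score(["give", "assistance"],
                         ["offer", "help", "support"],   # contexts of source
                         ["offer", "help", "aid"], lm)   # contexts of target
print(round(score, 6))
```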
  • Naoya Inoue, Ryu Iida, Kentaro Inui, Yuji Matsumoto
    2010 Volume 17 Issue 1 Pages 1_221-1_246
    Published: 2010
    Released on J-STAGE: June 30, 2011
    JOURNAL FREE ACCESS
    An anaphoric relation can be either direct or indirect. In some cases, the antecedent being referred to lies outside the discourse to which its anaphor belongs. An anaphora resolution model therefore needs to make the following two decisions in parallel: antecedent selection (selecting the antecedent itself) and anaphora type classification (classifying an anaphor as direct anaphora, indirect anaphora, or exophora). However, taking these decisions into account in anaphora resolution models raises non-trivial issues, since anaphora type classification has received little attention in the literature. In this paper, taking Japanese as our target language, we address three such issues: (i) how the antecedent selection model should be designed, (ii) what information helps anaphora type classification, and (iii) in what order antecedent selection and anaphora type classification should be carried out. Our findings are as follows. First, an antecedent selection model should be trained separately for each anaphora type, using the information useful for identifying its antecedents. Second, the best candidate antecedent selected by an antecedent selection model provides contextual information useful for anaphora type classification. Finally, antecedent selection should be carried out before anaphora type classification. (A schematic pipeline sketch follows this entry.)
    Download PDF (676K)
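A schematic pipeline reflecting the ordering the paper argues for: select the best candidate antecedent first, then classify the anaphora type using that candidate as context. The scoring function, threshold, and classification rule are hypothetical illustrations, not the paper's models.

```python
# Antecedent selection first, anaphora type classification second.
def select_antecedent(anaphor, candidates, score):
    # The paper trains one selector per anaphora type; one generic
    # scorer stands in for them here.
    return max(candidates, key=lambda c: score(anaphor, c), default=None)

def classify_type(anaphor, best_candidate, score, threshold=0.5):
    if best_candidate is None or score(anaphor, best_candidate) < threshold:
        return "exophora"          # antecedent lies outside the discourse
    if best_candidate["head"] == anaphor["head"]:
        return "direct"
    return "indirect"              # e.g. a bridging reference

toy_score = lambda a, c: 1.0 if c["head"] in a["compatible"] else 0.1
anaphor = {"head": "door", "compatible": {"house", "door"}}
candidates = [{"head": "house"}, {"head": "garden"}]

best = select_antecedent(anaphor, candidates, toy_score)
print(best, classify_type(anaphor, best, toy_score))
# -> {'head': 'house'} indirect   (the door of the house)
```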
Report
  • Yoshihiro Kokubu, Hiroyuki Okano
    2010 Volume 17 Issue 1 Pages 1_247-1_263
    Published: 2010
    Released on J-STAGE: June 30, 2011
    JOURNAL FREE ACCESS
    Rather than a thesaurus specialized for conventional information retrieval, we developed a thesaurus of 420,000 terms for natural language processing tasks such as parsing and term standardization. Because each entry term is linked to a large number of terms through various semantic relations, we introduce facets and classify the related terms so that relatives can be found easily. Furthermore, we mark discriminatory terms and variant Japanese spellings. We also describe points to keep in mind and future tasks in thesaurus construction. Our package can link to the Internet and to other dictionaries. (A toy faceted-entry sketch follows this entry.)
    Download PDF (879K)
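A toy data structure suggesting how a faceted thesaurus entry might be organized: related terms grouped by semantic relation (facet), with variant spellings kept alongside. The facet names and entry contents are invented for illustration, not the report's actual schema.

```python
# One entry of a faceted thesaurus: relations grouped so that relatives
# among many linked terms can be found easily.
thesaurus = {
    "computer": {
        "broader":  ["machine"],
        "narrower": ["laptop", "server"],
        "synonym":  ["calculator (dated)"],
        "spelling_variants": ["コンピュータ", "コンピューター"],
    },
}

def related_terms(term, facet):
    """Look up the related terms of one facet for an entry term."""
    return thesaurus.get(term, {}).get(facet, [])

print(related_terms("computer", "narrower"))            # ['laptop', 'server']
print(related_terms("computer", "spelling_variants"))   # Japanese variants
```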