Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 15, Issue 5
Displaying 1-9 of 9 articles from this issue
  • [in Japanese]
    2008 Volume 15 Issue 5 Pages 1-2
    Published: October 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (240K)
  • Idomucogiin Dawa, Satoshi Nakamura
    2008 Volume 15 Issue 5 Pages 3-21
    Published: October 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
This paper discusses a segmentation approach for Mongolian Cyrillic text for machine translation. With this method, processing one-to-one word permutation between the variants of Mongolian and other languages, especially Altaic languages such as Japanese, becomes easier. Furthermore, it can be used for two-way conversion between Mongolian texts used in different regions and countries, such as Mongolia and China. Our system is implemented with DP (dynamic programming) matching supported by knowledge-based sequence matching, drawing on a multilingual dictionary and linguistic rule bank (LRB), and a data-driven approach using a target language corpus (TLC). For convenience, NM (New Mongolian) is treated as the source language, and TM (Traditional Mongolian) and Todo as the target languages in this test. Our application was tested on manually transcribed texts of 5,000 sentences parallel across NM, TM, and Todo. We found that our method achieves a transformation accuracy of 91.9% for “NM” to “TM” and 94.3% for “NM” to “Todo”. A minimal sketch of the DP matching step follows this entry.
    Download PDF (18744K)
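A minimal sketch, in Python, of DP (edit-distance) matching against a bilingual word list, as a stand-in for the knowledge-based conversion step described above. The dictionary contents and the nearest-neighbour fallback are illustrative assumptions, not the authors' actual LRB/TLC resources.

    # Minimal sketch: DP matching of a New Mongolian word against dictionary keys.
    def dp_distance(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                cur.append(min(prev[j] + 1,          # deletion
                               cur[j - 1] + 1,       # insertion
                               prev[j - 1] + cost))  # substitution
            prev = cur
        return prev[-1]

    def convert_word(nm_word: str, nm_to_tm: dict) -> str:
        """Return the TM form whose NM key is closest to the input under DP matching.

        Assumes a non-empty dictionary mapping NM spellings to TM spellings.
        """
        if nm_word in nm_to_tm:                      # exact dictionary hit
            return nm_to_tm[nm_word]
        best_key = min(nm_to_tm, key=lambda k: dp_distance(nm_word, k))
        return nm_to_tm[best_key]                    # nearest-neighbour fallback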
  • MASATOSHI TSUCHIYA, TOSHIYUKI WAKITA, AYU PURWARIANTI, SEIICHI NAKAGAWA
    2008 Volume 15 Issue 5 Pages 23-43
    Published: October 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
Cross-lingual language resources are necessary to realize cross-lingual natural language processing. A large translation dictionary is an especially important such resource; however, large dictionaries are available for only a few language pairs, and only small ones are available for most language pairs. We propose a novel method to expand a small existing translation dictionary into a large translation dictionary using a pivot language. Co-occurrence vectors in the source language and in the destination language are compared based on the small existing translation dictionary, and this comparison provides information for selecting appropriate translations among the translation candidates obtained from transitive translation through two translation dictionaries. Experiments that expand an Indonesian-Japanese dictionary using English as the pivot language show that the proposed method can improve the performance of a real CLIR system. A minimal sketch of the candidate-ranking step follows this entry.
    Download PDF (7294K)
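An illustrative Python sketch of pivot-based dictionary expansion: candidates are collected by transitive translation through a pivot language and ranked by co-occurrence-vector similarity. The data structures and the projection via the small seed dictionary are simplified assumptions, not the authors' exact formulation.

    import math
    from collections import defaultdict

    def cosine(u: dict, v: dict) -> float:
        """Cosine similarity between two sparse co-occurrence vectors."""
        dot = sum(w * v.get(c, 0.0) for c, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def map_to_target_space(src_vec: dict, seed_dict: dict) -> dict:
        """Project a source-language vector into target-language dimensions
        using the small seed translation dictionary (word -> list of translations)."""
        mapped = defaultdict(float)
        for src_ctx, weight in src_vec.items():
            for tgt_ctx in seed_dict.get(src_ctx, []):
                mapped[tgt_ctx] += weight
        return mapped

    def expand_entry(src_word, src_vecs, tgt_vecs, src_to_pivot, pivot_to_tgt, seed_dict):
        """Collect candidates via the pivot, then rank them by vector similarity."""
        candidates = {t for p in src_to_pivot.get(src_word, [])
                        for t in pivot_to_tgt.get(p, [])}
        mapped = map_to_target_space(src_vecs[src_word], seed_dict)
        return sorted(candidates,
                      key=lambda t: cosine(mapped, tgt_vecs.get(t, {})),
                      reverse=True)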
  • HIKARU YOKONO
    2008 Volume 15 Issue 5 Pages 45-71
    Published: October 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
Since a story consists of several scenes and topics, it is essential to grasp the relations between topics when summarizing a story. This means that producing a coherent summary is a key issue for an informative summary of a story. Against this background, this paper proposes a method to produce a coherent summary of a story by extracting (1) topic blocks, which consist of sentences likely written on the same topic, and (2) complement sentences, which are likely to express changes of scene. They are extracted on the basis of automatic topic recognition and identification of characters. Experimental results on summarizing 9 stories show that the proposed method produces summaries that are easier to follow than those of a tf·idf-based model; a sketch of such a baseline follows this entry.
    Download PDF (2624K)
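An illustrative Python sketch of a tf·idf-style extractive baseline of the kind the paper compares against: each sentence is scored by the summed tf·idf weight of its words and the top-ranked sentences are kept. Whitespace tokenization is an assumption here; this is not the author's proposed topic-block method.

    import math
    from collections import Counter

    def tfidf_summary(sentences: list, num_keep: int) -> list:
        """Return the num_keep highest-scoring sentences in their original order."""
        docs = [s.split() for s in sentences]
        df = Counter(w for d in docs for w in set(d))        # document frequency
        n = len(docs)
        def score(d):
            tf = Counter(d)
            return sum(tf[w] * math.log(n / df[w]) for w in tf)
        ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
        keep = sorted(ranked[:num_keep])                     # restore story order
        return [sentences[i] for i in keep]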
  • CHIKARA HASHIMOTO, SADAO KUROHASHI
    2008 Volume 15 Issue 5 Pages 73-97
    Published: October 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
For natural language understanding, it is essential to reveal semantic relations between words. To date, only the IS-A relation has been publicly available in the form of thesauri. Toward deeper natural language understanding, we semi-automatically constructed a domain dictionary that represents the domain relation of Japanese fundamental words. Our method does not require a document collection. As a task-based evaluation of the domain dictionary, we performed blog categorization, in which we assigned a domain to each word in a blog article and categorized the article under its most dominant domain; a minimal sketch of this step follows this entry. In doing so, we dynamically estimated the domains of unknown words, i.e., those not listed in the domain dictionary. As a result, our blog categorization achieved an accuracy of 94.0% (564/600), and the domain estimation technique for unknown words achieved an accuracy of 76.6% (383/500).
    Download PDF (2304K)
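A minimal Python sketch of the domain-based categorization step: each word in the article receives a domain from a domain dictionary, and the article is labeled with the most dominant domain. Tokenization, the dictionary contents, and the UNKNOWN fallback are placeholder assumptions.

    from collections import Counter

    def categorize_blog(words: list, domain_dict: dict) -> str:
        """Return the most frequent domain among words found in the dictionary."""
        counts = Counter(domain_dict[w] for w in words if w in domain_dict)
        return counts.most_common(1)[0][0] if counts else "UNKNOWN"

    # Hypothetical usage with a toy dictionary:
    # domain_dict = {"recipe": "COOKING", "oven": "COOKING", "goal": "SPORTS"}
    # categorize_blog(["recipe", "oven", "goal"], domain_dict)  # -> "COOKING"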
  • RYOHEI SASANO, SADAO KUROHASHI
    2008 Volume 15 Issue 5 Pages 99-118
    Published: October 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
We present a knowledge-rich approach to Japanese coreference resolution. In Japanese, noun phrase coreference occupies a central position among coreference relations. To improve coreference resolution for such a language, wide-coverage knowledge of synonyms is required. We first acquire knowledge of synonyms from a large raw corpus and dictionary definition sentences, and then resolve coreference relations based on this knowledge; a schematic sketch of the synonym-based matching follows this entry. Furthermore, to boost the performance of coreference resolution, we integrate a bridging reference resolution system that uses automatically constructed nominal case frames into the coreference resolver. We evaluated our approach on newspaper articles and a Web corpus and confirmed that the performance of coreference resolution is improved by using automatically acquired synonyms and bridging reference resolution.
    Download PDF (2152K)
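A schematic Python sketch of synonym-based mention matching for noun phrase coreference: two mentions are linked when their head words match or are listed as synonyms. The `synonyms` dictionary is a placeholder for the knowledge the authors acquire from raw corpora and dictionary definitions; the greedy chaining is an illustrative simplification.

    def corefer(head_a: str, head_b: str, synonyms: dict) -> bool:
        """Return True if two mention heads should be placed in one chain."""
        if head_a == head_b:                                  # string match
            return True
        return (head_b in synonyms.get(head_a, set())
                or head_a in synonyms.get(head_b, set()))     # synonym match

    def build_chains(mention_heads: list, synonyms: dict) -> list:
        """Greedily attach each mention to the most recent compatible chain."""
        chains = []
        for head in mention_heads:
            for chain in reversed(chains):                    # prefer recent chains
                if corefer(chain[-1], head, synonyms):
                    chain.append(head)
                    break
            else:
                chains.append([head])
        return chains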
  • Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama
    2008 Volume 15 Issue 5 Pages 119-150
    Published: October 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
Distributional similarity is a widely adopted concept for capturing the semantic relatedness of words based on their contexts in various NLP tasks. While accurate similarity calculation requires a huge number of context types and co-occurrences, the contribution to the similarity calculation differs among individual context types, and some of them even act as noise. To select well-performing contexts and alleviate the high computational cost, we propose and investigate the effectiveness of three context selection schemes: category-based, type-based, and co-occurrence-based selection. Category-based selection is the conventional and simplest selection method, which limits the context types based on their syntactic category. Finer-grained, type-based selection assigns an importance score to each context type, which we make possible by proposing a novel formalization of distributional similarity as a classification problem and applying feature selection techniques; a sketch of this pruning follows this entry. The finest-grained, co-occurrence-based selection assigns importance scores to each co-occurrence of a word and a context type. We evaluate the effectiveness and the trade-off between co-occurrence data size and synonym acquisition performance. Our experiments show that, on the whole, the finest-grained, co-occurrence-based selection achieves better performance, although some of the simple category-based selections show a comparable performance/cost trade-off.
    Download PDF (9940K)
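A Python sketch of type-based context selection for distributional similarity: only the highest-scoring context types are kept, and words are then compared by the cosine of their pruned co-occurrence vectors. The importance scores are assumed to come from some feature-selection step; the paper derives them from a classification formulation.

    import math

    def select_contexts(importance: dict, k: int) -> set:
        """Keep the k context types with the highest importance scores."""
        return set(sorted(importance, key=importance.get, reverse=True)[:k])

    def pruned_cosine(vec_a: dict, vec_b: dict, keep: set) -> float:
        """Cosine similarity restricted to the selected context types."""
        a = {c: w for c, w in vec_a.items() if c in keep}
        b = {c: w for c, w in vec_b.items() if c in keep}
        dot = sum(w * b.get(c, 0.0) for c, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0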
  • Yo EHARA, KUMIKO TANAKA-ISHII
    2008 Volume 15 Issue 5 Pages 151-167
    Published: October 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
Computer users increasingly need to produce text written in multiple languages. However, typical computer systems require the user to change the text entry software each time a different language is used. This is cumbersome, especially when the languages change frequently. To solve this problem, we propose TypeAny, a novel multilingual text entry system that identifies the language of the user's key entries and automatically dispatches the input to the appropriate text entry system. Language identification is modeled as a hidden Markov model whose probabilities are estimated using the PPM method; a simplified sketch follows this entry. In evaluating this method, we obtained a language identification accuracy of 96.7% when an appropriate language had to be chosen from among three languages. The number of control actions needed to switch languages was decreased by 93% when using TypeAny rather than a conventional method.
    Download PDF (4349K)
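A simplified Python sketch of keystroke-level language identification as an HMM: hidden states are languages, emissions are characters scored by a per-language character model, and Viterbi decoding recovers the language sequence. The `emit_logprob` callback stands in for the PPM-estimated probabilities used in the paper, and the uniform switch/stay penalties are illustrative assumptions.

    def viterbi(chars, langs, emit_logprob, switch_logprob, stay_logprob):
        """Return the most likely language label for each input character."""
        best = {l: emit_logprob(l, chars[0]) for l in langs}   # initial step
        backptrs = []
        for c in chars[1:]:
            step_best, step_back = {}, {}
            for l in langs:
                prev, score = max(
                    ((p, best[p] + (stay_logprob if p == l else switch_logprob))
                     for p in langs),
                    key=lambda x: x[1])
                step_best[l] = score + emit_logprob(l, c)
                step_back[l] = prev
            best = step_best
            backptrs.append(step_back)
        lang = max(best, key=best.get)           # best final state
        path = [lang]
        for step_back in reversed(backptrs):     # follow back pointers
            lang = step_back[lang]
            path.append(lang)
        return list(reversed(path))

    # Hypothetical usage: emit_logprob(lang, ch) would query a per-language
    # character model (PPM in the paper); here any function with that
    # signature returning log probabilities will do.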
  • MASAKAZU IWATATE, MASAYUKI ASAHARA, YUJI MATSUMOTO
    2008 Volume 15 Issue 5 Pages 169-185
    Published: October 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
In Japanese dependency parsing, Kudo's relative preference-based method outperforms both deterministic and probabilistic CFG-based parsing methods. In the relative preference-based method, a log-linear model estimates selectional preferences over all candidate heads, which cannot be considered in deterministic parsing methods. We propose an algorithm based on a tournament model, in which the selectional preferences are modeled directly by one-on-one games in a step-ladder tournament; a conceptual sketch follows this entry. In evaluation experiments with the Kyoto Text Corpus Version 4.0, the proposed method outperforms previous work, including the relative preference-based method.
    Download PDF (1786K)
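A conceptual Python sketch of the step-ladder tournament for selecting a dependency head: candidate heads play one-on-one games, and the survivor becomes the predicted head. The `prefers_later` callback stands in for the paper's trained binary classifier and is a placeholder assumption here.

    def choose_head(modifier, candidates, prefers_later):
        """Run a step-ladder tournament over candidate heads (nearest first).

        prefers_later(modifier, current, challenger) should return True when
        the challenger beats the current champion for this modifier.
        """
        champion = candidates[0]
        for challenger in candidates[1:]:
            if prefers_later(modifier, champion, challenger):
                champion = challenger          # challenger wins this game
        return champion

    # Hypothetical usage with a toy preference (always keep the nearer head):
    # choose_head("word", ["cand1", "cand2", "cand3"],
    #             lambda m, cur, chal: False)  # -> "cand1"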