Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 17, Issue 4
Preface
Paper
  • Kuniko Saito, Kenji Imamura
    2010 Volume 17 Issue 4 Pages 4_3-4_21
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    We present two techniques for reducing the machine learning cost, i.e., the cost of manually annotating unlabeled data, when adapting existing CRF-based named entity recognition (NER) systems to new texts or domains. We introduce the tag posterior probability as a confidence measure for each individual NE tag assigned by the base model. Dubious tags are automatically detected as recognition errors and become the targets of manual correction. Compared to the posterior probability of an entire sentence, the tag posterior probability has the advantage of minimizing correction cost by focusing only on those parts of a sentence that require manual correction. Using this tag confidence measure, the first technique, active learning, asks the editor to assign correct NE tags only to the parts that the base model could not tag confidently. Active learning reduces the learning cost by 66% compared to the conventional method. As the second technique, we propose bootstrapping NER, which semi-automatically corrects dubious tags and updates its model.
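    The tag-level confidence measure can be made concrete. Below is a minimal sketch (not the authors' code) that computes per-token marginal tag posteriors of a linear-chain CRF with the forward-backward algorithm and flags low-confidence positions for manual correction; the 0.9 threshold is a hypothetical parameter.
```python
import numpy as np
from scipy.special import logsumexp

def tag_posteriors(emissions, transitions):
    """emissions: (T, K) log-potentials per token/tag;
    transitions: (K, K) log-potentials between adjacent tags.
    Returns a (T, K) array of marginal posteriors P(tag_t = k | sentence)."""
    T, K = emissions.shape
    fwd = np.zeros((T, K))
    bwd = np.zeros((T, K))
    fwd[0] = emissions[0]
    for t in range(1, T):
        # fwd[t][k] = em[t][k] + logsum_j(fwd[t-1][j] + trans[j][k])
        fwd[t] = emissions[t] + logsumexp(fwd[t - 1][:, None] + transitions, axis=0)
    for t in range(T - 2, -1, -1):
        # bwd[t][j] = logsum_k(trans[j][k] + em[t+1][k] + bwd[t+1][k])
        bwd[t] = logsumexp(transitions + (emissions[t + 1] + bwd[t + 1])[None, :], axis=1)
    log_z = logsumexp(fwd[-1])
    return np.exp(fwd + bwd - log_z)

def dubious_positions(posteriors, threshold=0.9):
    # Tokens whose best tag has posterior below the threshold are
    # treated as likely recognition errors and sent to the editor.
    return [t for t, p in enumerate(posteriors.max(axis=1)) if p < threshold]
```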
  • Tomoya Nishina, Akira Utsumi
    2010 Volume 17 Issue 4 Pages 4_23-4_41
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    Many Web page clustering systems construct clusters in such a way that, for each extracted keyword, one cluster is built to contain all the pages in which that keyword occurs. However, these systems suffer from a serious problem: similar keywords are likely to generate similar clusters (i.e., clusters that share many Web pages), because the clustering method fails to take the topical similarity between keywords into account. To overcome this problem, this study proposes a new Web page clustering method that uses the topical similarity between words. The proposed method first extracts keywords that are dissimilar to each other, using distributional statistics of word occurrence in the snippets and titles of search results. Then, to reduce the number of unclassified Web pages, the method generates word groups, each of which is a set of words similar to one of the extracted keywords, and constructs Web page clusters from these word groups rather than directly from the keywords. This study also conducts an evaluation experiment on handmade test data in which our method is compared with an existing method that ignores keyword similarity. The results show that our system achieves better performance and overcomes the problem of multiple similar clusters.
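    The keyword-selection and word-grouping steps can be sketched as follows, assuming cosine similarity over snippet co-occurrence vectors and greedy selection; all thresholds and function names are illustrative, not the paper's.
```python
import math

def cooccurrence_vectors(snippets, vocab):
    # Distributional vector for each word: which snippets/titles it occurs in.
    vecs = {w: [0] * len(snippets) for w in vocab}
    for i, text in enumerate(snippets):
        for w in set(text.split()):
            if w in vecs:
                vecs[w][i] = 1
    return vecs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pick_dissimilar_keywords(candidates, vecs, k, max_sim=0.3):
    # Greedily keep candidates that are topically dissimilar to every
    # keyword chosen so far, so similar clusters are not generated.
    chosen = []
    for w in candidates:
        if all(cosine(vecs[w], vecs[c]) < max_sim for c in chosen):
            chosen.append(w)
        if len(chosen) == k:
            break
    return chosen

def word_groups(keywords, vocab, vecs, min_sim=0.5):
    # Expand each keyword into a group of similar words; pages are then
    # clustered by group membership rather than by a single keyword.
    return {kw: [w for w in vocab if cosine(vecs[kw], vecs[w]) >= min_sim]
            for kw in keywords}
```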
  • Yoshihiro Kokubu, Kouji Umekita, Eiichi Matsushita, Takashi Sueoka
    2010 Volume 17 Issue 4 Pages 4_43-4_57
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    As Consumer Generated Media (CGM) have spread, language processing technologies suited to such text have become necessary. Improved parsing precision is required both for retrieval by natural-language queries and for translation of such text data. We develop processing methods that can handle analysis errors caused by term variation and ambiguous sentence structures. Specifically, we propose using a thesaurus to determine the semantic distance between terms, and we have built a system that standardizes terms and normalizes syntactic dependencies. Further, we examine the internal structure of predicates to recover omitted subjects and to determine the “intention of a predicate”. When we analyze texts from “Yahoo! Chiebukuro”, precision improves by about 1% over the same system without the thesaurus. We also summarize the contents of the dictionaries our system uses.
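    The thesaurus-based semantic distance can be sketched as a path length through a concept hierarchy; the toy thesaurus and distance definition below are assumptions for illustration, not the paper's dictionaries.
```python
# Toy thesaurus: child term -> parent category (illustrative data only).
THESAURUS = {
    "laptop": "computer", "notebook PC": "computer",
    "computer": "machine", "machine": "artifact", "artifact": None,
}

def ancestors(term):
    chain = []
    while term is not None:
        chain.append(term)
        term = THESAURUS.get(term)
    return chain

def semantic_distance(a, b):
    # Path length through the lowest common ancestor; a small distance
    # suggests the terms can be standardized to one canonical form.
    pa, pb = ancestors(a), ancestors(b)
    common = next((x for x in pa if x in pb), None)
    if common is None:
        return float("inf")
    return pa.index(common) + pb.index(common)

print(semantic_distance("laptop", "notebook PC"))  # 2, via "computer"
```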
  • Yuichiroh Matsubayashi, Naoaki Okazaki, Jun’ichi Tsujii
    2010 Volume 17 Issue 4 Pages 4_59-4_89
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    A number of studies have applied machine-learning approaches to semantic role labeling, exploiting corpora such as FrameNet and PropBank. These corpora define frame-specific semantic roles for each frame, which is problematic for machine-learning approaches because the corpora contain many infrequent roles that hinder efficient learning. This paper focuses on the problem of generalizing semantic roles in a semantic role labeling task. We compare existing generalization criteria with our novel criteria and clarify the characteristics of each criterion. We also show that using multiple generalization criteria in a single model improves the performance of semantic role classification. In experiments on FrameNet, we achieved a 19.16% error reduction in total accuracy and 7.42% in macro-averaged F1. On PropBank, we reduced errors by 24.07% in total accuracy and by 26.39% in the evaluation on unseen verbs.
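    Using multiple generalization criteria in one model amounts to emitting overlapping features for each role; the sketch below illustrates the idea with two invented criteria and toy mappings (FrameNet's actual role hierarchy is not reproduced).
```python
ROLE_GROUPS = {      # criterion 1: hierarchy-based role grouping (toy data)
    "Buyer": "Agent", "Speaker": "Agent", "Goods": "Theme",
}
THEMATIC_TYPE = {    # criterion 2: coarse thematic type (toy data)
    "Buyer": "sentient", "Speaker": "sentient", "Goods": "physical",
}

def role_features(frame, role):
    # Emit the original frame-specific role plus one feature per
    # generalization criterion; the learner weights them jointly, so
    # infrequent roles share evidence with related frequent ones.
    feats = [f"role={frame}:{role}"]
    if role in ROLE_GROUPS:
        feats.append(f"group={ROLE_GROUPS[role]}")
    if role in THEMATIC_TYPE:
        feats.append(f"type={THEMATIC_TYPE[role]}")
    return feats

print(role_features("Commerce_buy", "Buyer"))
# ['role=Commerce_buy:Buyer', 'group=Agent', 'type=sentient']
```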
  • Kenichi Mishina, Seiji Tsuchiya, Motoyuki Suzuki, Fuji Ren
    2010 Volume 17 Issue 4 Pages 4_91-4_110
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    Example-based emotion estimators need an emotion corpus in which each sentence is assigned emotion tags. Because of the ambiguity of emotion, it is difficult to assign such tags consistently; as a result, a corpus contains some wrong tags, which degrades emotion estimation performance. To solve this problem, we propose a new similarity measure between an input sentence and the emotion corpus, based on the frequencies of morpheme N-grams in both the input sentence and the corpus. Experimental results show that the proposed method improves emotion estimation precision from 60.3% to 81.8%.
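    A minimal sketch of the flavor of such a similarity: each emotion is scored by the corpus frequencies of the input's morpheme N-grams, so sporadically mistagged sentences carry little weight. The scoring function below is illustrative, not the paper's exact formula.
```python
from collections import Counter

def ngrams(morphemes, n=2):
    # Morpheme N-grams of the input (bigrams by default).
    return [tuple(morphemes[i:i + n]) for i in range(len(morphemes) - n + 1)]

def emotion_scores(input_morphemes, corpus):
    # corpus: list of (morpheme_list, emotion_tag) pairs.
    freq = {}  # emotion tag -> Counter of N-gram frequencies
    for morphs, tag in corpus:
        freq.setdefault(tag, Counter()).update(ngrams(morphs))
    query = ngrams(input_morphemes)
    # Frequency-weighted overlap, normalized per emotion so that
    # heavily represented emotions do not dominate by size alone.
    return {tag: sum(c[g] for g in query) / (sum(c.values()) or 1)
            for tag, c in freq.items()}
```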
  • Osamu Mizuno, Masanobu Abe
    2010 Volume 17 Issue 4 Pages 4_111-4_129
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    The Multi-layered Speech/Sound Synthesis Control Language (MSCL) proposed herein facilitates the synthesis of several speech modes, such as nuance, mental state, and emotion, and allows speech to be synchronized with other media easily. MSCL is a multi-layered linguistic system encompassing three layers: the semantic level layer (S-layer), the interpretation level layer (I-layer), and the parameter level layer (P-layer). This multi-level description system is convenient for both laymen and professional users. Furthermore, we investigated mental-state tendencies using a perception test that examined subjects’ sensitivity to the control of synthetic speech prosody. The results reveal relationships between prosodic control rules and non-verbal expressions, which are useful for constructing semantic prosody control. This paper describes these functions and the effective prosodic feature controls possible with MSCL.
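    The three-layer idea can be sketched as follows, with invented notation (MSCL's actual syntax is not reproduced): a semantic-level tag expands into interpretation-level operations, which expand into concrete prosodic parameters.
```python
S_TO_I = {"angry": ["raise_pitch", "increase_rate"]}   # S-layer -> I-layer
I_TO_P = {"raise_pitch": {"f0_scale": 1.3},            # I-layer -> P-layer
          "increase_rate": {"duration_scale": 0.85}}   # (toy values)

def expand(semantic_tag):
    # Laymen write at the S-layer; professionals may override the
    # resulting P-layer parameters directly.
    params = {}
    for op in S_TO_I.get(semantic_tag, []):
        params.update(I_TO_P[op])
    return params

print(expand("angry"))  # {'f0_scale': 1.3, 'duration_scale': 0.85}
```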
  • Tetsuro Sasada, Shinsuke Mori, Tatsuya Kawahara
    2010 Volume 17 Issue 4 Pages 4_131-4_153
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    One significant problem for kana-kanji conversion (KKC) systems is unknown words. In this paper, to improve KKC accuracy, we propose a method for extracting unknown words, together with their pronunciations and contexts, from topically similar sets of Japanese text data and speech data. Unknown word candidates are extracted from the text data with a stochastic segmentation model, and their possible pronunciation entries are hypothesized. These entries are then verified by running automatic speech recognition (ASR) on audio data covering similar topics. The ASR output yields a corpus for training a stochastic model for KKC. In the experiments, we use automatically collected news articles and broadcast TV news covering similar topics. Evaluating our KKC back-end enhanced with these corpora on other Web news articles, we observed an improvement in accuracy.
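    A minimal sketch of the verification step, with hypothetical interfaces: hypothesized readings are kept only when ASR on topically similar audio actually produces the word with that reading often enough.
```python
from collections import Counter

def verify_candidates(candidates, asr_results, min_count=3):
    """candidates: {word: [hypothesized readings]};
    asr_results: (word, reading) pairs decoded from topic-matched audio.
    min_count is an assumed threshold, not the paper's setting."""
    observed = Counter(asr_results)
    lexicon = {}
    for word, readings in candidates.items():
        kept = [r for r in readings if observed[(word, r)] >= min_count]
        if kept:
            lexicon[word] = kept  # verified entries feed the KKC model
    return lexicon
```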
Report
  • Jin’ichi Murakami, Ryouta Kagami, Masato Tokuhisa, Satoru Ikehara
    2010 Volume 17 Issue 4 Pages 4_155-4_175
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    Recently, statistical machine translation (SMT) has become a very popular approach to machine translation. SMT uses a translation model and a language model calculated automatically from a large set of translation sentence pairs. The translation model provides the probability that a foreign string is the translation of a native string, and is normally realized as a phrase table. However, because the phrase table is built automatically, it has high coverage but low reliability. On the other hand, there are many translation word pairs made by hand, especially for Japanese-English translation; these have low coverage but high reliability. We therefore added such handmade translation word pairs to the automatically built phrase table. In this paper, we used 130,000 handmade translation word pairs. In the experiments, the phrase table with the added word pairs yielded a BLEU score of 13.4% for simple sentences and 8.5% for complex sentences, while the baseline system scored 12.5% and 7.7%, respectively. We also conducted an ABX test: for simple sentences, the baseline was judged better on 5 sentences and the proposed method on 23; for complex sentences, the baseline was judged better on 15 sentences and the proposed method on 35. These experiments demonstrate the effectiveness of the proposed method.
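    A minimal sketch of merging handmade word pairs into a Moses-style phrase table (“src ||| tgt ||| scores” lines); the fixed scores given to handmade entries are an assumption, not the paper's weighting.
```python
def merge_phrase_table(table_lines, word_pairs, score="1.0 1.0 1.0 1.0"):
    # table_lines: Moses-style entries "src ||| tgt ||| p1 p2 p3 p4".
    existing = {tuple(line.split(" ||| ")[:2]) for line in table_lines}
    merged = list(table_lines)
    for src, tgt in word_pairs:
        if (src, tgt) not in existing:
            # High-reliability handmade pairs get fixed scores so the
            # decoder can use them where the learned table has gaps.
            merged.append(f"{src} ||| {tgt} ||| {score}")
    return merged
```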