Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 17, Issue 4
Preface
Paper
  • Kuniko Saito, Kenji Imamura
    2010 Volume 17 Issue 4 Pages 4_3-4_21
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    We present two techniques for reducing the machine learning cost, i.e., the cost of manually annotating unlabeled data, when adapting existing CRF-based named entity recognition (NER) systems to new texts or domains. We introduce the tag posterior probability as a confidence measure for each individual NE tag assigned by the base model. Dubious tags are automatically detected as recognition errors and become the targets of manual correction. Compared to the posterior probability of an entire sentence, the tag posterior probability has the advantage of minimizing correction cost by focusing only on those parts of a sentence that require manual correction. Using this tag confidence measure, the first technique, active learning, asks the editor to assign correct NE tags only to the parts that the base model could not tag confidently. Active learning reduces the learning cost by 66% compared to the conventional method. As the second technique, we propose bootstrapping NER, which semi-automatically corrects dubious tags and updates its model.
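    The tag-level confidence measure can be made concrete. Below is a minimal sketch (not the authors' code) that computes per-token marginal tag posteriors of a linear-chain CRF with the forward-backward algorithm and flags low-confidence positions for manual correction; the 0.9 threshold is a hypothetical parameter.
```python
import numpy as np
from scipy.special import logsumexp

def tag_posteriors(emissions, transitions):
    """emissions: (T, K) log-potentials per token/tag;
    transitions: (K, K) log-potentials between adjacent tags.
    Returns a (T, K) array of marginal posteriors P(tag_t = k | sentence)."""
    T, K = emissions.shape
    fwd = np.zeros((T, K))
    bwd = np.zeros((T, K))
    fwd[0] = emissions[0]
    for t in range(1, T):
        # fwd[t][k] = em[t][k] + logsum_j(fwd[t-1][j] + trans[j][k])
        fwd[t] = emissions[t] + logsumexp(fwd[t - 1][:, None] + transitions, axis=0)
    for t in range(T - 2, -1, -1):
        # bwd[t][j] = logsum_k(trans[j][k] + em[t+1][k] + bwd[t+1][k])
        bwd[t] = logsumexp(transitions + (emissions[t + 1] + bwd[t + 1])[None, :], axis=1)
    log_z = logsumexp(fwd[-1])
    return np.exp(fwd + bwd - log_z)

def dubious_positions(posteriors, threshold=0.9):
    # Tokens whose best tag has posterior below the threshold are
    # treated as likely recognition errors and sent to the editor.
    return [t for t, p in enumerate(posteriors.max(axis=1)) if p < threshold]
```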
  • Tomoya Nishina, Akira Utsumi
    2010 Volume 17 Issue 4 Pages 4_23-4_41
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    Many Web page clustering systems construct clusters in such a way that, for each extracted keyword, one cluster is built to contain all the pages in which that keyword occurs. However, these systems suffer from a serious problem: similar keywords are likely to generate similar clusters (i.e., clusters that share many Web pages), because the clustering method fails to take the topical similarity between keywords into account. To overcome this problem, this study proposes a new Web page clustering method that uses the topical similarity between words. The proposed method first extracts keywords that are dissimilar to each other, using distributional statistics of word occurrence in the snippets and titles of search results. Then, to reduce the number of unclassified Web pages, the method generates word groups, each of which is a set of words similar to one of the extracted keywords, and constructs Web page clusters from these word groups rather than directly from the keywords. This study also conducts an evaluation experiment on handmade test data in which our method is compared with an existing method that ignores keyword similarity. The results show that our system achieves better performance and overcomes the problem of multiple similar clusters.
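    The keyword-selection and word-grouping steps can be sketched as follows, assuming cosine similarity over snippet co-occurrence vectors and greedy selection; all thresholds and function names are illustrative, not the paper's.
```python
import math

def cooccurrence_vectors(snippets, vocab):
    # Distributional vector for each word: which snippets/titles it occurs in.
    vecs = {w: [0] * len(snippets) for w in vocab}
    for i, text in enumerate(snippets):
        for w in set(text.split()):
            if w in vecs:
                vecs[w][i] = 1
    return vecs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pick_dissimilar_keywords(candidates, vecs, k, max_sim=0.3):
    # Greedily keep candidates that are topically dissimilar to every
    # keyword chosen so far, so similar clusters are not generated.
    chosen = []
    for w in candidates:
        if all(cosine(vecs[w], vecs[c]) < max_sim for c in chosen):
            chosen.append(w)
        if len(chosen) == k:
            break
    return chosen

def word_groups(keywords, vocab, vecs, min_sim=0.5):
    # Expand each keyword into a group of similar words; pages are then
    # clustered by group membership rather than by a single keyword.
    return {kw: [w for w in vocab if cosine(vecs[kw], vecs[w]) >= min_sim]
            for kw in keywords}
```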
  • Yoshihiro Kokubu, Kouji Umekita, Eiichi Matsushita, Takashi Sueoka
    2010 Volume 17 Issue 4 Pages 4_43-4_57
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    As Consumer Generated Media (CGM) have spread, language processing technologies suited to such text have become necessary. Improved parsing precision is required both for retrieval by natural-language queries and for translation of such text data. We develop processing methods that can handle analysis errors caused by term variation and ambiguous sentence structures. Specifically, we propose using a thesaurus to determine the semantic distance between terms, and we have built a system that standardizes terms and normalizes syntactic dependencies. Further, we examine the internal structure of predicates to recover omitted subjects and to determine the “intention of a predicate”. When we analyze texts from “Yahoo! Chiebukuro”, precision improves by about 1% over the same system without the thesaurus. We also summarize the contents of the dictionaries our system uses.
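    The thesaurus-based semantic distance can be sketched as a path length through a concept hierarchy; the toy thesaurus and distance definition below are assumptions for illustration, not the paper's dictionaries.
```python
# Toy thesaurus: child term -> parent category (illustrative data only).
THESAURUS = {
    "laptop": "computer", "notebook PC": "computer",
    "computer": "machine", "machine": "artifact", "artifact": None,
}

def ancestors(term):
    chain = []
    while term is not None:
        chain.append(term)
        term = THESAURUS.get(term)
    return chain

def semantic_distance(a, b):
    # Path length through the lowest common ancestor; a small distance
    # suggests the terms can be standardized to one canonical form.
    pa, pb = ancestors(a), ancestors(b)
    common = next((x for x in pa if x in pb), None)
    if common is None:
        return float("inf")
    return pa.index(common) + pb.index(common)

print(semantic_distance("laptop", "notebook PC"))  # 2, via "computer"
```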
  • Yuichiroh Matsubayashi, Naoaki Okazaki, Jun’ichi Tsujii
    2010 Volume 17 Issue 4 Pages 4_59-4_89
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    A number of studies have applied machine-learning approaches to semantic role labeling, exploiting corpora such as FrameNet and PropBank. These corpora define frame-specific semantic roles for each frame, which is problematic for machine-learning approaches because the corpora contain many infrequent roles that hinder efficient learning. This paper focuses on the problem of generalizing semantic roles in a semantic role labeling task. We compare existing generalization criteria with our novel criteria and clarify the characteristics of each criterion. We also show that using multiple generalization criteria in a single model improves the performance of semantic role classification. In experiments on FrameNet, we achieved a 19.16% error reduction in total accuracy and 7.42% in macro-averaged F1. On PropBank, we reduced errors by 24.07% in total accuracy and by 26.39% in the evaluation on unseen verbs.
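    Using multiple generalization criteria in one model amounts to emitting overlapping features for each role; the sketch below illustrates the idea with two invented criteria and toy mappings (FrameNet's actual role hierarchy is not reproduced).
```python
ROLE_GROUPS = {      # criterion 1: hierarchy-based role grouping (toy data)
    "Buyer": "Agent", "Speaker": "Agent", "Goods": "Theme",
}
THEMATIC_TYPE = {    # criterion 2: coarse thematic type (toy data)
    "Buyer": "sentient", "Speaker": "sentient", "Goods": "physical",
}

def role_features(frame, role):
    # Emit the original frame-specific role plus one feature per
    # generalization criterion; the learner weights them jointly, so
    # infrequent roles share evidence with related frequent ones.
    feats = [f"role={frame}:{role}"]
    if role in ROLE_GROUPS:
        feats.append(f"group={ROLE_GROUPS[role]}")
    if role in THEMATIC_TYPE:
        feats.append(f"type={THEMATIC_TYPE[role]}")
    return feats

print(role_features("Commerce_buy", "Buyer"))
# ['role=Commerce_buy:Buyer', 'group=Agent', 'type=sentient']
```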
  • Kenichi Mishina, Seiji Tsuchiya, Motoyuki Suzuki, Fuji Ren
    2010 Volume 17 Issue 4 Pages 4_91-4_110
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    Example-based emotion estimators need an emotion corpus in which each sentence is assigned emotion tags. Because of the ambiguity of emotion, it is difficult to assign such tags consistently; as a result, a corpus contains some wrong tags, which degrades emotion estimation performance. To solve this problem, we propose a new similarity measure between an input sentence and the emotion corpus, based on the frequencies of morpheme N-grams in both the input sentence and the corpus. Experimental results show that the proposed method improves emotion estimation precision from 60.3% to 81.8%.
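    A minimal sketch of the flavor of such a similarity: each emotion is scored by the corpus frequencies of the input's morpheme N-grams, so sporadically mistagged sentences carry little weight. The scoring function below is illustrative, not the paper's exact formula.
```python
from collections import Counter

def ngrams(morphemes, n=2):
    # Morpheme N-grams of the input (bigrams by default).
    return [tuple(morphemes[i:i + n]) for i in range(len(morphemes) - n + 1)]

def emotion_scores(input_morphemes, corpus):
    # corpus: list of (morpheme_list, emotion_tag) pairs.
    freq = {}  # emotion tag -> Counter of N-gram frequencies
    for morphs, tag in corpus:
        freq.setdefault(tag, Counter()).update(ngrams(morphs))
    query = ngrams(input_morphemes)
    # Frequency-weighted overlap, normalized per emotion so that
    # heavily represented emotions do not dominate by size alone.
    return {tag: sum(c[g] for g in query) / (sum(c.values()) or 1)
            for tag, c in freq.items()}
```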
  • Osamu Mizuno, Masanobu Abe
    2010 Volume 17 Issue 4 Pages 4_111-4_129
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    The Multi-layered Speech/Sound Synthesis Control Language (MSCL) proposed herein facilitates the synthesis of several speech modes, such as nuance, mental state, and emotion, and allows speech to be synchronized with other media easily. MSCL is a multi-layered linguistic system encompassing three layers: the semantic level layer (S-layer), the interpretation level layer (I-layer), and the parameter level layer (P-layer). This multi-level description system is convenient for both laymen and professional users. Furthermore, we investigated mental-state tendencies using a perception test that examined subjects’ sensitivity to the control of synthetic speech prosody. The results reveal relationships between prosodic control rules and non-verbal expressions, which are useful for constructing semantic prosody control. This paper describes these functions and the effective prosodic feature controls possible with MSCL.
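    The three-layer idea can be sketched as follows, with invented notation (MSCL's actual syntax is not reproduced): a semantic-level tag expands into interpretation-level operations, which expand into concrete prosodic parameters.
```python
S_TO_I = {"angry": ["raise_pitch", "increase_rate"]}   # S-layer -> I-layer
I_TO_P = {"raise_pitch": {"f0_scale": 1.3},            # I-layer -> P-layer
          "increase_rate": {"duration_scale": 0.85}}   # (toy values)

def expand(semantic_tag):
    # Laymen write at the S-layer; professionals may override the
    # resulting P-layer parameters directly.
    params = {}
    for op in S_TO_I.get(semantic_tag, []):
        params.update(I_TO_P[op])
    return params

print(expand("angry"))  # {'f0_scale': 1.3, 'duration_scale': 0.85}
```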
  • Tetsuro Sasada, Shinsuke Mori, Tatsuya Kawahara
    2010 Volume 17 Issue 4 Pages 4_131-4_153
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    One significant problem for kana-kanji conversion (KKC) systems is unknown words. In this paper, to improve KKC accuracy, we propose a method for extracting unknown words, together with their pronunciations and contexts, from topically similar sets of Japanese text data and speech data. Unknown word candidates are extracted from the text data with a stochastic segmentation model, and their possible pronunciation entries are hypothesized. These entries are then verified by running automatic speech recognition (ASR) on audio data covering similar topics. The ASR output yields a corpus for training a stochastic model for KKC. In the experiments, we use automatically collected news articles and broadcast TV news covering similar topics. Evaluating our KKC back-end enhanced with these corpora on other Web news articles, we observed an improvement in accuracy.
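    A minimal sketch of the verification step, with hypothetical interfaces: hypothesized readings are kept only when ASR on topically similar audio actually produces the word with that reading often enough.
```python
from collections import Counter

def verify_candidates(candidates, asr_results, min_count=3):
    """candidates: {word: [hypothesized readings]};
    asr_results: (word, reading) pairs decoded from topic-matched audio.
    min_count is an assumed threshold, not the paper's setting."""
    observed = Counter(asr_results)
    lexicon = {}
    for word, readings in candidates.items():
        kept = [r for r in readings if observed[(word, r)] >= min_count]
        if kept:
            lexicon[word] = kept  # verified entries feed the KKC model
    return lexicon
```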
Report
  • Jin’ichi Murakami, Ryouta Kagami, Masato Tokuhisa, Satoru Ikehara
    2010 Volume 17 Issue 4 Pages 4_155-4_175
    Published: 2010
    Released on J-STAGE: June 09, 2011
    JOURNAL FREE ACCESS
    Recently, statistical machine translation (SMT) has become a very popular approach to machine translation. SMT uses a translation model and a language model calculated automatically from a large set of translation sentence pairs. The translation model provides the probability that a foreign string is the translation of a native string, and is normally realized as a phrase table. However, because the phrase table is built automatically, it has high coverage but low reliability. On the other hand, there are many translation word pairs made by hand, especially for Japanese-English translation; these have low coverage but high reliability. We therefore added such handmade translation word pairs to the automatically built phrase table. In this paper, we used 130,000 handmade translation word pairs. In the experiments, the phrase table with the added word pairs yielded a BLEU score of 13.4% for simple sentences and 8.5% for complex sentences, while the baseline system scored 12.5% and 7.7%, respectively. We also conducted an ABX test: for simple sentences, the baseline was judged better on 5 sentences and the proposed method on 23; for complex sentences, the baseline was judged better on 15 sentences and the proposed method on 35. These experiments demonstrate the effectiveness of the proposed method.
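    A minimal sketch of merging handmade word pairs into a Moses-style phrase table (“src ||| tgt ||| scores” lines); the fixed scores given to handmade entries are an assumption, not the paper's weighting.
```python
def merge_phrase_table(table_lines, word_pairs, score="1.0 1.0 1.0 1.0"):
    # table_lines: Moses-style entries "src ||| tgt ||| p1 p2 p3 p4".
    existing = {tuple(line.split(" ||| ")[:2]) for line in table_lines}
    merged = list(table_lines)
    for src, tgt in word_pairs:
        if (src, tgt) not in existing:
            # High-reliability handmade pairs get fixed scores so the
            # decoder can use them where the learned table has gaps.
            merged.append(f"{src} ||| {tgt} ||| {score}")
    return merged
```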