Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 18, Issue 2
Displaying 1-7 of 7 articles from this issue
Preface
Paper
  • Shinsuke Mori, Tetsuro Sasada, Graham Neubig
    2011 Volume 18 Issue 2 Pages 71-87
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    In this paper, we first propose a language model based on pairs of a word and its input sequence. We then propose the notion of a stochastically tagged corpus to cope with tag estimation errors. Experiments conducted with kana-kanji converters showed that both ideas, the pair-based language model and the stochastically tagged corpus, improved conversion accuracy. We therefore conclude that both are effective in language modeling for the kana-kanji conversion task.
    Download PDF (461K)
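    As a rough sketch of the first idea above (the toy corpus, smoothing, and all names here are invented for illustration and are not taken from the paper), a pair-based language model treats each token as a joint (word, input-sequence) event, so a conversion candidate is scored by the probability of its pairs rather than of its words alone:

```python
from collections import defaultdict

# Hypothetical tagged corpus: each sentence is a list of
# (word, input-sequence) pairs, e.g. a kanji word with its kana reading.
corpus = [
    [("今日", "きょう"), ("は", "は"), ("晴れ", "はれ")],
    [("今日", "きょう"), ("は", "は"), ("雨", "あめ")],
]

# Unigram counts over (word, input-sequence) pairs.
counts = defaultdict(int)
total = 0
for sentence in corpus:
    for pair in sentence:
        counts[pair] += 1
        total += 1

def pair_prob(word, reading, alpha=0.1, vocab=1000):
    """Additively smoothed unigram probability of one (word, reading) pair."""
    return (counts[(word, reading)] + alpha) / (total + alpha * vocab)

def score(candidate):
    """Probability of a conversion candidate as a product of pair probabilities."""
    p = 1.0
    for word, reading in candidate:
        p *= pair_prob(word, reading)
    return p
```

    A candidate whose pairs were observed in the corpus scores higher than one built from unseen pairs with the same readings; the paper's model is an n-gram over such pairs rather than this unigram toy.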
  • Naoaki Okazaki, Jun’ichi Tsujii
    2011 Volume 18 Issue 2 Pages 89-117
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    This paper presents a simple and fast algorithm for approximate string matching in which string similarity is computed by set similarity measures such as the cosine, Dice, Jaccard, and overlap coefficients. In this study, strings are represented as unordered sets of arbitrary features (e.g., tri-grams). Deriving necessary and sufficient conditions for approximate string matching, we show that it is exactly solvable by a τ-overlap join. We propose the CPMerge algorithm, which solves the τ-overlap join efficiently by exploiting signatures in the query features and a pruning condition, and we describe implementation considerations. We measure query performance on three large-scale datasets of English person names, Japanese unigrams, and biomedical entity/concept names. The experimental results demonstrate that the proposed method outperforms state-of-the-art methods, including Locality Sensitive Hashing and DivideSkip, on all the datasets. We also analyze the behavior of the proposed method on these datasets. We distribute SimString, a library implementation of the proposed method, under an open-source license.
    Download PDF (739K)
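    The τ-overlap condition underlying this reduction can be sketched for the cosine measure as follows (a naive full scan for illustration only; CPMerge answers the same query efficiently with inverted lists and pruning, and the padding scheme here is our own assumption):

```python
import math

def trigrams(s):
    """Character tri-gram feature set of a string (with boundary padding)."""
    s = "##" + s + "#"
    return {s[i:i + 3] for i in range(len(s) - 2)}

def min_overlap(alpha, x, y):
    """Minimum |X ∩ Y| so that cosine(X, Y) = |X∩Y| / sqrt(|X||Y|) >= alpha."""
    return math.ceil(alpha * math.sqrt(x * y))

def cosine_match(query, strings, alpha=0.6):
    """Naive scan using the tau-overlap condition: a candidate matches
    iff its feature overlap with the query reaches the threshold tau."""
    q = trigrams(query)
    matches = []
    for s in strings:
        t = trigrams(s)
        if len(q & t) >= min_overlap(alpha, len(q), len(t)):
            matches.append(s)
    return matches
```

    Because cosine similarity of sets is |X∩Y|/√(|X||Y|), the threshold α translates exactly into a minimum overlap τ, which is what makes the join formulation lossless.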
  • Daisuke Kimura, Kumiko Tanaka-Ishii
    2011 Volume 18 Issue 2 Pages 119-137
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    This paper considers measures that remain constant for any sufficiently large length of a given natural language text. Such measures offer hints for studying the complexity of natural language. Previously, they have been studied mainly on relatively small English texts. In this work, we examine these measures on texts in languages other than English and on large-scale texts. Among the measures, we consider Yule’s K, Orlov’s Z, and Golcher’s VM, whose convergence has previously been argued empirically, as well as the entropy H and r, a measure related to scale-free networks. Our experiments show that both K and VM are convergent for texts in various languages, whereas the other measures are not.
    Download PDF (991K)
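    Of the measures listed, Yule’s K has a simple closed form over the frequency spectrum; a minimal sketch (the formula is the standard one, the toy inputs are ours):

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (sum_m m^2 * V(m) - N) / N^2, where N is the number
    of tokens and V(m) is the number of word types occurring exactly m times."""
    n = len(tokens)
    freq = Counter(tokens)        # type -> its frequency m
    vm = Counter(freq.values())   # m -> V(m), number of types with frequency m
    s2 = sum(m * m * v for m, v in vm.items())
    return 1e4 * (s2 - n) / (n * n)
```

    For the toy sequence a a b b c, V(2) = 2 and V(1) = 1, so K = 10^4 · (9 − 5) / 25 = 1600; because K depends only on relative frequencies of frequencies, it is a natural candidate for length-invariance.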
  • Shinsuke Mori, Hiroki Oda
    2011 Volume 18 Issue 2 Pages 139-152
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    In this paper we propose a new method for automatically segmenting a Japanese sentence into a word sequence. The main advantage of our method is that the segmenter, built in a maximum entropy framework, can refer to a list of compound words, i.e. word sequences without boundary information. This yields higher segmentation accuracy in many real situations where the only available electronic dictionaries have entries inconsistent with the word segmentation standard. Our method can also exploit a list of word sequences, which allows us to obtain a far greater accuracy gain at low manual annotation cost.
    We prepared segmented corpora, a compound word list, and a word sequence list, and conducted experiments comparing automatic word segmenters referring to various types of dictionaries. The results showed that the proposed word segmenter can exploit a list of compound words and word sequences to yield higher accuracy in realistic situations.
    Download PDF (346K)
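    As a rough illustration of how a word list without boundary information can still inform a point-wise boundary classifier (the function and feature names are ours, not the paper’s), each dictionary match contributes features to the character boundaries it covers, so the learned model can be discouraged from segmenting inside a known entry:

```python
def dictionary_features(sentence, dictionary):
    """For each internal boundary position i (between sentence[i-1] and
    sentence[i]), list the dictionary words whose match spans that boundary."""
    feats = {i: [] for i in range(1, len(sentence))}
    for word in dictionary:
        start = sentence.find(word)
        while start != -1:
            # Boundaries strictly inside the matched span get an "inside" feature.
            for i in range(start + 1, start + len(word)):
                feats[i].append(f"inside:{word}")
            start = sentence.find(word, start + 1)
    return feats
```

    Such features are only evidence, not constraints: the maximum entropy model weighs them against other features, which is what lets it tolerate dictionary entries inconsistent with the segmentation standard.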
  • Dittaya Wanvarie, Hiroya Takamura, Manabu Okumura
    2011 Volume 18 Issue 2 Pages 153-173
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    We propose an active learning framework for sequence labeling tasks. In each iteration, a set of subsequences is selected and manually labeled, while the remaining parts of the sequences are left unannotated. Learning stops automatically when the training data does not change significantly between consecutive iterations. We evaluate the proposed framework on the chunking and named entity recognition data provided by CoNLL. Experimental results show that we attain the fully supervised F1 with only 6.98% and 7.01% of the tokens annotated, respectively.
    Download PDF (555K)
Report
  • Chikara Hashimoto, Sadao Kurohashi, Daisuke Kawahara, Keiji Shinzato, ...
    2011 Volume 18 Issue 2 Pages 175-201
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    Recently there has been growing interest in technologies for information access and analysis targeting blog articles. To provide the research community with basic data, we constructed a blog corpus that consists of 249 articles (4,186 sentences) and has the following features: i) annotated with sentence boundaries; ii) annotated with grammatical information about morphology, dependency, case, anaphora, and named entities, in a way consistent with the Kyoto University Text Corpus; iii) annotated with sentiment information; iv) provided with HTML files that visualize all the annotations above. We asked 81 university students to write blog articles on one of four topics: sightseeing in Kyoto, cellphones, sports, or gourmet food. In constructing the annotated blog corpus, we faced problems concerning sentence boundaries, parentheses, errata, dialect, a variety of emoticons, and other morphological variations. In this paper, we describe the specification of the corpus and how we dealt with these problems.
    Download PDF (703K)