Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 18, Issue 2
Displaying 1-7 of 7 articles from this issue
Preface
Paper
  • Shinsuke Mori, Tetsuro Sasada, Graham Neubig
    2011 Volume 18 Issue 2 Pages 71-87
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    In this paper, we first propose a language model based on pairs of a word and its input sequence. We then propose the notion of a stochastically tagged corpus to cope with tag estimation errors. Experiments conducted with kana-kanji converters showed that both ideas, the pair-based language model and the stochastically tagged corpus, improved conversion accuracy. We therefore conclude that both are effective in language modeling for the kana-kanji conversion task.
    Download PDF (461K)
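    As a rough sketch of the first idea above (the toy corpus, smoothing, and all names here are invented for illustration and are not taken from the paper), a pair-based language model treats each token as a joint (word, input-sequence) event, so a conversion candidate is scored by the probability of its pairs rather than of its words alone:

```python
from collections import defaultdict

# Hypothetical tagged corpus: each sentence is a list of
# (word, input-sequence) pairs, e.g. a kanji word with its kana reading.
corpus = [
    [("今日", "きょう"), ("は", "は"), ("晴れ", "はれ")],
    [("今日", "きょう"), ("は", "は"), ("雨", "あめ")],
]

# Unigram counts over (word, input-sequence) pairs.
counts = defaultdict(int)
total = 0
for sentence in corpus:
    for pair in sentence:
        counts[pair] += 1
        total += 1

def pair_prob(word, reading, alpha=0.1, vocab=1000):
    """Additively smoothed unigram probability of one (word, reading) pair."""
    return (counts[(word, reading)] + alpha) / (total + alpha * vocab)

def score(candidate):
    """Probability of a conversion candidate as a product of pair probabilities."""
    p = 1.0
    for word, reading in candidate:
        p *= pair_prob(word, reading)
    return p
```

    A candidate whose pairs were observed in the corpus scores higher than one built from unseen pairs with the same readings; the paper's model is an n-gram over such pairs rather than this unigram toy.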
  • Naoaki Okazaki, Jun’ichi Tsujii
    2011 Volume 18 Issue 2 Pages 89-117
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    This paper presents a simple and fast algorithm for approximate string matching in which string similarity is computed by set similarity measures such as the cosine, Dice, Jaccard, and overlap coefficients. In this study, strings are represented as unordered sets of arbitrary features (e.g., tri-grams). Deriving necessary and sufficient conditions for approximate string matching, we show that it is exactly solvable by a τ-overlap join. We propose the CPMerge algorithm, which solves the τ-overlap join efficiently by exploiting signatures in the query features and a pruning condition, and we describe implementation considerations. We measure query performance on three large-scale datasets of English person names, Japanese unigrams, and biomedical entity/concept names. The experimental results demonstrate that the proposed method outperforms state-of-the-art methods, including Locality Sensitive Hashing and DivideSkip, on all the datasets. We also analyze the behavior of the proposed method on these datasets. We distribute SimString, a library implementation of the proposed method, under an open-source license.
    Download PDF (739K)
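    The τ-overlap condition underlying this reduction can be sketched for the cosine measure as follows (a naive full scan for illustration only; CPMerge answers the same query efficiently with inverted lists and pruning, and the padding scheme here is our own assumption):

```python
import math

def trigrams(s):
    """Character tri-gram feature set of a string (with boundary padding)."""
    s = "##" + s + "#"
    return {s[i:i + 3] for i in range(len(s) - 2)}

def min_overlap(alpha, x, y):
    """Minimum |X ∩ Y| so that cosine(X, Y) = |X∩Y| / sqrt(|X||Y|) >= alpha."""
    return math.ceil(alpha * math.sqrt(x * y))

def cosine_match(query, strings, alpha=0.6):
    """Naive scan using the tau-overlap condition: a candidate matches
    iff its feature overlap with the query reaches the threshold tau."""
    q = trigrams(query)
    matches = []
    for s in strings:
        t = trigrams(s)
        if len(q & t) >= min_overlap(alpha, len(q), len(t)):
            matches.append(s)
    return matches
```

    Because cosine similarity of sets is |X∩Y|/√(|X||Y|), the threshold α translates exactly into a minimum overlap τ, which is what makes the join formulation lossless.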
  • Daisuke Kimura, Kumiko Tanaka-Ishii
    2011 Volume 18 Issue 2 Pages 119-137
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    This paper considers measures that remain constant for any sufficiently large length of a given natural language text. Such measures offer hints for studying the complexity of natural language. Previously, they have been studied mainly on relatively small English texts. In this work, we examine these measures on texts in languages other than English and on large-scale texts. Among the measures, we consider Yule’s K, Orlov’s Z, and Golcher’s VM, whose convergence has previously been argued empirically, as well as the entropy H and r, a measure related to scale-free networks. Our experiments show that both K and VM are convergent for texts in various languages, whereas the other measures are not.
    Download PDF (991K)
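    Of the measures listed, Yule’s K has a simple closed form over the frequency spectrum; a minimal sketch (the formula is the standard one, the toy inputs are ours):

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (sum_m m^2 * V(m) - N) / N^2, where N is the number
    of tokens and V(m) is the number of word types occurring exactly m times."""
    n = len(tokens)
    freq = Counter(tokens)        # type -> its frequency m
    vm = Counter(freq.values())   # m -> V(m), number of types with frequency m
    s2 = sum(m * m * v for m, v in vm.items())
    return 1e4 * (s2 - n) / (n * n)
```

    For the toy sequence a a b b c, V(2) = 2 and V(1) = 1, so K = 10^4 · (9 − 5) / 25 = 1600; because K depends only on relative frequencies of frequencies, it is a natural candidate for length-invariance.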
  • Shinsuke Mori, Hiroki Oda
    2011 Volume 18 Issue 2 Pages 139-152
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    In this paper we propose a new method for automatically segmenting a Japanese sentence into a word sequence. The main advantage of our method is that the segmenter, built in a maximum entropy framework, can refer to a list of compound words, i.e. word sequences without boundary information. This yields higher segmentation accuracy in many real situations where the only available electronic dictionaries have entries inconsistent with the word segmentation standard. Our method can also exploit a list of word sequences, which allows us to obtain a far greater accuracy gain at low manual annotation cost.
    We prepared segmented corpora, a compound word list, and a word sequence list, and conducted experiments comparing automatic word segmenters referring to various types of dictionaries. The results showed that the proposed word segmenter can exploit a list of compound words and word sequences to yield higher accuracy in realistic situations.
    Download PDF (346K)
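    As a rough illustration of how a word list without boundary information can still inform a point-wise boundary classifier (the function and feature names are ours, not the paper’s), each dictionary match contributes features to the character boundaries it covers, so the learned model can be discouraged from segmenting inside a known entry:

```python
def dictionary_features(sentence, dictionary):
    """For each internal boundary position i (between sentence[i-1] and
    sentence[i]), list the dictionary words whose match spans that boundary."""
    feats = {i: [] for i in range(1, len(sentence))}
    for word in dictionary:
        start = sentence.find(word)
        while start != -1:
            # Boundaries strictly inside the matched span get an "inside" feature.
            for i in range(start + 1, start + len(word)):
                feats[i].append(f"inside:{word}")
            start = sentence.find(word, start + 1)
    return feats
```

    Such features are only evidence, not constraints: the maximum entropy model weighs them against other features, which is what lets it tolerate dictionary entries inconsistent with the segmentation standard.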
  • Dittaya Wanvarie, Hiroya Takamura, Manabu Okumura
    2011 Volume 18 Issue 2 Pages 153-173
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    We propose an active learning framework for sequence labeling tasks. In each iteration, a set of subsequences is selected and manually labeled, while the remaining parts of the sequences are left unannotated. Learning stops automatically when the training data does not change significantly between consecutive iterations. We evaluate the proposed framework on the chunking and named entity recognition data provided by CoNLL. Experimental results show that we attain the fully supervised F1 with only 6.98% and 7.01% of the tokens annotated, respectively.
    Download PDF (555K)
Report
  • Chikara Hashimoto, Sadao Kurohashi, Daisuke Kawahara, Keiji Shinzato, ...
    2011 Volume 18 Issue 2 Pages 175-201
    Published: 2011
    Released on J-STAGE: September 28, 2011
    JOURNAL FREE ACCESS
    Recently there has been growing interest in technologies for information access and analysis targeting blog articles. To provide the research community with basic data, we constructed a blog corpus that consists of 249 articles (4,186 sentences) and has the following features: i) annotated with sentence boundaries; ii) annotated with grammatical information about morphology, dependency, case, anaphora, and named entities, in a way consistent with the Kyoto University Text Corpus; iii) annotated with sentiment information; iv) provided with HTML files that visualize all the annotations above. We asked 81 university students to write blog articles on one of four topics: sightseeing in Kyoto, cellphones, sports, or gourmet food. In constructing the annotated blog corpus, we faced problems concerning sentence boundaries, parentheses, errata, dialect, a variety of emoticons, and other morphological variations. In this paper, we describe the specification of the corpus and how we dealt with these problems.
    Download PDF (703K)