Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 15, Issue 2
  • [in Japanese]
    2008 Volume 15 Issue 2 Pages 1-2
    Published: April 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
  • KAZUKO TAKAHASHI, HIROYA TAKAMURA, MANABU OKUMURA
    2008 Volume 15 Issue 2 Pages 3-38
    Published: April 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    We propose a method for estimating the class membership probability of the predicted class in multiclass document classification, using the scores output by a classifier (classification scores) not only for the predicted class but also for the other classes. Class membership probabilities are important in many applications of document classification, in which multiclass classification is often applied. We propose two methods for estimating class membership probabilities from multiple scores. The first generates an accuracy table with a smoothing method such as the moving average or a moving average with coverage, and estimates class membership probabilities indirectly by referring to the accuracy table. The second applies a logistic regression whose parameters are estimated beforehand, and estimates these probabilities directly. Through experiments on two different datasets with both Support Vector Machine and Naive Bayes classifiers, we show that using multiple classification scores is highly effective in both methods. We also show that the proposed smoothing method for the accuracy table works quite well, and that the method applying logistic regression is more stable. Moreover, the class membership probabilities estimated by the proposed method are useful for detecting misclassified samples.
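The accuracy-table idea in the abstract above can be illustrated with a minimal sketch. It assumes a single classification score per sample, equal-width score bins, and a plain centered moving average. The abstract's method uses multiple scores and also a "moving average with coverage", neither of which is reproduced here. The function names, the binning scheme, and the window size are hypothetical choices made for illustration only.

```python
from typing import List, Optional, Tuple


def bin_index(score: float, n_bins: int, lo: float, hi: float) -> int:
    """Map a score to an equal-width bin, clamping out-of-range scores."""
    width = (hi - lo) / n_bins
    return max(0, min(int((score - lo) / width), n_bins - 1))


def build_accuracy_table(
    samples: List[Tuple[float, bool]], n_bins: int, lo: float, hi: float
) -> List[Optional[float]]:
    """Per-bin fraction of correct predictions; None marks an empty bin."""
    correct = [0] * n_bins
    total = [0] * n_bins
    for score, is_correct in samples:
        idx = bin_index(score, n_bins, lo, hi)
        total[idx] += 1
        correct[idx] += int(is_correct)
    return [c / t if t else None for c, t in zip(correct, total)]


def moving_average(
    table: List[Optional[float]], window: int = 1
) -> List[Optional[float]]:
    """Smooth each cell with the mean of the non-empty cells within `window` bins."""
    smoothed: List[Optional[float]] = []
    for i in range(len(table)):
        cells = table[max(0, i - window): i + window + 1]
        values = [v for v in cells if v is not None]
        smoothed.append(sum(values) / len(values) if values else None)
    return smoothed


def estimate_probability(
    score: float, table: List[Optional[float]], lo: float, hi: float
) -> Optional[float]:
    """Look up the (smoothed) accuracy of the bin that the score falls into."""
    return table[bin_index(score, len(table), lo, hi)]
```

In use, the table would be built from held-out samples of (score of the predicted class, whether the prediction was correct), smoothed once, and then consulted for each new prediction. Smoothing matters because sparsely populated bins otherwise give unreliable estimates.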
  • AKIRA TERADA, MINORU YOSHIDA, HIROSHI NAKAGAWA
    2008 Volume 15 Issue 2 Pages 39-58
    Published: April 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Identifying synonyms is a necessary step in text processing tasks such as information retrieval and text mining. We can expect to improve the efficiency and performance of text processing by constructing a synonym dictionary. The same word may be used with a different meaning in a different field, so a synonym dictionary has to be constructed for each field. In some fields in Japanese, such as aviation, synonymous nouns include words written in kanji/hiragana, katakana, and the alphabet, as well as their abbreviations. Many of these words are not registered in a general dictionary. In addition, because new words are constantly coming into use, updating the dictionary is a major issue.
    In this paper, we propose a system for constructing a synonym dictionary. The system returns synonym candidates in descending order of similarity to a query. A synonym can then be registered in the dictionary easily by looking through the synonym candidates generated by the system. We define context information as the frequencies of the words appearing around a target word, and calculate similarity as the cosine measure over this context information. We confirmed that system performance improved remarkably when the system was provided with a known synonym set for context word nominalization, especially when the performance was otherwise low. We evaluated the system experimentally on aviation safety reports in Japanese, using average precision, and obtained promising results.
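The similarity computation described in the abstract above (context word frequencies compared with the cosine measure, candidates returned in descending order of similarity) can be sketched as follows. This is an illustration under stated assumptions, not the authors' system: the input is assumed to be pre-tokenized sentences, the context is assumed to be a fixed symmetric window of neighbouring tokens, and the step that uses a known synonym set is omitted. All function names and the `window` and `top_n` parameters are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def context_vectors(
    sentences: List[List[str]], window: int = 2
) -> Dict[str, Counter]:
    """Count, for every word, the words appearing within `window` tokens of it."""
    vectors: Dict[str, Counter] = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            left = tokens[max(0, i - window): i]
            right = tokens[i + 1: i + 1 + window]
            vectors.setdefault(word, Counter()).update(left + right)
    return vectors


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def synonym_candidates(
    query: str, vectors: Dict[str, Counter], top_n: int = 5
) -> List[Tuple[str, float]]:
    """Return up to `top_n` other words, sorted by descending similarity to `query`."""
    if query not in vectors:
        return []
    scored = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

A person maintaining the dictionary would call `synonym_candidates` for a query term and review the ranked list, which matches the workflow the abstract describes. Words that occur in similar contexts score close to 1, while words sharing few context words score close to 0.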
  • TOSHIAKI NAKAZAWA, SADAO KUROHASHI
    2008 Volume 15 Issue 2 Pages 59-74
    Published: April 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In this paper, we propose a novel method for measuring the consistency of an alignment as a whole. It is based on probabilistic features, using dependency type distance and a distance-score function. Because the method is based on tree structure, it successfully captures the linguistic differences between the source and target languages. Moreover, it can select appropriate correspondences from among the correspondence candidates. We conduct experiments on a Japanese-English newspaper corpus and achieve reasonably high accuracy compared with other language pairs that have fewer linguistic differences.
  • SUGURU MATSUYOSHI, SATOSHI SATO
    2008 Volume 15 Issue 2 Pages 75-99
    Published: April 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Automatic paraphrasing is the transformation of expressions into semantically equivalent expressions within one language. To generate a wider variety of phrasal paraphrases in Japanese, it is necessary to paraphrase functional expressions as well as content expressions. We propose a method for paraphrasing Japanese functional expressions under style and readability specifications using a dictionary with two hierarchies: a morphological hierarchy and a semantic hierarchy. A remarkable characteristic of Japanese functional expressions is that each has many different variants, and each variant has one of four styles. A system that paraphrases Japanese functional expressions should accept a style specification, because style must be used consistently. At the same time, controlling the readability of generated text is important in several applications, such as reading aids, because functional expressions are critical units that determine sentence structure and meaning. In an open test, our system generates appropriate alternative expressions for 79% of the source phrases in Japanese.
  • YASUTAKA SAWAI, KAZUHIDE YAMAMOTO
    2008 Volume 15 Issue 2 Pages 101-136
    Published: April 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    We propose a new measure for estimating the level of public interest in a given document. Although personal interests vary greatly, public interest, that is, the collection of personal interests, is consistent to some extent regardless of differences in time. The task here is not to determine whether a given document is of interest, but to determine how much interest it attracts, which we expect will enable deeper interest analysis using our measure. This problem has many applications, such as controlling the display of documents on the Web, which are assumed to be seen by the public. In this paper we use a document collection with ranking information in terms of public interest. We estimate the level of interest for each word, and then for each document, by utilizing the ranking information. We use three kinds of feature sets: content words, compound words, and the combination of the two. In the evaluation we use newspaper rankings as the source, and evaluate performance by comparing our output with the real ranking. The results show that the extended rank coefficient between these two rankings is 0.867. We also show that an accuracy of more than 0.90 is attained in rejecting documents of little interest.
  • Irena Srdanovic Erjavec, Tomaz Erjavec, Adam Kilgarriff
    2008 Volume 15 Issue 2 Pages 137-159
    Published: April 10, 2008
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Among the major world languages, Japanese is lagging behind in terms of publicly accessible and searchable corpora. In this paper we describe the development of JpWaC (Japanese Web as Corpus), a large corpus of 400 million words of Japanese web text, and its encoding for the Sketch Engine. The Sketch Engine is a web-based corpus query tool that supports fast concordancing, grammatical processing, ‘word sketching’ (one-page summaries of a word's grammatical and collocational behaviour), a distributional thesaurus, and robot use. We describe the steps taken to gather and process the corpus and to establish its validity, in terms of the kinds of language it contains. We then describe the development of a shallow grammar for Japanese to enable word sketching. We believe that the Japanese web corpus as loaded into the Sketch Engine will be a useful resource for a wide range of Japanese researchers, learners, and NLP developers.