Journal of Natural Language Processing

[title in Japanese]

[in Japanese]

2005Volume 12Issue 2 Pages 1-2
Published: March 31, 2005
Released on J-STAGE: March 01, 2011

DOIhttps://doi.org/10.5715/jnlp.12.2_1

JOURNAL FREE ACCESS

Download PDF (195K)
Automatic Occupation Coding with Machine Learning and Hand-Crafted Rules

KAZUKO TAKAHASHI, HIROYA TAKAMURA, MANABU OKUMURA

2005Volume 12Issue 2 Pages 3-23
Published: March 31, 2005
Released on J-STAGE: March 01, 2011

DOIhttps://doi.org/10.5715/jnlp.12.2_3

JOURNAL FREE ACCESS

Show abstractHide abstract

We apply a machine learning method to occupation coding, which is a task to categorize answers to open-ended questions about respondent's occupation.Specifically, we use Support Vector Machines (SVMs) and their combination with hand-crafted rules.Conducting occupation coding manually is expensive and sometimes leads to inconsistent coding results when coders are not experts in occupation coding. For this reason, a rule-based automatic method was developed and applied.However, its categorization performance was not satisfactory.Therefore, we adopt SVMs, which show high performance in various fields, and compare them with the rule-based method.We also investigate effective combination methods of SVMs and the rulebased method.We empirically show that SVMs outperform the rule-based method in occupation coding and that the combination of the two methods yields even better accuracy, and that the accuracy of each method increases if the part of the new samples is added to the training data.

View full abstract

Download PDF (2397K)
A Description Method of Syntactic Rules on Filmscripts

YOSHIAKI KUROSAWA, TAKUMI ICHIMURA, TERUAKI AIZAWA

2005Volume 12Issue 2 Pages 25-62
Published: March 31, 2005
Released on J-STAGE: March 01, 2011

DOIhttps://doi.org/10.5715/jnlp.12.2_25

JOURNAL FREE ACCESS

Show abstractHide abstract

The purpose of this paper is to propose a describing method of syntactic rules in order to analyze sentences on film scripts and manage the stored rules. Because fllm scripts include spoken language, which has different characteristics from written language, we may meet ellipses of various elements and inversion in the sentences. To analyze these sentences correctly by using a parser, some unique rules corresponding to these characteristics are required.In order to solve such problems, we develop a new method of parser to deal with a no-limitation description of syntactic rules with regular expressions.With regular expressions, this method allows us to describe the expression in a syntax rule such as “(auxiliary-verb I ending-particle) .” Because this expression is adapted regardless of whether ellipsis occurs, the method can deal with the ellipses without any difficulties. Because the expression can also be adapted to some combinations of many part-of-speeches, the number of rules should dramatically decrease.Under the basis of this idea, we developed a system with approximately 3, 000 syntactic rules, which are preliminary developed and checked by human, based on about 40, 000 sentences from 21 film scripts. These rules were in detail described to analyze various fashions of speech. In order to verify the effectiveness of the syntactic rules, we experimented a new set of other filmscripts. As a result, we obtained an analysis system with a high accuracy.And we found adopting regular expressions helped us reduce the cost to maintain syntactic rules because only one rule could express about 10 lists of part-of-speeches analyzed.

View full abstract

Download PDF (5750K)
Automatic Construction of Japanese-Chinese Translation Dictionary Using English as Intermediary

YUJIE ZHANG, QING MA, HITOSHI ISAHARA

2005Volume 12Issue 2 Pages 63-85
Published: March 31, 2005
Released on J-STAGE: March 01, 2011

DOIhttps://doi.org/10.5715/jnlp.12.2_63

JOURNAL FREE ACCESS

Show abstractHide abstract

Electronic translation dictionaries are indispensable language resources in machine translation, cross-language information retrieval, and e-learning.Many electronic dictionaries that translate between English and one other language have been thoroughly developed, while the ones between two languages other than English are still lacking.Because the construction of an electronic translation dictionary is usually expensive and time-consuming, a method must be developed for automatically constructing a translation dictionary for new language pairs by using the existing translation resources between English and other languages.This paper describes our research on constructing a Japanese-Chinese translation dictionary from two existing dictionaries, the EDR Japanese-English and LDC English-Chinese dictionaries, using English as intermediary.In automatic acquisition of bilingual lexicons, one major issue is how to select correct translations from among a large number of candidates. We have developed a method of ranking candidate translations by utilizing three sources of information, the number of English translations in common, the part of speech, and Japanese kanji information.We evaluated the method on 109 Japanese words, each of which has over 20 candidate Chinese translations.The proposed method achieved 81.4% precision.

View full abstract

Download PDF (2387K)
Analysis of the Effect of Title Pattern on the Reader's Interest

YASUKO SENDA, YASUSI SINOHARA, MANABU OKUMURA

2005Volume 12Issue 2 Pages 87-107
Published: March 31, 2005
Released on J-STAGE: June 07, 2011

DOIhttps://doi.org/10.5715/jnlp.12.2_87

JOURNAL FREE ACCESS

Show abstractHide abstract

In order to attract readers'interest, authors have to compose titles according to the properties of the readers.We so far clarified the typical content and wording of titles for the document which explains a newly developed technology in a comparative study between paper titles and newspaper headlines.In order to utilize the findings of the study on the content and wording for composing titles, we have to survey on the effects of the content and wording of titles on the readers'interest.In this paper, therefore, we report the results of a questionnaire survey on the title and the readers' interest.

View full abstract

Download PDF (6786K)
Gradual Fertilization of Case Frames

DAISUKE KAWAHARA, SADAO KUROHASHI

2005Volume 12Issue 2 Pages 109-131
Published: March 31, 2005
Released on J-STAGE: March 01, 2011

DOIhttps://doi.org/10.5715/jnlp.12.2_109

JOURNAL FREE ACCESS

Show abstractHide abstract

This paper proposes an automatic method of gradually constructing case frames. First, a large raw corpus is parsed, and initial case frames are constructed from reliable predicate-argument examples in the parsing results.Second, case analysis based on the initial case frames is applied to the large corpus, and the case frames are upgraded by incorporating newly acquired information.Case frames are gradually fertilized in this way.We constructed case frames from news paper articles of 26 years, consisting of 26M sentences.The case frames are evaluated by hand, and furthermore evaluated through syntactic and case analysis.These results presented the effectiveness of the constructed case frames.

View full abstract

Download PDF (2205K)
Alignment of lecture speech data and presentation documents based on discourse markers and text length

STATOSHI NAKAZAWA, KENJI SATOH, AKITOSHI OKUMURA

2005Volume 12Issue 2 Pages 133-156
Published: March 31, 2005
Released on J-STAGE: March 01, 2011

DOIhttps://doi.org/10.5715/jnlp.12.2_133

JOURNAL FREE ACCESS

Show abstractHide abstract

Keyword search for the certain scene in video data seems to be in great demand as well as text search.For the video search, a conventional approach is to apply speech recognition to video voice signals and use the results as a text index with time information. However, speech recognition has problems such as recognition errors and unknown words, and recognition results themselves do not work as a precise index. If there are detailed scripts or transcripts of a video available, it is possible to make a precise index synchronized with the video, by aligning the script and the speech recognition results, but not every video comes with detailed scripts.We would like to propose a new approach which enables to make a text index without detailed scripts but with presentation slides.We focus on lecture videos, and we will explain how to make a text index by aligning two different materials;speech recognition results and presentation slides.We align them by slide so that keyword search for lecture videos can be done by slide.

View full abstract

Download PDF (10895K)
Similarity Measures for Extracting KOOU Expressions from Corpora

EIKO YAMAMOTO, ATSUKO KIDA, KYOKO KANZAKI, HITOSHI ISAHARA

2005Volume 12Issue 2 Pages 157-174
Published: March 31, 2005
Released on J-STAGE: March 01, 2011

DOIhttps://doi.org/10.5715/jnlp.12.2_157

JOURNAL FREE ACCESS

Show abstractHide abstract

In this paper, we propose a similarity measure for extracting certain concord expression from many and large text corpora. The concord expression is called KOOU expression, which is particular to Japanese. KOOU expression consists of KO element and OU element. The KO element is called “declarative adverb” and can decide an expression of verb (OU element). In Japanese, using the KOOU expression, we are able to understand a sentence gradually. So far, the practicaldatabase of KOOU expressions does not exist. We attempt to extract the KOOU expressions. Then, we compare and evaluate seven similarity measures to establish a method of objectively and comprehensively extracting KOOU expressions. We make a judgment data with using a pooling method that is wellknown in information retrieval to evaluate extracted results. We pool top 500 of results for each measure and then judge them by human. From our experiment results, we report Yate's correction and Complementary Similarity Measure are adaptable for extracting KOOU expressionsfrom corpora.

View full abstract

Download PDF (1914K)
Improving Word Alignment Quality by Relearning Translation Models

SETSUO YAMADA, MASAAKI NAGATA, KENJI YAMADA

2005Volume 12Issue 2 Pages 175-188
Published: March 31, 2005
Released on J-STAGE: March 01, 2011

DOIhttps://doi.org/10.5715/jnlp.12.2_175

JOURNAL FREE ACCESS

Show abstractHide abstract

The statistical Machine Translation Model has two components: a language model and a translation model. This paper describes how to improve the quality of the translation model at the point of word alignment quality by using the common word pairs extracted by two asymmetric learning approaches. One set of word pairs is extracted by Viterbi alignment using a translation model, the other set is extracted by Viterbi alignment using another translation model created by reversing the languages. The common word pairs are extracted as the same word pairs in the two sets of word pairs. We conducted experiments using English and Japanese. Our method improves the quality of a original translation model by 5.7% independent of the training domain and the translation model.We also show that common word pairs are almost as useful as regular dictionary entries for training purposes. Moreover, we describe effects of the common word pairs by iterating our learning process and changing the number of learning data.

View full abstract

Download PDF (4089K)
Characteristics of Jananese Derivative Word Kana-Kanji Conversion using Generalized Examples and Thesaurus

NATSUKI ICHIMARU, TEIGO NAKAMURA, TORU HITAKA

2005Volume 12Issue 2 Pages 189-207
Published: March 31, 2005
Released on J-STAGE: March 01, 2011

DOIhttps://doi.org/10.5715/jnlp.12.2_189

JOURNAL FREE ACCESS

Show abstractHide abstract

This paper shows the effectiveness of applying our method to kana-kanji conversionof Japanese derivative words by experiments. In usual natural language processing, morphologic analysis precedes syntactic analysis.Meanwhile, each derivative word needs semantic analysis first, because it has inner structure composed of a noun and a suffix, and simultaneously acts as one word.Thus, in our method, we use a PCFG that consists of full size thesaurus, large number of examples generalized to every intermediate level, a part-of-speech level rule, and word level lexicalized rules. Moreover, we weight the frequencies of the examples to prioritize high-density region, and choose the best learning condition according to the result of investigation of characteristics curves in various situations.In some former researches, they have thought that applying thesaurus to syntactic analysis was not so effective.However, it seems that it was because of the lack of training data and improper use of generalization. With enough number of examples and optimum generalization, our method can achieve over 95%, higher accuracy than ever thought.

View full abstract

Download PDF (3493K)
Use of Multiple Documents as Evidence with Decreased Adding in a Japanese Question-answering System

Masaki Murata, Masao Utiyama, Hitoshi Isahara

2005Volume 12Issue 2 Pages 209-247
Published: March 31, 2005
Released on J-STAGE: March 01, 2011

DOIhttps://doi.org/10.5715/jnlp.12.2_209

JOURNAL FREE ACCESS

Show abstractHide abstract

We propose a new method of using multiple documents as evidence with decreased adding to improve the performance of question-answering systems.Sometimes, the answer to a question may be found in multiple documents.In such cases, using multiple documents to predict answers may generate better answers than using a single document.Our method therefore uses information from multiple documents, adding the scores of candidate answers extracted from various documents.However, because simply adding the scores can degrade the performance of question-answering systems, we add the scores with progressively decreasing weights to reduce the negative effect of simple adding.We carried out experiments using the Question-Answering Challenge (QAC) test collection.The results showed that our method produced a statistically significant improvement, with the degree of improvement ranging fro 0.05 to 0.14.These results, and the fact that our method is simple and easy to use, indicate its potential feasibility and utility in question-answering systems.Experiments comparing our decreased adding method with several previously proposed methods that use multiple documents showed that our method was more effective than these other methods.

View full abstract

Download PDF (3955K)

Register with J-STAGE for free!