Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 12, Issue 4
Displaying 1-13 of 13 articles from this issue
  • [in Japanese]
    2005 Volume 12 Issue 4 Pages 1-2
    Published: August 26, 2005
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (163K)
  • MASAO UTIYAMA, MIDORI TANIMURA, HITOSHI ISAHARA
    2005 Volume 12 Issue 4 Pages 3-19
    Published: August 26, 2005
    Released on J-STAGE: June 07, 2011
    JOURNAL FREE ACCESS
    English reading materials are abundant on the Internet. However, it is still difficult to select proper materials to organize courseware that can be used throughout a one-semester English reading course. We propose a method for constructing courseware from a target vocabulary and a corpus. This method is designed to extract a minimal set of articles (from the corpus) that contains the vocabulary. We applied the method to TOEIC (Test of English for International Communication) vocabulary and The Daily Yomiuri newspaper articles. The constructed courseware consisted of articles with dense occurrences of the target TOEIC vocabulary. The degree of denseness was measured by comparing various statistics of the courseware with those of randomly sampled articles. It was found that the courseware was efficient in presenting the vocabulary to students through reading. It is also used in English classes at one university as supplementary material and has been shown to be promising.
    Download PDF (1857K)
  • TAKANO OGINO, YOSHIKO UEDA, MASAHIRO KOBAYASHI, HITOSHI ISAHARA
    2005 Volume 12 Issue 4 Pages 21-54
    Published: August 26, 2005
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    The study investigated the behavioral patterns of case-marking particles by using valency data extracted from a large-scale authentic corpus. In addition, the study examined the extent to which such patterns of particle behavior contribute to homophone distinction. The data included 12,400 verb concepts, from which combination patterns of case-marking particles were generated. The particle combination patterns totaled 37,237 tokens and 188 types. An experimental homophone distinction, in which the orthographic representations of paired homophone words were identified, was conducted by using the generated particle patterns. The results indicated that 73% of the target homophone pairs were correctly judged at the level of case-marking particles by adding the frequency of particle combination patterns to the judgment criteria.
    Download PDF (5363K)
  • MASAYA YAMAGUCHI, MAKIRO TANAKA
    2005 Volume 12 Issue 4 Pages 55-77
    Published: August 26, 2005
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In this paper, we design and implement a full-text search system, “Himawari”. Himawari is designed to handle the various structures and usages of language resources built for language study and research. To handle the variety of structures, Himawari can search language resources structured in XML, extracting tagged information that may be used to constrain the results. Himawari provides several kinds of indexes, such as suffix arrays, to speed up the search process. To handle the variety of usages, a user can define a query and a method of reference suited to the target language resource. Search results are displayed as a table including KWIC (Key Word In Context), and can be passed to an external reference system, for example an HTML browser or a sound player, when the result cannot be displayed as text data. By applying our system to the Japanese thesaurus “Bunrui Goi Hyo” and the “Corpus of Spontaneous Japanese”, we verified its adaptability to this variety.
    Download PDF (10646K)
  • HIROSHI FUJII, YOICHI TOMIURA, SHOSAKU TANAKA
    2005 Volume 12 Issue 4 Pages 79-96
    Published: August 26, 2005
    Released on J-STAGE: June 07, 2011
    JOURNAL FREE ACCESS
    The automatic discrimination between documents written by native speakers and those written by non-native speakers is an important technique for constructing a high-quality corpus, helping native speakers with writing, and gathering useful knowledge in Second Language Acquisition. This paper proposes a method for such discrimination based on the similarity of part-of-speech trigram distributions. The distributional similarity is given by Skew Divergence. Skew Divergence is an improved variant of KL Divergence that does not suffer from the zero-frequency problem. Using Skew Divergence requires choosing the value of its parameter α. However, there has not been sufficient discussion of how to choose it. This paper also proposes a method for setting the parameter α. The experimental result shows the effectiveness of the proposed method.
    Download PDF (2179K)
  • WAKAKO KASHINO, MASAYA YAMAGUCHI, RIKA KIRYU, MAKIRO TANAKA
    2005 Volume 12 Issue 4 Pages 97-116
    Published: August 26, 2005
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    We are studying temporal transitions of word frequencies through vocabulary analyses using large-scale databases. Specifically, this paper focuses on the use of foreign words. We choose 109 words from governmental documents. Then, we statistically investigate the appearance frequencies of the words using a newspaper full-text database covering 16 years. Our analysis shows that the word appearance frequency patterns can be classified roughly into four types. Then, the linguistic and social backgrounds of those types are discussed, introducing indices such as the “understanding rate” and the first-use year of the words.
    Download PDF (3701K)
  • KYONGHEE PAIK, KIYONORI OHTAKE, BOND FRANCIS, KAZUHIDE YAMAMOTO
    2005 Volume 12 Issue 4 Pages 117-136
    Published: August 26, 2005
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In order to investigate the effect of source language on translations, we examine two variants of a Korean translation corpus. The first variant consists of Korean translations of 162,308 Japanese sentences from the ATR BTEC (Basic Expression Text Corpus). The second variant was made by translating the English translations of the Japanese sentences into Korean. We show that the source language text has a large influence on the target text. Even after normalizing orthographic differences, fewer than 8.3% of the sentences in the two variants were identical. We describe in general which phenomena differ and then discuss how our analysis can be used in natural language processing.
    Download PDF (2175K)
  • Oh-pyo Kweon, Akinori Ito, Motoyuki Suzuki, Shozo Makino
    2005 Volume 12 Issue 4 Pages 137-156
    Published: August 26, 2005
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    This paper describes a method to detect grammatical errors in a non-native speaker's utterance for a dialogue-based CALL (Computer Assisted Language Learning) system. Several dialogue-based CALL systems have been developed for conversation exercises. However, one of the problems in conventional dialogue-based CALL systems is that a learner is usually assigned a passive role. The goal of our system is to allow a learner to compose his/her own sentences freely in a role-playing situation. One of the biggest problems in realizing the proposed system is that the learner's utterance inevitably contains pronunciation, lexical and grammatical errors. In this paper, we focus on the correction of the lexical and grammatical errors. To correct these errors, we propose two methods to detect lexical/grammatical errors in an utterance. The conventional method is to manually write a grammar that accepts the errors. The proposed methods 1 and 2 use ‘error rules’ that are independent of the recognition grammar. Method 1 uses only the correct system grammar and extends the recognition results using the ‘error rules’. Method 2 uses a general grammar (which does not consider the relationship between verb, particle and each noun) to recognize the learner's utterance, checks the acceptance of each N-best result, and searches for the learner's utterance. The grammar error detection experiment showed that method 2 performs as well as the conventional method.
    Download PDF (2913K)
  • MASAKATSU SHIMIZU, YUMIKO SHIMIZU, HIROYUKI AKAMA
    2005 Volume 12 Issue 4 Pages 157-192
    Published: August 26, 2005
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    The aim of this paper is to find practical criteria for choosing between “on” and “l'on”, which are semantically equivalent as the French impersonal subject pronoun. For this alternation, the 17th-century grammarian Vaugelas proposed constraints that depend on the neighboring sounds or spellings and are based on ease of pronunciation, writing, hearing and reading; these have been handed down to almost all subsequent grammarians. But in addition to these “euphonic rules”, it cannot be denied that this alternation is also affected by an unknown heterogeneous parameter, the “historical or psychological identity of word”, explicable from the viewpoint of the ideological linguistics ascribed to Damourette and Pichon (1936).
    We extracted from several electronic corpora all instances of “on” and “l'on” with the preceding and following words, and used the algorithm called C5.0 to simulate the hidden rules governing the discrimination process. As a result, various causal factors (not only “euphonic rules”, but also “historical rules”) can be put together into a branching diagram that connects them with each other. We thus conclude that our complex probabilistic analysis is fully effective for the discrimination of “on” and “l'on”.
    Download PDF (19996K)
  • EMI IZUMI, KOYOTAKA UCHIMOTO, HITOSHI ISAHARA
    2005 Volume 12 Issue 4 Pages 193-210
    Published: August 26, 2005
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Language learners use various kinds of communication strategies to compensate for their imperfect knowledge of the target language, especially vocabulary. Paraphrase is one of the important strategies. It is worth teaching how paraphrase in foreign language communication can be done more effectively. In vocabulary teaching, introducing a word entry together with several other relevant words or expressions would help learners paraphrase more successfully. In this research, we conducted an experiment to see how this kind of expanded teaching of vocabulary or expressions can be accomplished by using diverse expressions extracted from the spoken corpus of Japanese learner English, “The NICT JLE (Japanese Learner English) Corpus”. In the experiment, we first made a keyword list by hand, based on a small amount of English native speakers' utterances. Then, we automatically extracted from the learner data the diverse expressions that describe a particular matter to make the learners' expression list. We obtained approximately 50% for both recall and precision on a complex task, and 70% recall and 60% precision on an easy task. We expect that this expression list can be utilized for teaching paraphrasing strategy, as one example of the use of learner corpora in the foreign language classroom.
    Download PDF (3402K)
  • EMI IZUMI, KOYOTAKA UCHIMOTO, HITOSHI ISAHARA
    2005 Volume 12 Issue 4 Pages 211-225
    Published: August 26, 2005
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In foreign language education, it is important for teachers to know their students' acquisition order of major linguistic items in the target language. Since the 1970s, many studies have sought to reveal the natural sequence in second language acquisition, and it is one of the established ideas that major grammatical morphemes are acquired in a common order by learners across different backgrounds such as their L1, ages, or learning environments (the first hypothesis). However, since the 1980s, a contradictory hypothesis that differences in learners' backgrounds can cause differences in their acquisition orders (the second hypothesis) has been introduced by several studies, mainly on the acquisition order of Japanese learners of English. These studies found that Japanese learners had a unique acquisition order that is different from the natural sequence supporting the first hypothesis. In this paper, we tried to see which of these two contradictory hypotheses applies to the error-tagged corpus of Japanese learner English named “The NICT JLE (Japanese Learner English) Corpus”. In the experiment, we found that there is no significant correlation between the sequence that supports the first hypothesis and the sequence extracted from our corpus. On the other hand, there was a significant correlation between our sequence and the one that supports the second hypothesis. The most distinctive difference between our sequence and the one that supports the first hypothesis is that articles and plural -s are acquired at a later stage by Japanese learners. This might arise from L1 transfer, because the Japanese language has no markers corresponding to articles and plural -s. Therefore, we concluded that our result supports the second hypothesis. We assume that this analysis can be expanded by using information on learners' errors or proficiency levels in the NICT JLE Corpus, to give us some important clues for improving the current language teaching system.
    Download PDF (1676K)
  • RYO NAGATA, FUMITO MASUI, ATSUO KAWAI, NAOKI ISU
    2005 Volume 12 Issue 4 Pages 227-243
    Published: August 26, 2005
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    This paper proposes an unsupervised method for distinguishing mass and count nouns in context using decision lists. The mass count distinction is particularly important in detecting errors concerning the articles and the singular/plural usage in the writing of Japanese learners of English. Decision lists are learned from a set of training data that consist of instances of the target noun used as mass or count. In general, it is costly and time-consuming to acquire a set of training data. To solve the problem, this paper also proposes a method for automatically generating training data. Experiments show that the proposed method achieves an accuracy of 83.9% in distinguishing mass and count nouns in context.
    Download PDF (1722K)
  • SATOSHI SEKINE, YOSHIYUKI TAKEDA, KENJI YOSHIHIRA
    2005 Volume 12 Issue 4 Pages 245-252
    Published: August 26, 2005
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    A KWIC (KeyWord In Context) system is a useful tool for investigating the usage of language. We developed a KWIC system for a huge Web text. The text data is extracted from about 350 gigabytes of Web pages and contains more than 10 billion characters. It was collected by a crawler over a period of about two months. The amount of text data exceeds 4 gigabytes, the limit that can be addressed with 32 bits. We developed a suffix array indexer that can handle 40 bits, and the system searches for sentences containing the desired keywords. In order to show the usefulness of the system for learners of Japanese as a second language, we collected KWIC data for “TO-ITAMU (painful like)” and analyzed the onomatopoeia that appear before the expression.
    Download PDF (3896K)
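
Two of the abstracts above (Himawari, pages 55-77, and the Web KWIC system, pages 245-252) mention suffix-array indexes and KWIC output. As a rough illustration only, the Python sketch below builds a suffix array by naive sorting and binary-searches it for a keyword. The function names (build_suffix_array, find_occurrences, kwic) and the width parameter are hypothetical and are not taken from either paper. Neither system's actual indexer is reproduced here, and because Python integers are unbounded, the 32-bit versus 40-bit offset issue does not arise in this sketch.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text in lexicographic order.

    Naive construction by sorting slices; acceptable only for small inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, keyword):
    """Return the sorted offsets at which keyword occurs in text.

    Suffixes that start with keyword form one contiguous run in the
    suffix array, so two binary searches find the run's boundaries.
    """
    k = len(keyword)

    # Lower bound: first suffix whose k-character prefix is >= keyword.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < keyword:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose k-character prefix is > keyword.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= keyword:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


def kwic(text, sa, keyword, width=10):
    """Return (left context, keyword, right context) rows for each hit."""
    rows = []
    for pos in find_occurrences(text, sa, keyword):
        left = text[max(0, pos - width):pos]
        end = pos + len(keyword)
        right = text[end:end + width]
        rows.append((left, keyword, right))
    return rows


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    index = build_suffix_array(sample)
    for left, key, right in kwic(sample, index, "the", width=4):
        # Right-align the left context so the keyword column lines up.
        print(f"{left:>4} [{key}] {right}")
```

The naive sort keeps the sketch short but copies a suffix for every comparison key, so it would not scale to corpora of the size described above; a production indexer would use a dedicated suffix-array construction algorithm and fixed-width offsets.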