Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 8, Issue 1
Displaying 1-12 of 12 articles from this issue
  • [in Japanese]
    2001 Volume 8 Issue 1 Pages 1-3
    Published: January 10, 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (259K)
  • Minoru Sasaki, Kenji Kita
    2001 Volume 8 Issue 1 Pages 5-19
    Published: January 10, 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    The vector space model is a conventional information retrieval model in which text documents are represented as high-dimensional, sparse vectors in a multidimensional space, using words as features. These vectors consume a large amount of computational resources, and it is difficult to capture the underlying concepts referred to by the terms. In this paper, we present an information retrieval technique that uses random projection to map document vectors into a low-dimensional space, as a way of solving these problems. To evaluate its efficiency, we report retrieval experiments on the MEDLINE test collection. The experiments show that the proposed method is faster than LSI (Latent Semantic Indexing) while achieving retrieval effectiveness close to that of LSI. In addition, we propose to produce the concept vectors that random projection needs for dimensionality reduction with a spherical k-means algorithm. The results of this evaluation show that the concept vectors capture the underlying concepts of the corpus effectively. (An illustrative sketch of random projection follows this entry.)
    Download PDF (1480K)
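    Below is a minimal sketch of the random-projection idea: sparse document vectors are multiplied by a random matrix to obtain low-dimensional representations whose inner products approximately survive the projection. This illustrates the general technique only, not the authors' implementation; the paper derives the projection from concept vectors produced by spherical k-means, whereas this sketch draws the matrix at random, and the data here is synthetic.

      import numpy as np

      rng = np.random.default_rng(0)
      n_docs, n_terms, k = 1000, 20000, 300

      # Synthetic sparse term-document matrix (documents as rows).
      X = (rng.random((n_docs, n_terms)) < 0.001).astype(float)

      # Gaussian random projection: k columns, scaled so that inner
      # products are approximately preserved (Johnson-Lindenstrauss).
      R = rng.standard_normal((n_terms, k)) / np.sqrt(k)
      X_low = X @ R                      # (n_docs, k) reduced vectors

      # Rank documents by cosine similarity to a query in the reduced space.
      q = X[0] @ R
      sims = (X_low @ q) / (np.linalg.norm(X_low, axis=1) * np.linalg.norm(q) + 1e-12)
      print(sims.argsort()[::-1][:5])    # indices of the five nearest documents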
  • Takashi Ninomiya, Kentaro Torisawa, Jun'ichi Tsujii
    2001 Volume 8 Issue 1 Pages 21-47
    Published: January 10, 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    We describe an agent-based parallel HPSG parser that operates on shared-memory parallel machines. It efficiently parses real-world corpora using a wide-coverage HPSG grammar. The efficiency is due to the use of a parallel parsing algorithm and the efficient treatment of feature structures. The parsing algorithm is based on the CKY algorithm, in which resolving the constraints between a mother and her daughters is regarded as an atomic operation; the CKY algorithm determines both the data distribution and the granularity of parallelism. The keys to the efficient treatment of feature structures are i) transferring them through shared memory, ii) copying them on demand, and iii) writing them to and reading them from memory simultaneously. Being parallel, our parser is more efficient than sequential parsers: the average parsing time per sentence on the EDR Japanese corpus was 78 msec, and the speed-up reached 13.2 when 50 processors were used. (A sequential CKY sketch follows this entry.)
    Download PDF (6649K)
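    For reference, here is a sequential sketch of the CKY recognition loop that the paper parallelizes. It uses a toy CFG in Chomsky normal form with atomic categories; the actual parser resolves HPSG feature-structure constraints at each mother-daughter combination and distributes that work over agents on shared memory. The grammar and sentence below are assumptions for illustration.

      from itertools import product

      # Toy CNF grammar (assumed): binary rules (B, C) -> {A} and lexical rules.
      BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
      LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

      def cky_recognize(words, start="S"):
          n = len(words)
          chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
          for i, w in enumerate(words):
              chart[i][i + 1] = set(LEXICAL.get(w, ()))
          for span in range(2, n + 1):
              for i in range(n - span + 1):
                  j = i + span
                  # Combining chart[i][k] with chart[k][j] is the atomic unit
                  # of work that the paper distributes over parallel agents.
                  for k in range(i + 1, j):
                      for b, c in product(chart[i][k], chart[k][j]):
                          chart[i][j] |= BINARY.get((b, c), set())
          return start in chart[0][n]

      print(cky_recognize("the dog saw the cat".split()))  # True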
  • Yoshitaka Kameya, Takashi Mori, Taisuke Sato
    2001 Volume 8 Issue 1 Pages 49-84
    Published: January 10, 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Probabilistic context-free grammars (PCFGs) are a widely known class of statistical language models, and the Inside-Outside (I-O) algorithm is well known as an EM algorithm tailored to PCFGs. Although the algorithm requires only inexpensive linguistic resources, its efficiency remains a problem. In this paper, we present a new framework for efficient EM learning of PCFGs in which the parser is separated from the EM algorithm, assuming the underlying CFG is given. The new EM procedure exploits the compactness of the WFSTs (well-formed substring tables) generated by the parser. Our framework is quite general in the sense that the input grammar need not be in Chomsky normal form (CNF), while the new EM algorithm is equivalent to the I-O algorithm in the CNF case. In addition, we propose a polynomial-time EM procedure for CFGs with context-sensitive probabilities, and report experimental results with the ATR corpus and a hand-crafted Japanese grammar. (A sketch of the inside pass follows this entry.)
    Download PDF (3477K)
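    The following sketch shows the inside pass for a toy PCFG in CNF, the half of the Inside-Outside computation that yields the sentence probability; the full I-O algorithm also computes outside probabilities and re-estimates rule probabilities from their products. The paper's EM procedure instead operates on the packed WFSTs produced by a separate parser and does not require CNF. The grammar and probabilities here are assumed toy values.

      from collections import defaultdict

      # Toy PCFG in CNF (assumed): P(A -> B C) and P(A -> word).
      BINARY = {("S", ("NP", "VP")): 1.0, ("NP", ("Det", "N")): 1.0,
                ("VP", ("V", "NP")): 1.0}
      LEXICAL = {("Det", "the"): 1.0, ("N", "dog"): 0.5, ("N", "cat"): 0.5,
                 ("V", "saw"): 1.0}

      def inside(words):
          """beta[(A, i, j)] = probability that A derives words[i:j]."""
          n = len(words)
          beta = defaultdict(float)
          for i, w in enumerate(words):
              for (a, word), p in LEXICAL.items():
                  if word == w:
                      beta[(a, i, i + 1)] += p
          for span in range(2, n + 1):
              for i in range(n - span + 1):
                  j = i + span
                  for (a, (b, c)), p in BINARY.items():
                      for k in range(i + 1, j):
                          beta[(a, i, j)] += p * beta[(b, i, k)] * beta[(c, k, j)]
          return beta

      words = "the dog saw the cat".split()
      print(inside(words)[("S", 0, len(words))])  # sentence probability: 0.25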
  • Minoru Sasaki, Kenji Kita
    2001 Volume 8 Issue 1 Pages 85-99
    Published: January 10, 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In evaluating the effectiveness of information retrieval (IR) and extraction systems, the most common method is to compare two retrieval methods and decide whether one system measurably achieves better results than the other. However, it is difficult for researchers to compare more than two retrieval methods, because there are many participants in the IR task of the IREX workshop. In this paper, we evaluate the characteristics and the effectiveness of the IR systems using a statistical method, based on the results of the IR formal run and questionnaires about the systems. The comparisons address the effects of factors such as indexing, querying, and the retrieval model on performance. The results confirm the effectiveness of this evaluation method, since phrases relate to performance more strongly than words. In many systems there is a trade-off between the precision at recall 0.0 and the rate of decrease, and this result indicates the difficulty of choosing techniques for a system. We also evaluate correlations between the efficiency and the characteristics of the systems with both short and long versions of the topics. The results of this evaluation show that it is important to select effective methods for the long version of the topics. (A sketch of a paired comparison follows this entry.)
    Download PDF (1452K)
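    The abstract does not name the specific statistical method used, so the sketch below shows one common way to compare two IR systems on per-topic scores: a paired Wilcoxon signed-rank test over average precision. The scores are invented for illustration, and the actual IREX analysis may well use a different test.

      from scipy.stats import wilcoxon  # paired, non-parametric test

      # Hypothetical per-topic average precision for two systems.
      ap_a = [0.42, 0.35, 0.58, 0.21, 0.66, 0.47, 0.30, 0.55]
      ap_b = [0.38, 0.36, 0.49, 0.25, 0.60, 0.41, 0.28, 0.50]

      stat, p = wilcoxon(ap_a, ap_b)
      print(f"Wilcoxon statistic = {stat:.1f}, p = {p:.3f}")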
  • Diana Maynard, Sophia Ananiadou
    2001 Volume 8 Issue 1 Pages 101-125
    Published: January 10, 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    This paper examines the use of linguistic techniques in the area of automatic term recognition. It describes the TRUCKS model, which makes use of different types of contextual information (syntactic, semantic, terminological, and statistical), seeking particularly to identify those parts of the context that are most relevant to terms. From an initial corpus of sublanguage texts, the system identifies, disambiguates, and ranks candidate terms. The system is evaluated with respect to the statistical approach on which it is built, and with respect to its expected theoretical performance. We show that by using deeper forms of contextual information, we can improve the extraction of multi-word terms. The resulting list of ranked terms is shown to improve on that produced by traditional methods, in terms of precision and distribution, while the information acquired in the process can also be used for a variety of other applications, such as disambiguation, lexical tuning, and term clustering. (A sketch of a statistical term-ranking baseline follows this entry.)
    Download PDF (3559K)
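    As a reference point for the statistical side of term recognition, here is a sketch of the C-value score, a standard frequency-based ranking for nested multi-word terms; whether TRUCKS builds on exactly this score is not stated in the abstract, and the candidate terms and frequencies below are assumed toy data.

      import math

      # Hypothetical candidate multi-word terms with corpus frequencies.
      FREQ = {
          ("basal", "cell"): 14,
          ("basal", "cell", "carcinoma"): 9,
          ("cell", "carcinoma"): 11,
      }

      def c_value(term):
          """C-value: length-weighted frequency, discounted for occurrences
          nested inside longer candidate terms."""
          longer = [t for t in FREQ if len(t) > len(term) and any(
              t[i:i + len(term)] == term for i in range(len(t) - len(term) + 1))]
          f = FREQ[term]
          if longer:
              f -= sum(FREQ[t] for t in longer) / len(longer)
          return math.log2(len(term)) * f

      for t in sorted(FREQ, key=c_value, reverse=True):
          print(" ".join(t), round(c_value(t), 2))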
  • An Approach to the Unknown Word Problem
    Kiyotaka Uchimoto, Satoshi Sekine, Hitoshi Isahara
    2001 Volume 8 Issue 1 Pages 127-141
    Published: January 10, 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Morphological analysis is one of the basic techniques used in Japanese sentence analysis. A morpheme is defined as the minimal grammatical unit, such as a word or a suffix. Morphological analysis is the process of segmenting a given sentence into a sequence of morphemes and assigning to each morpheme grammatical attributes such as a part-of-speech (POS) tag and an inflection type. Recently, one of the most important issues in morphological analysis has become how to deal with unknown words, i.e., words that are not found in a dictionary or a training corpus. So far, there have been mainly two statistical approaches to this issue. One is to acquire unknown words from corpora and incorporate them into a dictionary. The other is to estimate a model that can recognize unknown words correctly. We would like to make good use of both approaches: if words acquired by the former method could be added to a dictionary, and a model developed by the latter method could consult the amended dictionary, then the model could be the best statistical model with the potential to overcome the unknown word problem. In this paper, we propose a method for Japanese morphological analysis based on a maximum entropy (M.E.) model. This method uses a model that can not only consult a dictionary with a large amount of lexical information but also recognize unknown words by learning certain characteristics, focusing on information such as which types of characters are used in a string. The recall and precision of the identification of morpheme segments and their major parts-of-speech were 95.80% and 95.09%, respectively, on the Kyoto University corpus. (A character-type feature sketch follows this entry.)
    Download PDF (1638K)
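    A central idea in the paper is that character types (kanji, hiragana, katakana, digits, Latin letters) carry strong cues for recognizing unknown words. The sketch below extracts such character-type information from a candidate string as a feature an M.E. model could consume; the feature naming and granularity are assumptions, not the authors' exact feature set.

      import unicodedata

      def char_types(s):
          """Coarse character-type features for a candidate morpheme string
          (sketch; the paper's exact feature set may differ)."""
          types = set()
          for ch in s:
              name = unicodedata.name(ch, "")
              if "CJK UNIFIED" in name:
                  types.add("kanji")
              elif "HIRAGANA" in name:
                  types.add("hiragana")
              elif "KATAKANA" in name:
                  types.add("katakana")
              elif ch.isdigit():
                  types.add("digit")
              elif ch.isalpha():
                  types.add("latin")
              else:
                  types.add("symbol")
          return sorted(types)

      print(char_types("パソコン"))  # ['katakana']  ("personal computer")
      print(char_types("第10回"))   # ['digit', 'kanji']  ("the 10th")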
  • Satoru Ikehara, Shinji Nakai, Jin'ichi Murakami
    2001 Volume 8 Issue 1 Pages 143-174
    Published: January 10, 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    In order to represent the knowledge needed to resolve syntactic and semantic ambiguities, we propose structural rules and methods for generalizing them. In this method, a structural rule is composed of a structure definition part and a class definition part. The former is written as a set drawn from a wildcard ("almighty") symbol, syntactic attributes, semantic attributes, and literal words. From the viewpoint of the number of parameters used to define the expression structures, the rules are classified into one-dimensional rules, two-dimensional rules, and so on, and they are automatically generated in this order from examples. The prominent feature of this method lies in its generalization step: the generated rules are further generalized, based on the hierarchical (upper-to-lower) relations of semantic or syntactic attributes, to reduce the number of rules without decreasing performance. The method was applied to generate dependency rules for Japanese expressions of the form “A no B no C”, which are among the most common noun phrases and are known to be very hard to disambiguate. It was found that structural rules can easily be obtained by this method, and the experimental results showed that the dependency relations can be determined with an accuracy of 86% by the rules obtained. This rate is not low compared with human performance on this kind of ambiguous noun phrase. (A sketch of attribute-based rule generalization follows this entry.)
    Download PDF (3400K)
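    The generalization step can be pictured as merging rules slot-by-slot up a semantic-attribute hierarchy. The sketch below illustrates this with a tiny hypothetical hierarchy and two-slot rules; the hierarchy, the rule format, and the merging policy are all assumptions for illustration, not the paper's actual attribute system.

      # Tiny hypothetical semantic-attribute hierarchy: child -> parent.
      PARENT = {"dog": "animal", "cat": "animal",
                "animal": "concrete", "car": "concrete"}

      def ancestors(attr):
          chain = [attr]
          while chain[-1] in PARENT:
              chain.append(PARENT[chain[-1]])
          return chain

      def generalize(rule_a, rule_b):
          """Merge two structural rules slot-by-slot to the lowest common
          ancestor of their semantic attributes (illustration only)."""
          merged = []
          for a, b in zip(rule_a, rule_b):
              common = next((x for x in ancestors(a) if x in ancestors(b)), None)
              if common is None:
                  return None  # no shared ancestor: the rules cannot be merged
              merged.append(common)
          return tuple(merged)

      # Rules learned from "dog ... car" and "cat ... car" examples collapse
      # into a single rule over the attribute "animal":
      print(generalize(("dog", "car"), ("cat", "car")))  # ('animal', 'car')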
  • Setsuo Yamada, Eiichiro Sumita, Hideki Kashioka
    2001 Volume 8 Issue 1 Pages 175-190
    Published: January 10, 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    This paper proposes to improve translation quality by using information on dialogue participants that is easily obtained from outside the translation component. If we wish to make conversation through a dialogue translation system smooth, it is important to use not only linguistic information, which comes from the source language, but also extra-linguistic information, which is shared between the participants in the conversation. We incorporated the participants' social roles into transfer rules and dictionary entries, and conducted an experiment with 23 unseen dialogues (344 utterances) using English-to-Japanese translation. The experiment demonstrated that recall and precision for expressions that should be polite are 65% and 86%, respectively. Thus, our simple and easy-to-implement method is effective, and it is a key technology for enabling smooth conversation through dialogue translation. Additionally, this paper discusses other useful information, such as the participants' gender, and how our method could apply information on dialogue participants to other language pairs. (A toy role-conditioned transfer sketch follows this entry.)
    Download PDF (1486K)
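    The sketch below illustrates the flavor of role-conditioned transfer: a transfer-rule table keyed by the hearer's social role selects a polite or plain Japanese form. The table, roles, and fallback policy are hypothetical; the paper's transfer rules and dictionary entries are richer than this.

      # Hypothetical transfer-rule table keyed by the hearer's social role.
      TRANSFER = {
          ("Thank you.", "customer"): "ありがとうございます。",  # polite form
          ("Thank you.", "friend"):   "ありがとう。",            # plain form
      }

      def translate(utterance, hearer_role):
          """Select the target expression whose politeness level matches the
          hearer's role; default to the polite form for unknown roles."""
          return TRANSFER.get((utterance, hearer_role),
                              TRANSFER.get((utterance, "customer"), utterance))

      print(translate("Thank you.", "customer"))  # polite: arigatou gozaimasu
      print(translate("Thank you.", "friend"))    # plain:  arigatou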
  • 2001 Volume 8 Issue 1 Pages 191a
    Published: 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (50K)
  • 2001 Volume 8 Issue 1 Pages 191c
    Published: 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (50K)
  • 2001 Volume 8 Issue 1 Pages 191b
    Published: 2001
    Released on J-STAGE: March 01, 2011
    JOURNAL FREE ACCESS
    Download PDF (50K)