Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Volume 21, Issue 2
Preface
Paper
  • Yasuharu Den, Hanae Koiso
    2014 Volume 21 Issue 2 Pages 99-123
    Published: April 18, 2014
    Released on J-STAGE: July 17, 2014
    JOURNAL FREE ACCESS
    In this paper, we (i) propose a general-purpose database schema that can represent multichannel and multimodal spoken discourse corpora, and (ii) develop tools that construct a database instantiating this schema, guided by configuration files, from annotations in the various formats produced by existing annotation tools. Spoken discourse corpora involve more than 10 different annotations covering both verbal and nonverbal information. They require integrating a large number of linguistic and nonlinguistic units and the relations among them, and they require the ability to search these units with complex queries that refer to multiple units. For spoken discourse corpora, it is essential to utilize existing annotation tools that are widely used in the community. We propose a method for building an environment for using spoken discourse corpora that makes effective use of existing annotation and search tools. The method has been applied to spoken discourse corpora developed by different organizations and has been used effectively in corpus-linguistic research.
  • Takashi Tsutsui, Takuya Gaman, Takashi Oshiro, Kohei Sugawara, Takahir ...
    2014 Volume 21 Issue 2 Pages 125-155
    Published: April 18, 2014
    Released on J-STAGE: July 17, 2014
    JOURNAL FREE ACCESS
    In recent years, minutes of regional assemblies and the National Diet have been published on the web. These minutes have long recorded the transcribed discussions of mayors and assembly members. Therefore, they are a target of study in various fields such as politics, economics, linguistics, and information engineering. Since the minutes of the National Diet are maintained in electronic form and freely available via a search system, many researchers have utilized them as a target of study. Minutes of regional assembly meetings are also the focus of researchers in various fields. However, researchers have had trouble gathering and preparing minutes for their study, because the way in which minutes are made available to the public varies from assembly to assembly. It is very inefficient for each researcher to digitize minutes separately. To improve the situation and contribute to research communities, we have collected minutes of regional assemblies and constructed a corpus of regional assembly minutes. In this paper, we describe the construction of this corpus. The corpus records minutes from regional assemblies all over Japan that are available on the web. We added information to the corpus, such as “date,” “name of meeting,” “name of speaker,” and “text of statement,” so that users can search statements across the corpus using such information. The final goal of our project is to build a political information system that can recommend a suitable person, or members of an assembly, according to the consistency between users’ opinions and statements of assembly members. As a preliminary step of development, we annotated a part of the corpus with information about the speaker’s attitude to specific political subjects, including the degree of approval or disapproval. In this paper, we also report the result of the annotation.
  • Hideyuki Shibuki, Masahiro Nakano, Rintaro Miyazaki, Madoka Ishioroshi ...
    2014 Volume 21 Issue 2 Pages 157-212
    Published: April 18, 2014
    Released on J-STAGE: July 17, 2014
    JOURNAL FREE ACCESS
    Over a span of three years, we have constructed and improved four corpora that are the basis for generating summaries for the verification of information credibility. The summary generated to verify the credibility of information is a brief document composed of extracts from web documents; it provides material to the user for judging the validity of a statement. In this paper, we describe a set of tags designed for observing annotation and preparing a gold standard for the summary. Further, we describe the method of annotation. Because examining each web document for its appropriateness in contributing to the summary is difficult, we describe the methodology of obtaining appropriate documents. Furthermore, we share our observations and the lessons learned from the process of constructing these corpora.
  • Masatsugu Hangyo, Daisuke Kawahara, Sadao Kurohashi
    2014 Volume 21 Issue 2 Pages 213-247
    Published: April 18, 2014
    Released on J-STAGE: July 17, 2014
    JOURNAL FREE ACCESS
    Recently, there have been active studies of semantic analysis in the field of natural language processing. To study semantic analysis, a corpus annotated with semantic relations is required. Although existing corpora annotated with semantic relations have been restricted to newspaper articles, there are texts of various genres and styles containing linguistic expressions that are missing in newspaper articles. In this paper, we define annotation criteria for linguistic phenomena which have not been treated using existing criteria. We have built a diverse document leads corpus annotated with semantic relations. We report the statistics of this corpus.
  • Suguru Matsuyoshi
    2014 Volume 21 Issue 2 Pages 249-270
    Published: April 18, 2014
    Released on J-STAGE: July 17, 2014
    JOURNAL FREE ACCESS
    This paper proposes an annotation scheme for the focus of negation in Japanese text. Negation has a scope, and its focus falls within this scope. The scope of negation is the part of the sentence that is negated. The focus of negation is the part of the scope that is prominently negated. In natural language processing, correct interpretation of negated statements requires precise detection of the focus of negation in the statements. As a foundation for developing a focus detector, we have annotated a part of “Rakuten Travel: User Review Data” and a part of a newspaper subcorpus of the “Balanced Corpus of Contemporary Written Japanese,” with our annotation scheme. In this scheme, a negation cue in the text data is linked to the focus by annotation with identifying clues. These clues include focus particles such as “wa” and “shika,” and other expressions in the context. We report 1,327 negation cues and the foci in the corpora.
  • Yohei Seki
    2014 Volume 21 Issue 2 Pages 271-299
    Published: April 18, 2014
    Released on J-STAGE: July 17, 2014
    JOURNAL FREE ACCESS
    Recent sentiment analysis studies have demonstrated that many services such as public opinion surveys and reputation analyses are derived from a variety of documentary resources. The annotated corpus in sentiment analysis is one essential resource, as are other NLP technologies such as POS tagging and named entity extraction. The sentiment annotation policy should be defined according to the task and relevant document genre. Recently, many sentiment corpora have been published in news, review, and blog genres. However, a sentiment corpus in the dialog document genre, which involves questions and answers, has yet to be studied, and a sentiment annotation policy has yet to be clearly defined. In this paper, we explain an approach to annotating and creating a sentiment corpus with detailed sentiment types using community QA documents in BCCWJ. We also identify the different sentiment characteristics in a corpus through combinations of annotations to provide novel insights in the challenging topics of opinion question answering and domain adaptation.
  • Toshinobu Ogiso, Takenori Nakamura
    2014 Volume 21 Issue 2 Pages 301-332
    Published: April 18, 2014
    Released on J-STAGE: July 17, 2014
    JOURNAL FREE ACCESS
    “Balanced Corpus of Contemporary Written Japanese” is a large-scale Japanese corpus of 100 million words. It contains 170,000 XML files annotated with two levels of morphological information: short-unit word and long-unit word. We have constructed an annotation system to compile this corpus. The system allows many users to modify corpus annotations and dictionary entries, which are related to each other, while ensuring consistency. The system consists of a relational database server called the “Morphological Information Database,” a client tool that maintains the morphological information of the corpus called “Dynagon,” and a tool that manages dictionary entries for morphological analysis called “UniDic Explorer.” This paper describes the design, implementation, and operation of this “Morphological Information Database” for BCCWJ.
  • Yuichiroh Matsubayashi, Ryu Iida, Ryohei Sasano, Hikaru Yokono, Suguru ...
    2014 Volume 21 Issue 2 Pages 333-377
    Published: April 18, 2014
    Released on J-STAGE: July 17, 2014
    JOURNAL FREE ACCESS
    Japanese corpora annotated with predicate-argument structure (PAS) have been constructed as part of several research projects and these annotated corpora have significantly advanced the field of PAS analysis. However, according to an inter-annotator agreement study and qualitative analysis of the existing corpora, there is still a strong need for further improvement of the annotation guidelines of the corpora. To improve the quality of PAS annotation guidelines, we have collected and summarized the practical knowledge and a list of problematic issues concerning the task of the PAS annotation through discussions with researchers actively engaged in the construction of NAIST Text Corpus (NTC) and Kyoto Text Corpus (KTC), researchers concerned with existing PAS annotation guidelines, and an annotator who is working on the annotation task, using NTC and KTC guidelines. This paper reports the problems and suggestions that we collected and possible solutions to those problems on the basis of results of the discussions. Finally, we suggest a method for continuously improving annotation guidelines.
  • Shunsuke Kozawa, Kiyotaka Uchimoto, Yasuharu Den
    2014 Volume 21 Issue 2 Pages 379-401
    Published: April 18, 2014
    Released on J-STAGE: July 17, 2014
    JOURNAL FREE ACCESS
    Existing dictionaries, corpora, and analyzers are usually not applicable to linguistic research that uses a new part-of-speech tagset. Dictionaries and corpora are often newly constructed. On the other hand, existing analyzers can be reused by improving them. However, it is not clear how they could be improved. This paper describes how an analyzer constructed for analyzing a certain corpus can be applied to another corpus with a different part-of-speech tagset. In particular, we improved the features and labels used to train a long-unit-word analyzer based on the Corpus of Spontaneous Japanese (CSJ) by focusing on the differences between CSJ and the Balanced Corpus of Contemporary Written Japanese (BCCWJ), and applied the analyzer to BCCWJ. The experimental results show the advantage of the proposed method.