Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Report
Construction of a Blog Corpus with Syntactic, Anaphoric, and Sentiment Annotations
Chikara Hashimoto, Sadao Kurohashi, Daisuke Kawahara, Keiji Shinzato, Masaaki Nagata
Keywords: Blog, Annotated Corpus

2011 Volume 18 Issue 2 Pages 175-201

Abstract
Interest in information access and analysis technologies targeting blog articles has been growing in recent years. To provide the research community with basic data, we constructed a blog corpus of 249 articles (4,186 sentences) with the following features: i) it is annotated with sentence boundaries; ii) it is annotated with grammatical information about morphology, dependency, case, anaphora, and named entities, in a way consistent with the Kyoto University Text Corpus; iii) it is annotated with sentiment information; iv) it is provided with HTML files that visualize all of the above annotations. We asked 81 university students to write blog articles on one of four topics: sightseeing in Kyoto, cellphones, sports, or gourmet food. In constructing the annotated blog corpus, we faced problems concerning sentence boundaries, parentheses, errata, dialect, a variety of smileys, and other morphological variations. In this paper, we describe the specification of the corpus and how we dealt with these problems.
© 2011 The Association for Natural Language Processing