There has been growing interest recently in information access and analysis technologies that target blog articles. To provide the research community with basic data, we constructed a blog corpus that consists of 249 articles (4,186 sentences) and has the following features: i) annotated with sentence boundaries; ii) annotated with grammatical information about morphology, dependency, case, anaphora, and named entities, in a way consistent with the Kyoto University Text Corpus; iii) annotated with sentiment information; iv) provided with HTML files that visualize all of the above annotations. We asked 81 university students to write blog articles on one of four topics: sightseeing in Kyoto, cellphones, sports, or gourmet food. In constructing the annotated blog corpus, we faced problems concerning sentence boundaries, parentheses, typographical errors, dialect, a variety of smileys, and other morphological variations. In this paper, we describe the specification of the corpus and how we dealt with these problems.