Corpora have played a crucial role in natural language processing and linguistics. However, very few corpora of children's writing exist, because of difficulties peculiar to child corpus creation. In this paper, we propose a method that avoids these difficulties and creates a child corpus efficiently. To show its effectiveness, we have used the proposed method to create a child corpus. The result is a child corpus called Kodomo Corpus, which contains 39,269 morphemes and is the largest written child corpus.
A distinctive feature of Kodomo Corpus is that editing histories, such as additions and deletions, can be traced through its data tags.