2010 Volume 138 Pages 1-23
Although the use of electronic corpora in linguistic studies of Japanese has lagged behind that of the other major languages of the world, the situation is changing rapidly owing to several factors, most notably the ongoing construction of a balanced corpus of the language at the National Institute for Japanese Language and Linguistics.
This paper focuses on collocation, a linguistic phenomenon which can be analyzed reliably only by using large corpora, and explores the possible roles which corpora may play in the compilation of a dictionary of Japanese, be it a dictionary of an ordinary kind or a collocational dictionary. The three collocational aspects of Japanese examined by way of corpus analysis are: 1) the concept of ‘circumcollocate’, 2) the degree of markedness of verbs and adjectives, and 3) the semantic differences between synonymous idiomatic grammatical phrases. The paper will demonstrate the ways in which corpora may have lexicographic significance in each of those domains.
A large corpus is required for the retrieval of collocational information. The paper uses a Web corpus, constructed by the author in 2008, which consists of approximately 75 billion characters. This corresponds to about 150 gigabytes in file size, or to between 300,000 and 400,000 Japanese novels of average length.
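The size figures can be cross-checked with a short calculation. The sketch below is illustrative only: it assumes two bytes per character (as in a double-byte encoding such as EUC-JP or Shift_JIS, which the paper does not specify) and decimal gigabytes, and the function names are hypothetical.

```python
# Rough consistency check of the corpus size figures quoted in the abstract.
# Assumptions (not stated in the source): 2 bytes per character,
# and 1 gigabyte = 10**9 bytes.

CORPUS_CHARS = 75_000_000_000   # approx. 75 billion characters (from the abstract)
BYTES_PER_CHAR = 2              # assumption: double-byte encoding


def corpus_size_gb(chars: int, bytes_per_char: int) -> float:
    """Return the file size in decimal gigabytes."""
    return chars * bytes_per_char / 1e9


def chars_per_novel(chars: int, novels: int) -> float:
    """Return the average novel length implied by a given novel count."""
    return chars / novels


size_gb = corpus_size_gb(CORPUS_CHARS, BYTES_PER_CHAR)
shortest = chars_per_novel(CORPUS_CHARS, 400_000)  # upper novel count
longest = chars_per_novel(CORPUS_CHARS, 300_000)   # lower novel count

print(f"Estimated file size: {size_gb:.0f} GB")
print(f"Implied novel length: {shortest:,.0f} to {longest:,.0f} characters")
```

Under these assumptions the 150-gigabyte figure follows directly from the character count, and the novel comparison implies an average novel of roughly 190,000 to 250,000 characters.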