Mathematical Linguistics
Online ISSN : 2433-0302
Print ISSN : 0453-4611
Paper (A)
Comparing the Usage Rate of a Word between Two Corpora
Which should We Use as an Observation Unit (Case), a Word or a Text?
Hideaki Mori
JOURNAL OPEN ACCESS

2017 Volume 31 Issue 3 Pages 205-221

Abstract

When the usage rate of a word appears to differ between two corpora, the common way to verify the difference is to conduct a chi-square test on word frequencies. However, treating each word as an observation unit has been criticized for not meeting the assumption of randomness that underlies the statistical test. The choice of words and the regularity of their use depend fundamentally on the author’s judgment. When comparing the usage rate of a word, texts, which reflect the author’s judgment, should therefore be treated as the observation units rather than individual words. In this paper, we propose an analytical method that uses a text as the observation unit to compare the usage rate of a word between two corpora. Differences in the usage rate of a word can be explained by differences in the text frequency distribution. Furthermore, performing a chi-square test on text frequencies makes it possible to show, through the effect size, the degree to which the text distribution varies between the two corpora. In comparing the usage rate of a word, therefore, we should consider the text rather than the word as the observation unit.
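
The contrast between the two observation units can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes a simple 2x2 table in both cases (for the text unit, texts that contain the word versus texts that do not), uses the Pearson chi-square statistic without continuity correction, and takes phi as the effect size. The function names and the toy corpora are hypothetical, and the toy counts are far too small for the chi-square approximation to be trusted; they only show how the two tables are built.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and phi for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    phi = math.sqrt(chi2 / n)  # effect size for a 2x2 table
    return chi2, phi

def word_unit_table(corpus_a, corpus_b, word):
    """Observation unit = word token: target tokens vs. all other tokens."""
    cells = []
    for corpus in (corpus_a, corpus_b):
        hits = sum(text.count(word) for text in corpus)
        total = sum(len(text) for text in corpus)
        cells += [hits, total - hits]
    return tuple(cells)

def text_unit_table(corpus_a, corpus_b, word):
    """Observation unit = text: texts containing the word vs. texts without it
    (an assumed simplification of a text frequency distribution)."""
    cells = []
    for corpus in (corpus_a, corpus_b):
        containing = sum(1 for text in corpus if word in text)
        cells += [containing, len(corpus) - containing]
    return tuple(cells)

# Hypothetical toy corpora: each text is a list of tokens.
# In corpus A one author uses "w" heavily; in corpus B every author uses it once.
corpus_a = [["w"] * 6 + ["x"] * 4, ["x"] * 10, ["x"] * 10, ["x"] * 10]
corpus_b = [["w"] + ["x"] * 9 for _ in range(4)]

word_table = word_unit_table(corpus_a, corpus_b, "w")
text_table = text_unit_table(corpus_a, corpus_b, "w")

chi2_word, phi_word = chi_square_2x2(*word_table)
chi2_text, phi_text = chi_square_2x2(*text_table)

print("word unit:", word_table, round(chi2_word, 3), round(phi_word, 3))
print("text unit:", text_table, round(chi2_text, 3), round(phi_text, 3))
```

In this toy case the word-based table is dominated by a single text in corpus A, whereas the text-based table counts each author's choice once, so the two tables describe the difference between the corpora quite differently.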

© 2017 The Mathematical Linguistic Society of Japan

This article is licensed under a Creative Commons [Attribution - NonCommercial - NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nc-nd/4.0/deed.ja