This paper proposes a method for discerning the nativeness of English documents with high precision, based on the Bayes decision rule and statistical hypothesis testing. Regarding a document as a sequence of parts of speech, the proposed method compares the probability of the document under a statistical language model of native English with its probability under a model of non-native English. The statistical language model used here is an n-gram model. An n-gram model with a large n can be expected to capture the difference between native and non-native English well, and so has the potential to discern nativeness with high precision. However, with a large n the zero-frequency problem and the sparseness problem become acute, and the maximum likelihood estimates of the n-gram probabilities can no longer be relied upon. The proposed method therefore estimates the ratio of the probability of the document under the native English language model to that under the non-native English language model by means of statistical hypothesis testing. Experimental results show that the proposed method discerns nativeness with a precision of 92.5%, which is significantly higher than that of traditional methods.
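
The comparison of the two language models can be illustrated with a minimal sketch. It is not the estimator proposed in the paper: add-one (Laplace) smoothing is swapped in here as a simple stand-in for the hypothesis-testing estimate of the probability ratio, purely so that unseen n-grams do not yield zero probabilities. The class `NGramModel`, the function names, the toy tag set, and the toy corpora are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ngrams(tags, n):
    """Return the n-grams of a tag sequence, padded with start symbols."""
    padded = ["<s>"] * (n - 1) + list(tags)
    return [tuple(padded[i:i + n]) for i in range(len(tags))]


class NGramModel:
    """POS-tag n-gram model with add-alpha smoothing (an assumption,
    standing in for the paper's hypothesis-testing estimate)."""

    def __init__(self, corpus, n, tagset, alpha=1.0):
        self.n = n
        self.alpha = alpha
        self.num_tags = len(tagset)
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        for tags in corpus:
            for gram in ngrams(tags, n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def log_prob(self, tags):
        """Smoothed log-probability of a whole tag sequence."""
        total = 0.0
        for gram in ngrams(tags, self.n):
            num = self.ngram_counts[gram] + self.alpha
            den = self.context_counts[gram[:-1]] + self.alpha * self.num_tags
            total += math.log(num / den)
        return total


def log_likelihood_ratio(tags, native, non_native):
    """log P(doc | native model) - log P(doc | non-native model)."""
    return native.log_prob(tags) - non_native.log_prob(tags)


def is_native(tags, native, non_native, log_prior_ratio=0.0):
    """Bayes decision: choose 'native' when the log posterior ratio is positive."""
    return log_likelihood_ratio(tags, native, non_native) + log_prior_ratio > 0


# Toy data (hypothetical): each document is a sequence of POS tags.
TAGSET = {"DT", "NN", "VBZ", "JJ", "RB"}
native_corpus = [["DT", "NN", "VBZ", "DT", "NN"], ["DT", "JJ", "NN", "VBZ", "RB"]]
non_native_corpus = [["NN", "VBZ", "NN"], ["NN", "VBZ", "JJ", "NN"]]

native_model = NGramModel(native_corpus, n=2, tagset=TAGSET)
non_native_model = NGramModel(non_native_corpus, n=2, tagset=TAGSET)

doc_a = ["DT", "NN", "VBZ", "DT", "NN"]
doc_b = ["NN", "VBZ", "NN"]
print(is_native(doc_a, native_model, non_native_model))
print(is_native(doc_b, native_model, non_native_model))
```

With a large n, most n-grams of a test document are unseen in training data, so the smoothed ratio above is dominated by the smoothing constant rather than by evidence. That is the sparseness problem the abstract describes, and the reason the paper estimates the ratio by hypothesis testing instead.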