2022 Volume 29 Issue 2 Pages 542-586
Inadequate training data makes neural grammatical error correction less effective. Recently, researchers have proposed data augmentation methods to address this problem. These methods rest on three assumptions: (1) error diversity in the generated data contributes to performance improvement; (2) generating errors of a certain type improves the correction of errors of the same type; and (3) using a larger corpus for error generation yields better performance. In this study, we design multiple error generation rules for various grammatical categories and propose a method that combines these rules, which allows us to validate the above assumptions by varying the error types in the generated data. The results show that assumptions (1) and (2) are valid, whereas the validity of assumption (3) depends on the number of training steps and the number of generated errors. Furthermore, our proposed method can train a high-performance model even in unsupervised settings and corrects writing errors more effectively than a model based on round-trip translation. Finally, we find that the error types corrected by models based on round-trip and back translation differ from those corrected by our method.
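The abstract does not include implementation details of the error generation rules or how they are combined. Purely as an illustration, here is a minimal Python sketch of what rule-based error generation and rule combination for data augmentation might look like; the rule names, word lists, and application probability are hypothetical and are not the authors' actual rules.

```python
import random

# Hypothetical word lists for two grammatical categories.
ARTICLES = {"a", "an", "the"}
PREPOSITIONS = {"in", "on", "at", "for", "to", "of", "with"}

def drop_article(tokens):
    """Delete one article to simulate an article-omission error."""
    idxs = [i for i, t in enumerate(tokens) if t.lower() in ARTICLES]
    if not idxs:
        return None  # rule does not apply to this sentence
    i = random.choice(idxs)
    return tokens[:i] + tokens[i + 1:]

def swap_preposition(tokens):
    """Replace one preposition with another to simulate a preposition error."""
    idxs = [i for i, t in enumerate(tokens) if t.lower() in PREPOSITIONS]
    if not idxs:
        return None
    i = random.choice(idxs)
    choices = sorted(PREPOSITIONS - {tokens[i].lower()})
    return tokens[:i] + [random.choice(choices)] + tokens[i + 1:]

def corrupt(tokens, rules, p=0.5):
    """Combine several rules: apply each with probability p, in order."""
    noisy = list(tokens)
    for rule in rules:
        if random.random() < p:
            out = rule(noisy)
            if out is not None:
                noisy = out
    return noisy

# Usage: turn a clean sentence into a synthetic (noisy source, clean target) pair.
clean = "She is good at playing the piano".split()
noisy = corrupt(clean, [drop_article, swap_preposition])
print(" ".join(noisy), "->", " ".join(clean))
```

Under this kind of scheme, varying which rules are enabled controls the error types present in the generated data, which is the lever the study uses to test the three assumptions.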