2024 Volume 31 Issue 4 Pages 1458-1486
Paraphrases with significant surface-level differences are valuable for data augmentation, yet generating them is known to be difficult. In this study, we develop a model that generates such paraphrases using a simple mechanism for controlling similarity: tags indicating semantic and lexical similarity are prepended to the input sentence. We construct a training corpus by selecting paraphrase pairs with dissimilar surface forms from a large set of pseudo-paraphrases produced by round-trip translation. Experimental results demonstrate the effectiveness of our approach for data augmentation in contrastive learning and in the pre-fine-tuning of pretrained language models. Our analyses further show that (1) the appropriate degree of paraphrase similarity depends heavily on the downstream task and (2) mixing paraphrases with different degrees of similarity degrades downstream task performance.
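The tag-based control mechanism can be illustrated with a minimal sketch. The tag names, similarity buckets, and thresholds below are illustrative assumptions, not the authors' actual configuration; they only show how similarity tags would be prepended to an input sentence before it is passed to a seq2seq paraphraser.

```python
# Minimal sketch of similarity-tag conditioning for a seq2seq paraphraser.
# The tag vocabulary, bucket boundaries, and example sentence are illustrative
# assumptions, not the paper's actual configuration.

def similarity_bucket(score: float) -> str:
    """Map a continuous similarity score in [0, 1] to a coarse bucket label."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "mid"
    return "low"


def add_control_tags(sentence: str, semantic_sim: float, lexical_sim: float) -> str:
    """Prepend semantic- and lexical-similarity tags to the input sentence,
    indicating how close the generated paraphrase should stay to the source."""
    sem_tag = f"<sem_{similarity_bucket(semantic_sim)}>"
    lex_tag = f"<lex_{similarity_bucket(lexical_sim)}>"
    return f"{sem_tag} {lex_tag} {sentence}"


if __name__ == "__main__":
    source = "The committee postponed the decision until next week."
    # Request a paraphrase that preserves meaning (high semantic similarity)
    # while changing the wording substantially (low lexical similarity).
    tagged = add_control_tags(source, semantic_sim=0.9, lexical_sim=0.2)
    print(tagged)
    # -> "<sem_high> <lex_low> The committee postponed the decision until next week."
    # At training time, the tags would be derived from each pseudo-paraphrase pair;
    # at inference time, the tagged string is fed to the fine-tuned seq2seq model.
```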