2021 Volume 28 Issue 2 Pages 508-531
We present a word embedding-based monolingual phrase aligner. In monolingual phrase alignment, an aligner identifies the set of phrasal paraphrases in a sentence pair. Previous methods required large-scale lexica or high-quality parsers, which makes them difficult to apply to languages other than English. In contrast, the proposed method uses only a pre-trained word embedding model and thus relies solely on raw monolingual corpora. Our method first obtains word alignments from pre-trained word embeddings and then extends them to phrase alignments with a heuristic. It then composes phrase representations from word embeddings and searches for a set of consistent phrase alignments on a lattice of phrase alignment candidates. Experimental results on an English dataset show that our method outperforms the previous phrase aligner. We also constructed a Japanese dataset for analysis and confirmed that our method works on languages other than English.
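To make the pipeline concrete, the sketch below illustrates two of the steps the abstract names: aligning words by embedding similarity and composing a phrase representation from word vectors. The toy embeddings, the greedy alignment heuristic, the similarity threshold, and averaging as the composition function are all assumptions for illustration; the paper's actual alignment and composition methods may differ.

```python
import numpy as np

# Toy "pre-trained" embeddings (hypothetical; a real system would load
# vectors trained on raw monolingual corpora, e.g. word2vec or GloVe).
emb = {
    "big":    np.array([0.90, 0.10, 0.00]),
    "large":  np.array([0.85, 0.15, 0.00]),
    "dog":    np.array([0.00, 0.90, 0.40]),
    "canine": np.array([0.05, 0.85, 0.45]),
    "runs":   np.array([0.20, 0.10, 0.90]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def align_words(src, tgt, threshold=0.9):
    """Greedy word alignment (an assumed heuristic): link each source
    word to its most similar target word if similarity clears the
    threshold."""
    links = []
    for i, s in enumerate(src):
        j, sim = max(((j, cosine(emb[s], emb[t])) for j, t in enumerate(tgt)),
                     key=lambda x: x[1])
        if sim >= threshold:
            links.append((i, j))
    return links

def phrase_vector(words):
    """Compose a phrase representation by averaging word vectors
    (one simple composition; shown only as an example)."""
    return np.mean([emb[w] for w in words], axis=0)

# Word-level links between two paraphrastic toy sentences.
links = align_words(["big", "dog", "runs"], ["large", "canine", "runs"])
print(links)  # each pair is (source index, target index)

# Similarity between two candidate phrase alignments.
sim = cosine(phrase_vector(["big", "dog"]), phrase_vector(["large", "canine"]))
print(sim)
```

Word links such as these would then be extended to phrase alignment candidates, with a lattice search selecting a mutually consistent set.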