2025 Volume 21 Issue 2 Pages 53-61
Creating parallel corpora from texts with similar plots but from different historical periods is valuable for efficient diachronic comparative studies and quantitative analysis. This paper examines methods for automatic word alignment in historical Japanese texts, focusing on The Tales of the Heike (Amakusa Edition) and the source text of its vernacular translation.
A straightforward approach to word alignment is to use the edit distance between lemma strings, but this method struggles to identify “substitution relationships between different words,” that is, cases where one text uses a different word in place of the word in the other. To address this limitation, we employ Word2Vec, a word vector model that represents semantic similarities between words numerically, enabling more accurate alignment than edit distance alone.
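The contrast between the two approaches can be illustrated with a minimal sketch. It is not the implementation used in the paper: the function names (`string_cost`, `make_vector_cost`, `align`), the gap penalty, the romanized lemmas, and the two-dimensional vectors are all assumptions made for illustration, and the hand-written vectors merely stand in for embeddings that a trained Word2Vec model would supply.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def string_cost(x, y):
    """Substitution cost from lemma strings: normalized edit distance in [0, 1]."""
    if not x and not y:
        return 0.0
    return edit_distance(x, y) / max(len(x), len(y))

def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    nu = math.sqrt(sum(p * p for p in u))
    nv = math.sqrt(sum(q * q for q in v))
    return dot / (nu * nv) if nu and nv else 0.0

def make_vector_cost(vectors):
    """Substitution cost from word vectors; falls back to string_cost if a lemma has no vector."""
    def cost(x, y):
        if x == y:
            return 0.0
        if x in vectors and y in vectors:
            return 1.0 - max(0.0, cosine(vectors[x], vectors[y]))
        return string_cost(x, y)
    return cost

def align(src, tgt, sub_cost, gap=0.35):
    """Global alignment of two lemma sequences; returns the aligned (src, tgt) pairs."""
    n, m = len(src), len(tgt)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    # Trace back from the end, preferring a pairing over a gap on ties.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if math.isclose(dp[i][j], dp[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1])):
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif math.isclose(dp[i][j], dp[i - 1][j] + gap):
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# Invented toy data (not taken from the corpus).
src = ["kimi", "mousu"]
tgt = ["kimi", "iu"]
toy_vectors = {"kimi": [0.1, 0.9], "mousu": [0.9, 0.1], "iu": [0.8, 0.2]}

by_string = align(src, tgt, string_cost)
by_vector = align(src, tgt, make_vector_cost(toy_vectors))
print(by_string)
print(by_vector)
```

With these toy settings, the string-based cost leaves the two dissimilar lemmas unaligned because skipping both is cheaper than substituting one for the other, while the vector-based cost pairs them because their vectors are close. The gap penalty controls this trade-off and would need tuning on real data.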