Research in natural language processing has shown growing interest in using corpora to automatically detect and analyze words whose meanings change over time. Diachronic corpora and evaluation word lists have been established for languages such as English and German, but comparable resources are lacking for Japanese. This study addresses the gap by introducing the Japanese Lexical Semantic Change Detection Dataset (JaSemChange), which provides a list of evaluation words for Japanese. Drawing on three diachronic corpora spanning near-modern to contemporary Japanese, we sampled pairs of usages for each target word. A team of four experts annotated a total of 2,280 usage pairs with semantic similarity judgments, from which the degree of semantic change of each target word is gauged. We then used the dataset to assess how well word embedding-based methods detect semantic change. With frequency-based methods as a baseline, we compared the effectiveness of typical type-based and token-based methods and examined their respective characteristics. The dataset, comprising the list of words with their assigned degrees of semantic change and the annotation scores for the usage pairs, is publicly available on GitHub.
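To make concrete the idea of gauging semantic change from usage-pair similarity judgments, the following is a minimal sketch of one possible aggregation. It is not the procedure used for JaSemChange. The 1 to 4 rating scale, the function name `change_scores`, the input layout, and the toy values are all assumptions introduced for illustration only.

```python
from collections import defaultdict
from statistics import mean


def change_scores(annotations, scale_min=1.0, scale_max=4.0):
    """Aggregate usage-pair similarity ratings into a per-word change score.

    annotations: iterable of (word, pair_id, ratings), where ratings holds one
    similarity judgment per annotator. The 1-4 scale is an assumption.
    Returns a dict mapping word -> score in [0, 1]; higher means the sampled
    usages were judged less similar, i.e. more apparent change.
    """
    per_word = defaultdict(list)
    for word, _pair_id, ratings in annotations:
        # Average the annotators' ratings for each usage pair first.
        per_word[word].append(mean(ratings))

    scores = {}
    for word, pair_means in per_word.items():
        # Normalize the mean similarity to [0, 1], then invert it.
        similarity = (mean(pair_means) - scale_min) / (scale_max - scale_min)
        scores[word] = 1.0 - similarity
    return scores


# Hypothetical toy data: two words, two usage pairs each, four annotators.
toy = [
    ("word_a", 0, [4, 4, 4, 4]),
    ("word_a", 1, [4, 4, 4, 4]),
    ("word_b", 0, [1, 2, 1, 2]),
    ("word_b", 1, [2, 1, 2, 1]),
]
scores = change_scores(toy)
```

In this toy input, `word_a` receives a lower score than `word_b` because its usage pairs were all rated as maximally similar. A real pipeline would also need to decide how to handle annotator disagreement and missing judgments, which this sketch ignores.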