Noise-aware Character Alignment for Extracting Transliteration Fragments

Katsuhito Sudoh; Shinsuke Mori; Masaaki Nagata

doi:10.11185/imt.10.88

Media (processing) and Interaction

Keywords: Statistical Machine Transliteration, Bayesian Many-to-many Alignment, Machine Translation

Journal, free access

2015, Volume 10, Issue 1, pp. 88-112

DOI https://doi.org/10.11185/imt.10.88

Abstract

This paper proposes a novel noise-aware character alignment method for automatically extracting transliteration fragments in phrase pairs that are extracted from parallel corpora. The proposed method extends a many-to-many Bayesian character alignment method by distinguishing transliteration (signal) parts from non-transliteration (noise) parts. The model can be trained efficiently by a state-based blocked Gibbs sampling algorithm with signal and noise states. The proposed method bootstraps statistical machine transliteration using the extracted transliteration fragments to train transliteration models. In experiments using Japanese-English patent data, the proposed method was able to extract transliteration fragments with much less noise than an IBM-model-based baseline, and achieved better transliteration performance than sample-wise extraction in transliteration bootstrapping.
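As a rough illustration of the extraction step only, the following Python sketch assumes a character alignment has already been produced and that each aligned unit carries a signal or noise label. The function name `extract_fragments`, the tuple layout, and the example phrase pair are hypothetical and are not taken from the paper; the Bayesian alignment model and the blocked Gibbs sampler themselves are not reproduced here.

```python
from typing import List, Tuple

# One aligned unit: (source characters, target characters, state).
# The state is "signal" for a transliteration part and "noise" otherwise.
# This layout is an assumption made for illustration.
AlignedUnit = Tuple[str, str, str]


def extract_fragments(units: List[AlignedUnit]) -> List[Tuple[str, str]]:
    """Join each maximal run of consecutive signal units into one fragment."""
    fragments: List[Tuple[str, str]] = []
    src_buf: List[str] = []
    tgt_buf: List[str] = []
    for src, tgt, state in units:
        if state == "signal":
            src_buf.append(src)
            tgt_buf.append(tgt)
        elif src_buf or tgt_buf:
            # A noise unit closes the current signal run.
            fragments.append(("".join(src_buf), "".join(tgt_buf)))
            src_buf, tgt_buf = [], []
    if src_buf or tgt_buf:
        fragments.append(("".join(src_buf), "".join(tgt_buf)))
    return fragments


# Hand-made example (not from the paper's data): the article "the " has no
# transliteration counterpart and is labelled as noise.
example: List[AlignedUnit] = [
    ("the ", "", "noise"),
    ("pri", "プリ", "signal"),
    ("n", "ン", "signal"),
    ("ter", "タ", "signal"),
]
fragments = extract_fragments(example)
```

In this sketch the noise units act only as separators, so a phrase pair that mixes translated and transliterated parts yields just its transliterated runs as candidate training pairs.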
