2019, Volume 40, Issue 6, pp. 374-381
This paper proposes an encoder-decoder based sequence-to-sequence model for grapheme-to-phoneme (G2P) conversion in Bangla (exonym: Bengali). G2P models are key components of speech recognition and speech synthesis systems, as they describe how words are pronounced. Traditional rule-based models do not perform well in unseen contexts. We propose adopting a neural machine translation (NMT) approach to the G2P problem, building our model with a gated recurrent unit (GRU) recurrent neural network (RNN). In contrast to joint-sequence based G2P models, our encoder-decoder model has the flexibility of not requiring explicit grapheme-to-phoneme alignments, which are not straightforward to produce. We trained our model on a pronunciation dictionary of approximately 135,000 entries and obtained a word error rate (WER) of 12.49%, a significant improvement over existing rule-based and machine-learning based Bangla G2P models.
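As a rough illustration of the encoder half of such a GRU-based sequence-to-sequence model, the sketch below runs a GRU cell over a one-hot grapheme sequence to produce a fixed-size context vector. The toy word, vocabulary, hidden size, and random weights are illustrative assumptions only, not the paper's actual configuration or training setup.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU time step: update gate z, reset gate r, candidate state h_tilde."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
graphemes = ["k", "a", "l"]               # hypothetical toy grapheme sequence
vocab = {g: i for i, g in enumerate(sorted(set(graphemes)))}
V, H = len(vocab), 8                      # vocabulary size, hidden size (assumed)

# Input-to-hidden matrices W* are (H, V); hidden-to-hidden matrices U* are (H, H).
p = {name: rng.normal(scale=0.1, size=(H, V if name.startswith("W") else H))
     for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
for name in ("bz", "br", "bh"):
    p[name] = np.zeros(H)

h = np.zeros(H)                           # encoder hidden state
for g in graphemes:
    x = np.zeros(V)
    x[vocab[g]] = 1.0                     # one-hot encoding of the grapheme
    h = gru_step(x, h, p)

print(h.shape)                            # final h is the fixed-size context vector
```

In the full model, a decoder GRU would be conditioned on this context vector and would emit phonemes one at a time, which is what removes the need for an explicit grapheme-to-phoneme alignment.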