Journal of Robotics and Mechatronics
Online ISSN : 1883-8049
Print ISSN : 0915-3942
ISSN-L : 0915-3942
Special Issue on Robot Audition Technologies
Psychologically-Inspired Audio-Visual Speech Recognition Using Coarse Speech Recognition and Missing Feature Theory
Kazuhiro Nakadai, Tomoaki Koiwa
Open-access journal article

2017, Vol. 29, No. 1, pp. 105-113

Abstract

Audio-visual speech recognition (AVSR) is a promising approach to improving the noise robustness of speech recognition in the real world. In AVSR, the auditory and visual units are the phoneme and the viseme, respectively. However, these are often misclassified in real-world conditions because of noisy input. To solve this problem, we propose two psychologically-inspired approaches. One is audio-visual integration based on missing feature theory (MFT), which copes with missing or unreliable audio and visual features during recognition. The other is phoneme and viseme grouping based on coarse-to-fine recognition. Preliminary experiments show that both approaches are effective for audio-visual speech recognition. Integration based on MFT with an appropriate stream weight improves recognition performance even at a signal-to-noise ratio of −5 dB, a noisy condition in which most speech recognition systems do not work properly. Phoneme and viseme grouping further improved AVSR performance, particularly at low signal-to-noise ratios.*

* This work is an extension of our publication “Tomoaki Koiwa et al.: Coarse speech recognition by audio-visual integration based on missing feature theory, IROS 2007, pp. 1751-1756, 2007.”
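The MFT-based integration described in the abstract combines audio and visual log-likelihoods while discounting unreliable features. The sketch below is a minimal illustration of that idea, assuming per-frame log-likelihoods and binary reliability masks for each stream; the function name, the single scalar stream weight, and the binary masks are illustrative assumptions, not details taken from the paper.

```python
def mft_av_score(audio_logp, visual_logp, audio_mask, visual_mask, w_audio=0.7):
    """Combine per-frame audio and visual log-likelihoods under
    missing feature theory (MFT): features flagged as unreliable
    (mask value 0) contribute nothing to the combined score, and a
    stream weight balances the two modalities."""
    w_visual = 1.0 - w_audio
    score = 0.0
    for la, lv, ma, mv in zip(audio_logp, visual_logp, audio_mask, visual_mask):
        score += w_audio * ma * la + w_visual * mv * lv
    return score


# Example: the second audio frame is masked out as unreliable
# (e.g., corrupted by noise), so only its visual counterpart counts.
s = mft_av_score(
    audio_logp=[-1.0, -2.0],
    visual_logp=[-3.0, -4.0],
    audio_mask=[1, 0],
    visual_mask=[1, 1],
    w_audio=0.5,
)
```

In a full recognizer this score would be computed per HMM state inside the decoder, and the stream weight would be tuned to the noise condition; the paper's contribution is precisely that choosing this weight appropriately keeps recognition working at low signal-to-noise ratios.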


© 2017 Fuji Technology Press Ltd.

This article is licensed under a Creative Commons [Attribution-NoDerivatives 4.0 International] license (https://creativecommons.org/licenses/by-nd/4.0/).
The journal is fully Open Access under Creative Commons licenses and all articles are free to access at JRM Official Site.
https://www.fujipress.jp/jrm/rb-about/