Journal of Robotics and Mechatronics
Online ISSN : 1883-8049
Print ISSN : 0915-3942
ISSN-L : 0915-3942
Special Issue on Robot Audition Technologies
Psychologically-Inspired Audio-Visual Speech Recognition Using Coarse Speech Recognition and Missing Feature Theory
Kazuhiro Nakadai, Tomoaki Koiwa
Open-access journal article

2017, Vol. 29, No. 1, pp. 105-113

Abstract

Audio-visual speech recognition (AVSR) is a promising approach to improving the noise robustness of speech recognition in the real world. In AVSR, the auditory and visual units are the phoneme and the viseme, respectively. However, these are often misclassified in real-world conditions because of noisy input. To solve this problem, we propose two psychologically-inspired approaches. One is audio-visual integration based on missing feature theory (MFT), which copes with missing or unreliable audio and visual features during recognition. The other is phoneme and viseme grouping based on coarse-to-fine recognition. Preliminary experiments show that both approaches are effective for audio-visual speech recognition. Integration based on MFT with an appropriate stream weight improves recognition performance even at a signal-to-noise ratio of −5 dB, a noisy condition in which most speech recognition systems do not work properly. Phoneme and viseme grouping further improved AVSR performance, particularly at low signal-to-noise ratios.*

* This work is an extension of our publication “Tomoaki Koiwa et al.: Coarse speech recognition by audio-visual integration based on missing feature theory, IROS 2007, pp. 1751-1756, 2007.”
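The MFT-based integration described in the abstract combines audio and visual log-likelihoods while discounting unreliable features. The sketch below is a minimal illustration of that idea, assuming per-frame log-likelihoods and binary reliability masks for each stream; the function name, the single scalar stream weight, and the binary masks are illustrative assumptions, not details taken from the paper.

```python
def mft_av_score(audio_logp, visual_logp, audio_mask, visual_mask, w_audio=0.7):
    """Combine per-frame audio and visual log-likelihoods under
    missing feature theory (MFT): features flagged as unreliable
    (mask value 0) contribute nothing to the combined score, and a
    stream weight balances the two modalities."""
    w_visual = 1.0 - w_audio
    score = 0.0
    for la, lv, ma, mv in zip(audio_logp, visual_logp, audio_mask, visual_mask):
        score += w_audio * ma * la + w_visual * mv * lv
    return score


# Example: the second audio frame is masked out as unreliable
# (e.g., corrupted by noise), so only its visual counterpart counts.
s = mft_av_score(
    audio_logp=[-1.0, -2.0],
    visual_logp=[-3.0, -4.0],
    audio_mask=[1, 0],
    visual_mask=[1, 1],
    w_audio=0.5,
)
```

In a full recognizer this score would be computed per HMM state inside the decoder, and the stream weight would be tuned to the noise condition; the paper's contribution is precisely that choosing this weight appropriately keeps recognition working at low signal-to-noise ratios.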


© 2017 Fuji Technology Press Ltd.

This article is licensed under a Creative Commons [Attribution-NoDerivatives 4.0 International] license (https://creativecommons.org/licenses/by-nd/4.0/).
The journal is fully Open Access under Creative Commons licenses and all articles are free to access at JRM Official Site.
https://www.fujipress.jp/jrm/rb-about/