Abstract
Voice activity detection (VAD) is an essential technique for developing sophisticated voice interfaces. However, a VAD method with sufficient detection capability has not yet been presented. In particular, accurately detecting the beginning and end of a word in noisy environments is difficult. In this paper, we describe extended models with multi-condition training (extended MC-models) that cope with misdetection, and we evaluate their noise robustness through a large number of word recognition simulations. In the simulations, the recognition performance of simple whole-word models degraded when the input speech signal was accompanied by non-speech segments, whereas the extended MC-models maintained their performance. Furthermore, with practical applications in mind, we carried out simulations that combined the CENSREC-1-C baseline VAD with the extended MC-models. These results also showed the usefulness of the extended MC-models under signal-to-noise ratio conditions of 20 dB and 10 dB.