2021 Volume 29 Pages 275-282
In this paper, we propose a replay attack detection (RAD) method that uses spatial and spectral features of a stereo signal. To distinguish genuine utterances from replayed ones, we focus on non-speech segments, in which a human speaker emits no sound but a loudspeaker used for a replay attack may still emit recorded noise or its own electromagnetic noise. Spatial features based on the generalized cross-correlation (GCC) capture this difference. To improve robustness across a variety of recording environments, we combine the spatial features with spectral features. Specifically, we fuse the output scores of the GCC-based and spectral feature-based methods. Experiments confirm the effectiveness of combining spatial and spectral features.
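As a rough illustration of the two ingredients named above, the following minimal sketch computes a GCC function for one stereo frame and fuses two detector scores. Several details are assumptions made for illustration and are not taken from the paper: the PHAT (phase transform) weighting, the naive DFT used to keep the example dependency-free, the weighted-sum fusion rule, and the names `dft`, `gcc_phat`, `fuse_scores`, and `weight`.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; adequate for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        # The inverse transform carries the 1/n normalization.
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(left, right, eps=1e-12):
    """GCC with PHAT weighting (an assumed weighting) for one stereo frame.

    Returns one value per circular lag. A sharp peak at lag d means the
    left channel looks like the right channel delayed by d samples.
    """
    spec_left = dft(left)
    spec_right = dft(right)
    # Cross-power spectrum of the two channels.
    cross = [a * b.conjugate() for a, b in zip(spec_left, spec_right)]
    # PHAT: discard magnitude and keep only the phase of each bin.
    whitened = [c / (abs(c) + eps) for c in cross]
    return [v.real for v in dft(whitened, inverse=True)]


def fuse_scores(spatial_score, spectral_score, weight=0.5):
    """Weighted-sum score fusion (hypothetical rule; weight is a free parameter)."""
    return weight * spatial_score + (1.0 - weight) * spectral_score
```

In a detector along these lines, `gcc_phat` would be applied to frames from non-speech segments, and the resulting vectors would feed a classifier whose score is then combined with a spectral-feature score through `fuse_scores`. How the actual system builds its features and fuses its scores may differ.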