Abstract
This paper proposes a method for estimating the location of a sound source by fusing auditory and visual information with a Bayesian network. Because different auditory categories correspond to different visual features, the sound signal is first classified as speech or non-speech; speech is associated with skin-color features in the image, and non-speech with other color features. After modeling the skin-color feature with a Gaussian mixture model, we use the Bayesian network to infer whether each pixel in the image corresponds to the sound source. Experimental results are presented to demonstrate the effectiveness of the proposed method.
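As a rough illustration of the per-pixel inference described above, the following minimal sketch scores a pixel with a skin-color Gaussian mixture and combines that score with the sound class through Bayes' rule. It is not the paper's network: the two-hypothesis structure, the mixture parameters in `SKIN_GMM`, the uniform background density, the prior `prior_source`, and all function names are assumptions made for illustration. The non-speech branch is left uninformative because the abstract does not specify the non-skin color model.

```python
import math

# Hypothetical skin-color mixture in normalized (r, g) chromaticity space:
# (weight, mean_r, mean_g, var_r, var_g). Values are illustrative, not fitted.
SKIN_GMM = [
    (0.6, 0.45, 0.31, 0.0020, 0.0010),
    (0.4, 0.40, 0.33, 0.0030, 0.0015),
]

# Assumed background model: uniform density over the chromaticity triangle
# r >= 0, g >= 0, r + g <= 1, whose area is 0.5.
BACKGROUND_DENSITY = 2.0


def gaussian_diag(x, y, mean_x, mean_y, var_x, var_y):
    """Density of a 2-D Gaussian with diagonal covariance."""
    norm = 1.0 / (2.0 * math.pi * math.sqrt(var_x * var_y))
    expo = -0.5 * ((x - mean_x) ** 2 / var_x + (y - mean_y) ** 2 / var_y)
    return norm * math.exp(expo)


def skin_likelihood(r, g, gmm=SKIN_GMM):
    """Mixture density p(color | skin) as a weighted sum of components."""
    return sum(w * gaussian_diag(r, g, mr, mg, vr, vg)
               for w, mr, mg, vr, vg in gmm)


def source_posterior(r, g, is_speech, prior_source=0.1):
    """P(pixel belongs to the sound source | color, sound class).

    Assumption: for speech, source pixels follow the skin mixture and all
    other pixels follow the uniform background. For non-speech, color is
    treated as uninformative, so the posterior equals the prior.
    """
    like_other = BACKGROUND_DENSITY
    like_source = skin_likelihood(r, g) if is_speech else BACKGROUND_DENSITY
    numerator = like_source * prior_source
    return numerator / (numerator + like_other * (1.0 - prior_source))


if __name__ == "__main__":
    # A skin-like pixel and a clearly non-skin pixel under a speech signal.
    print(source_posterior(0.45, 0.31, is_speech=True))
    print(source_posterior(0.20, 0.20, is_speech=True))
```

Applying `source_posterior` to every pixel would yield a probability map whose high-valued region is a candidate source location; under these assumptions, a speech signal raises the posterior only for pixels whose chromaticity lies near the skin mixture.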