2016 Volume 4 Issue 1 Pages 68-77
A new decision-level fusion (DLF)-based speech segment detection method and its application to audio noise removal for video conferences are presented in this paper. The proposed method calculates visual and audio features from video sequences and audio signals, respectively, obtained in video conferences. Features extracted from mouth regions of participants and attribution degrees of speech class are used as visual and audio features, respectively, and Support Vector Machine (SVM)-based classification is performed by using each kind of feature. The SVM classifier performs two-class classification of speech and non-speech segments to realize speech segment detection. From the detection results obtained from the visual and audio features, DLF based on Supervised Learning from Multiple Experts is performed to successfully obtain the final detection results with focus on the accuracy of each detection result. Then, from audio signals in the non-speech segments detected by our method, we can extract noise information to realize accurate audio noise removal in the speech segments.