Proceedings of the Fuzzy System Symposium
Session ID : 2C3-3
Extraction of features to identify a speaker in videos using deep learning
*Minori Omura, Yutaka Matsushita, Junnosuke Suzumori
Abstract

This study examines whether discriminant analysis or neural networks are more effective at predicting utterance versus non-utterance, using lip features as explanatory variables. First, the maximum amplitude and frequency derived from the lip-movement wave, together with the coordinates of four fixed points on the lips, are defined as feature values. For the coordinates, three cases are considered: using both the x- and y-coordinates, only the x-coordinate, and only the y-coordinate. Second, these feature values are applied to discriminant analysis and to neural networks to predict utterance versus non-utterance. The results show that a neural network using only the y-coordinates of the lips as explanatory variables achieves high prediction accuracy.
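The comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature values here are synthetic stand-ins for the paper's lip features (maximum amplitude and frequency of the lip-movement wave plus lip-point coordinates), and scikit-learn's linear discriminant analysis and a small feed-forward network are assumed as generic realizations of the two classifier families.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's features: per frame window, the maximum
# amplitude and frequency of the lip-movement wave, plus the y-coordinates of
# four fixed lip points (6 features total). Utterance frames get larger values.
n = 400
speaking = rng.integers(0, 2, size=n)            # 1 = utterance, 0 = no utterance
amplitude = rng.normal(loc=1.0 + speaking, scale=0.4, size=n)
frequency = rng.normal(loc=2.0 + speaking, scale=0.5, size=n)
lip_y = rng.normal(loc=speaking[:, None] * 0.8, scale=0.5, size=(n, 4))
X = np.column_stack([amplitude, frequency, lip_y])
y = speaking

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Discriminant-analysis baseline
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

# Small feed-forward neural network
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_tr, y_tr)

print(f"LDA accuracy: {lda.score(X_te, y_te):.2f}")
print(f"MLP accuracy: {mlp.score(X_te, y_te):.2f}")
```

On real data, the same loop would be repeated for the three coordinate cases (x and y, x only, y only) to reproduce the paper's comparison.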

© 2023 Japan Society for Fuzzy Theory and Intelligent Informatics