Quantification of Multi-Person Meetings Based on Multimodal Micro-Behavior Analysis

陳 辰昊; 徳原 耕亮; 荒川 豊; 渡辺 洸; 石丸 翔也

doi:10.11517/pjsai.JSAI2022.0_1P1GS1004

Abstract

In this paper, we present an end-to-end system for quantifying online meetings. It accurately detects and quantifies three micro-behavior indicators (speaking, nodding, and smiling) for online meeting evaluation. For active speaker detection (ASD), we build a multi-modal neural network framework consisting of audio and video temporal encoders, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. For nodding detection, we use the WHENet framework from head pose estimation (HPE) research to estimate head pitch angles as the nodding feature, and then build a gated recurrent unit (GRU) network with a squeeze-and-excitation (SE) module to recognize nodding movements in video. Finally, we use a Haar cascade classifier for smile detection. Experimental results with K-fold cross-validation show that the three detection modules achieve F1-scores of 94.9%, 79.67%, and 71.19%, respectively.
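The evaluation protocol named in the abstract (F1-score under K-fold cross-validation) can be illustrated with a minimal, standard-library sketch. It is not the authors' code. The abstract does not state the value of K, how the folds are drawn, or whether per-fold scores are averaged, so the contiguous folds and the mean over folds are assumptions made here. The names `f1_score`, `k_fold_indices`, `cross_validated_f1`, and `predict_fn` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous folds of near-equal size.

    Returns one (train_indices, test_indices) pair per fold.
    Assumption: no shuffling or stratification, which a real setup may need.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def cross_validated_f1(
    labels: Sequence[int],
    predict_fn: Callable[[List[int], List[int]], List[int]],
    k: int,
) -> float:
    """Mean F1 over k folds.

    predict_fn(train_idx, test_idx) is a hypothetical hook that fits a
    detector on the training indices and returns 0/1 predictions for the
    test indices, in order.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        preds = predict_fn(train_idx, test_idx)
        truth = [labels[i] for i in test_idx]
        scores.append(f1_score(truth, preds))
    return sum(scores) / len(scores)
```

In this sketch, each detection module (speaking, nodding, smile) would supply its own `predict_fn` and be scored separately, which matches the abstract's per-module reporting.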
