Host: The Japanese Society for Artificial Intelligence
Name: The 35th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 35
Location: [in Japanese]
Date: June 08, 2021 - June 11, 2021
Providing feedback to a speaker is an essential communication signal for maintaining a conversation. In addition to verbal feedback responses, facial expressions are an effective modality for conveying the listener's reaction to the speaker's utterances. Moreover, not only the type of facial expression but also its intensity may influence the meaning of a specific feedback response. In this study, we propose a multimodal deep neural network model that predicts the intensity of facial expressions co-occurring with feedback responses. We collected 33 video-mediated conversations among groups of three people and obtained language, facial, and audio data for each participant. We also annotated feedback responses and clustered their BERT embeddings to classify the feedback responses into categories. In the proposed method, a decoder with an attention mechanism over the audio, visual, and language modalities produces frame-by-frame intensities for 17 action units (AUs), and a feedback-label classifier is trained jointly through multi-task learning. In the evaluation of feedback-label prediction, performance varied considerably across categories. For AU intensity prediction, the multi-task model achieved a smaller loss than the single-task model, indicating better performance.
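To make the described architecture concrete, the following is a minimal PyTorch-style sketch of a multi-task model with an attention-based decoder over frame-aligned audio, visual, and language features, one head for frame-wise intensities of 17 AUs and one for feedback-label classification. All layer sizes, the additive fusion of modality features, the GRU decoder, and the loss weights are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalFeedbackModel(nn.Module):
    def __init__(self, d_audio=88, d_visual=35, d_lang=768,
                 d_model=256, n_aus=17, n_labels=5):
        super().__init__()
        # Project each modality into a shared feature space (dimensions assumed).
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.proj_visual = nn.Linear(d_visual, d_model)
        self.proj_lang = nn.Linear(d_lang, d_model)
        # Self-attention over the fused frame sequence, followed by a GRU decoder.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        # Task heads: frame-wise AU intensity regression and feedback-label classification.
        self.au_head = nn.Linear(d_model, n_aus)
        self.label_head = nn.Linear(d_model, n_labels)

    def forward(self, audio, visual, lang):
        # Each input: (batch, frames, feature_dim), assumed frame-aligned across modalities.
        x = self.proj_audio(audio) + self.proj_visual(visual) + self.proj_lang(lang)
        x, _ = self.attn(x, x, x)                      # attention over frames
        h, _ = self.decoder(x)                         # (batch, frames, d_model)
        au_intensity = self.au_head(h)                 # frame-by-frame intensities for 17 AUs
        label_logits = self.label_head(h.mean(dim=1))  # one feedback label per segment
        return au_intensity, label_logits

def multitask_loss(au_pred, au_true, label_logits, label_true, alpha=1.0, beta=1.0):
    # Joint objective: AU intensity regression plus feedback-label classification.
    return (alpha * F.mse_loss(au_pred, au_true)
            + beta * F.cross_entropy(label_logits, label_true))

In this sketch, the two heads share the encoder and decoder, so minimizing the joint loss corresponds to the multi-task setup compared against the single-task (AU-only) baseline in the abstract.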