2015, Vol. 19, No. 1, pp. 53-67
To develop an automatic emotion estimation system based on speaker information collected during face-to-face conversation, an extensive exploration of the multimodal features of speakers is required. To satisfy this requirement, a multimodal Japanese dialog corpus with dynamic emotional states was created by recording the vocal and facial expressions and physiological reactions of various speakers. Estimation experiments based on a mixed-effect model and multiple regression analysis were conducted to elucidate the relevant features for speaker-independent and speaker-specific emotion estimation. The results revealed that vocal features were most relevant for speaker-independent emotion estimation, whereas facial features were most relevant for speaker-specific emotion estimation.
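The abstract names two analysis tools: a mixed-effect model (for the speaker-independent view) and multiple regression (for the speaker-specific view). The sketch below illustrates, under assumptions, how such an analysis could be set up in Python with statsmodels; the feature columns (vocal, facial, physio), speaker IDs, and the synthetic data are invented for illustration and are not taken from the paper's corpus or feature set.

```python
# Hypothetical sketch: relating multimodal features to an emotion rating with
# (a) a mixed-effect model treating speaker as a random effect and
# (b) a separate multiple regression per speaker.
# Column names and data are invented for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_speakers, n_utts = 8, 50
rows = []
for spk in range(n_speakers):
    bias = rng.normal(0, 0.5)            # speaker-specific offset
    for _ in range(n_utts):
        vocal = rng.normal()              # e.g., an F0/intensity summary (assumed)
        facial = rng.normal()             # e.g., a facial-expression score (assumed)
        physio = rng.normal()             # e.g., a physiological signal level (assumed)
        emotion = 0.6 * vocal + 0.3 * facial + 0.1 * physio + bias + rng.normal(0, 0.3)
        rows.append(dict(speaker=f"s{spk}", vocal=vocal, facial=facial,
                         physio=physio, emotion=emotion))
df = pd.DataFrame(rows)

# (a) Speaker-independent analysis: random intercept per speaker (mixed-effect model)
mixed = smf.mixedlm("emotion ~ vocal + facial + physio", df, groups=df["speaker"]).fit()
print(mixed.summary())

# (b) Speaker-specific analysis: multiple regression fitted per speaker
for spk, g in df.groupby("speaker"):
    ols = smf.ols("emotion ~ vocal + facial + physio", data=g).fit()
    print(spk, ols.params.round(2).to_dict())
```

Comparing the fixed-effect coefficients in (a) with the per-speaker coefficients in (b) mirrors the paper's contrast between features relevant for speaker-independent versus speaker-specific estimation, though the actual feature definitions and model details are those reported in the paper, not this sketch.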