2019 Volume E102.D Issue 2 Pages 331-345
To enable long-term monitoring of daily office conversations without recording the conversational content, a system is presented for estimating acoustic nonverbal information such as utterance duration, utterance frequency, and turn-taking. The system combines a sound localization technique based on the sound energy distribution with 16 beam-forming microphone-array modules mounted in the ceiling to reduce the influence of multiple sound reflections. Furthermore, human detection using a wide-field-of-view camera is integrated into the system for more robust speaker estimation. The system estimates the speaker of each utterance and calculates nonverbal information from these estimates. An evaluation on data collected over ten 12-hour workdays in an office with three assigned workers showed that the system achieved 72% speech segmentation detection accuracy and, when utterances were correctly detected, 86% speaker identification accuracy. Despite false voice detections and incorrect speaker identifications, and even in cases where the participants frequently made noise or where seven participants had gathered for a discussion, the ranking of participants by the amount of calculated acoustic nonverbal information coincided with the ranking based on human-coded acoustic nonverbal information. Continuous analysis of communication dynamics such as dominance and conversation participation roles through nonverbal information will reveal the dynamics of a group. The main contribution of this study is to demonstrate the feasibility of unconstrained long-term monitoring of daily office activity through acoustic nonverbal information.
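As an illustration of the kind of acoustic nonverbal information named in the abstract, the following minimal sketch derives per-speaker utterance duration, utterance frequency, and turn-taking counts from speaker-labeled utterance segments. It is not the authors' implementation: the `Utterance` record, the `nonverbal_summary` function, and the definition of a turn as a change of speaker between consecutive utterances are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Utterance(NamedTuple):
    """Hypothetical record: one detected utterance with its estimated speaker."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def nonverbal_summary(utterances):
    """Return (duration, frequency, turns) computed from labeled utterances.

    duration:  total speaking time per speaker, in seconds
    frequency: number of utterances per speaker
    turns:     count of speaker changes, keyed by (previous, next) speaker
               (assumed definition of turn-taking for this sketch)
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    duration = defaultdict(float)
    frequency = Counter()
    turns = Counter()
    previous_speaker = None
    for u in ordered:
        duration[u.speaker] += u.end - u.start
        frequency[u.speaker] += 1
        if previous_speaker is not None and previous_speaker != u.speaker:
            turns[(previous_speaker, u.speaker)] += 1
        previous_speaker = u.speaker
    return dict(duration), dict(frequency), dict(turns)


# Illustrative input only; these values are not data from the study.
sample = [
    Utterance("A", 0.0, 2.0),
    Utterance("B", 2.5, 3.5),
    Utterance("A", 4.0, 5.0),
    Utterance("A", 5.5, 6.0),
]
duration, frequency, turns = nonverbal_summary(sample)
```

Because these quantities depend only on when each speaker was active, they can be computed without storing any conversational content, which matches the monitoring constraint described above.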