Transactions of the Japanese Society for Artificial Intelligence
Online ISSN : 1346-8030
Print ISSN : 1346-0714
ISSN-L : 1346-0714
Original Paper
Automatic Generation of Speech-Accompanying Gestures Using a Bi-Directional LSTM Network
金子 直史, 竹内 健太, 長谷川 大, 白川 真一, 佐久田 博司, 鷲見 和彦
Journal: free access

2019, Volume 34, Issue 6, pp. C-J41_1-12

Abstract

We present a novel framework for automatic speech-driven generation of natural gesture motion. The proposed method consists of two steps. First, based on a Bi-Directional LSTM Network, our deep network learns speech–gesture relationships with both forward and backward consistency over long time spans. At each time step, the network regresses a full 3D skeletal pose of a human from perceptual features extracted from the input audio. Second, we apply combined temporal filters to smooth out the generated pose sequences. We train our network on a speech–gesture dataset recorded with a headset and a marker-based motion capture system. We evaluate different acoustic features, network architectures, and temporal filters to validate the effectiveness of the proposed approach. We also conduct a subjective evaluation comparing our approach against real human gestures. The results show that our generated gestures are comparable in naturalness to “original” human gestures and are rated significantly more natural than “mismatched” human gestures taken from a different utterance.
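The two-step pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' code: the feature dimension, joint count, hidden size, and the simple moving-average filter (standing in for the paper's combined temporal filters) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeechToGesture(nn.Module):
    """Step 1: a Bi-Directional LSTM regresses a 3D skeletal pose from
    per-frame acoustic features (sizes are illustrative assumptions)."""
    def __init__(self, n_audio_feats=26, n_joints=20, hidden=256):
        super().__init__()
        # The bidirectional LSTM sees both past and future speech context,
        # giving the forward and backward consistency the abstract mentions.
        self.lstm = nn.LSTM(n_audio_feats, hidden,
                            batch_first=True, bidirectional=True)
        # Regress (x, y, z) for every joint at each time step.
        self.out = nn.Linear(2 * hidden, 3 * n_joints)

    def forward(self, audio_feats):       # (batch, time, n_audio_feats)
        h, _ = self.lstm(audio_feats)     # (batch, time, 2 * hidden)
        return self.out(h)                # (batch, time, 3 * n_joints)

def smooth(poses, window=5):
    """Step 2: a moving-average temporal filter over the time axis
    (a stand-in for the paper's combined filters)."""
    kernel = torch.ones(1, 1, window) / window
    b, t, d = poses.shape
    x = poses.permute(0, 2, 1).reshape(b * d, 1, t)
    x = nn.functional.conv1d(x, kernel, padding=window // 2)
    return x.reshape(b, d, -1)[:, :, :t].permute(0, 2, 1)

model = SpeechToGesture()
audio = torch.randn(1, 100, 26)           # 100 frames of acoustic features
raw = model(audio)                        # (1, 100, 60) raw pose sequence
smoothed = smooth(raw)                    # same shape, temporally smoothed
print(smoothed.shape)                     # torch.Size([1, 100, 60])
```

At inference time, only the acoustic features of new speech are needed; the smoothing pass trades a small amount of responsiveness for the jitter-free motion evaluated in the paper.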

© 2019 The Japanese Society for Artificial Intelligence