We propose a Japanese sign language recognition system that combines a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM). Existing research has had two problems. First, it has assumed that sign language can be recognized by extracting hand and arm positions and directions as features, although non-manual signals also play an important role in sign language. Second, it has segmented the temporal structure using hand velocity or hand-movement sections, an assumption that may discard the complex temporal structure of sign language. In this research, we created a dataset of videos of the upper bodies of sign language signers using Kinect v2. To extract effective features that include non-manual signals, we fed the visible (RGB) and depth images of the dataset into the CNN frame by frame. The extracted features were then fed into the LSTM frame by frame to capture the complex temporal structure of sign language. We trained the whole network end to end with the backpropagation algorithm. Comparing this CNN-LSTM model with baseline models suggests that it is more effective for sign language recognition.
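The sketch below illustrates the architecture the abstract describes: a per-frame CNN over stacked RGB and depth images feeding an LSTM over time, trained end to end. It is a minimal illustration in PyTorch, not the authors' implementation; the layer sizes, the 4-channel RGB-D stacking, the input resolution, and the number of sign classes are all assumptions, as the abstract does not specify them.

```python
# A minimal CNN-LSTM sketch under assumed hyperparameters; the paper
# does not specify layer sizes, input resolution, or class count.
import torch
import torch.nn as nn

class CNNLSTMSignRecognizer(nn.Module):
    def __init__(self, num_classes=20, feature_dim=256, hidden_dim=256):
        super().__init__()
        # Per-frame CNN: RGB (3 ch) + depth (1 ch) stacked into 4 channels
        # (the stacking is an assumption about how the two streams combine).
        self.cnn = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),        # -> (batch*time, 64, 1, 1)
            nn.Flatten(),                   # -> (batch*time, 64)
            nn.Linear(64, feature_dim), nn.ReLU(),
        )
        # The LSTM consumes the per-frame CNN features in temporal order.
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):
        # frames: (batch, time, 4, H, W)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.view(b * t, c, h, w)).view(b, t, -1)
        out, _ = self.lstm(feats)           # (batch, time, hidden_dim)
        return self.classifier(out[:, -1])  # classify from the last step

# The whole network trains end to end with backpropagation:
model = CNNLSTMSignRecognizer()
video = torch.randn(2, 30, 4, 64, 64)       # 2 clips of 30 RGB-D frames
loss = nn.CrossEntropyLoss()(model(video), torch.tensor([0, 1]))
loss.backward()
```

Classifying from the final LSTM hidden state is one plausible reading of "capture the complex temporal structure"; per-frame outputs could equally be pooled over time.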