Skeleton-Coordinate-Based Sign Language Recognition by Integrating Multiple Transformer Encoders

竹田 詩韻; 張英 夏; 向井 信彦

doi:10.11371/iieej.53.166

Abstract

Sign language recognition is studied around the world, and methods based on RGB images are particularly common. Such methods, however, can lose accuracy because they also learn features of the background. In addition, methods that take the whole image as input cannot represent local features such as hand and arm movements. In this study, we therefore aim to improve the accuracy of sign language recognition with a skeleton-based deep learning model that integrates multiple Transformer encoders, uses changes in skeletal coordinates, and represents both global and local features. The skeletal coordinates obtained with MediaPipe are divided into four parts, and a separate model is trained for each part, giving four trained models. In experiments using the American Sign Language dataset WLASL (Word-Level American Sign Language) as training data, the proposed method achieved higher recognition accuracy than color-based methods, confirming its effectiveness.
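
The data flow described in the abstract (split the skeleton into parts, work on coordinate changes, combine per-part models) can be illustrated with a minimal Python sketch. The abstract does not say which four parts are used or how the four models are integrated, so everything specific here is an assumption made for illustration: the grouping into pose, face, left hand, and right hand (with the landmark counts of MediaPipe Holistic), the frame-to-frame difference as the "coordinate change", and score averaging as the integration step. The names `PART_SIZES`, `split_frame`, `coordinate_change`, and `fuse_scores` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Landmark = Tuple[float, float, float]  # (x, y, z) of one skeletal keypoint

# ASSUMPTION: one frame is a flat list of landmarks ordered pose, face,
# left hand, right hand, with MediaPipe Holistic landmark counts.
# The paper's actual four parts are not stated in the abstract.
PART_SIZES: Dict[str, int] = {
    "pose": 33,
    "face": 468,
    "left_hand": 21,
    "right_hand": 21,
}


def split_frame(frame: Sequence[Landmark]) -> Dict[str, List[Landmark]]:
    """Split one flat frame of landmarks into the four assumed body parts."""
    expected = sum(PART_SIZES.values())
    if len(frame) != expected:
        raise ValueError(f"expected {expected} landmarks, got {len(frame)}")
    parts: Dict[str, List[Landmark]] = {}
    start = 0
    for name, size in PART_SIZES.items():
        parts[name] = list(frame[start:start + size])
        start += size
    return parts


def coordinate_change(
    sequence: Sequence[Sequence[Landmark]],
) -> List[List[Landmark]]:
    """Frame-to-frame displacement of every landmark (T frames -> T-1 deltas)."""
    return [
        [(b[0] - a[0], b[1] - a[1], b[2] - a[2]) for a, b in zip(prev, cur)]
        for prev, cur in zip(sequence, sequence[1:])
    ]


def fuse_scores(per_part_scores: Dict[str, Sequence[float]]) -> List[float]:
    """ASSUMPTION: integrate part models by averaging their class scores."""
    rows = list(per_part_scores.values())
    if not rows:
        raise ValueError("no part scores given")
    n_classes = len(rows[0])
    if any(len(row) != n_classes for row in rows):
        raise ValueError("all parts must score the same number of classes")
    return [sum(column) / len(rows) for column in zip(*rows)]
```

In such a setup, each part's sequence of coordinate changes would be fed to its own Transformer encoder, and `fuse_scores` would stand in for whatever integration the authors actually use.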
