Article ID: 2024EDP7297
Sign language recognition (SLR) from video is a challenging problem. The I3D network, originally proposed for action recognition, is the best-performing model for SLR. However, action recognition and SLR are inherently different problems, so there is room to improve I3D by taking the task-specific features of SLR into account. In this work, we revisit the I3D model and extend its performance in three essential design aspects: a better inception module, named the dilated inception module (DIM); an attention-based temporal attention module (TAM) that identifies the essential features of signs; and the elimination of a loss function that deteriorates performance. The proposed method has been extensively validated on the public WLASL and MS-ASL datasets. It outperformed the state-of-the-art approaches on the WLASL dataset and produced competitive results on the MS-ASL dataset, although the MS-ASL results are only indicative because the original data is unavailable. The Top-1 accuracies of the proposed method on WLASL100 and MS-ASL100 were 79.08% and 82.78%, respectively.