Host: The Japanese Society for Artificial Intelligence
Name: The 39th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 39
Location: [in Japanese]
Date: May 27, 2025 - May 30, 2025
Real-time communication between individuals with hearing impairments and hearing individuals who have not mastered sign language remains challenging. Machine translation of sign language is therefore essential for promoting the social inclusion of people with hearing impairments. Since the introduction of Convolutional Neural Networks (CNNs), the accuracy of sign language translation has improved significantly; however, alternative approaches leveraging Transformer models are also being explored. The Video Vision Transformer (ViViT), an extension of the Transformer designed for video recognition, accepts video data directly as input, but improving its accuracy has typically required preprocessing of the input data. In this study, we fine-tuned a Video Vision Transformer pretrained on the Kinetics-400 video dataset and evaluated its performance on word-level sign language recognition using two widely used sign language datasets, LSA64 and WLASL100. As a result, we achieved accuracy comparable to previous studies without the need for data preprocessing.
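The abstract's key idea, a Video Vision Transformer that consumes raw video clips directly and whose classification head is adapted to word-level sign classes, can be sketched as follows. This is a minimal toy illustration, not the authors' code: the dimensions, the 64-class head (matching the word count of LSA64), and the tubelet size are illustrative assumptions, and a real experiment would load weights pretrained on Kinetics-400 rather than train from random initialization.

```python
import torch
import torch.nn as nn

class TinyViViT(nn.Module):
    """Toy Video Vision Transformer: tubelet embedding + Transformer encoder.

    Illustrative sketch only. In the setting the abstract describes, one
    would instead fine-tune a ViViT checkpoint pretrained on Kinetics-400,
    replacing its classification head with one sized to the sign vocabulary.
    """

    def __init__(self, num_classes=64, dim=64, frames=8, size=32,
                 tubelet=(2, 16, 16)):
        super().__init__()
        # Tubelet embedding: non-overlapping 3-D patches of the video
        # become tokens, so raw clips are input without preprocessing.
        self.embed = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)
        n_tokens = (frames // tubelet[0]) * (size // tubelet[1]) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Head replaced/re-trained when fine-tuning on a sign dataset.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):             # video: (B, C, T, H, W)
        x = self.embed(video)             # (B, dim, T', H', W')
        x = x.flatten(2).transpose(1, 2)  # (B, tokens, dim)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        x = self.encoder(x + self.pos)
        return self.head(x[:, 0])         # classify from the [CLS] token

model = TinyViViT(num_classes=64)         # e.g. the 64 word classes of LSA64
logits = model(torch.randn(2, 3, 8, 32, 32))
print(tuple(logits.shape))                # (2, 64): one score per sign word
```

Fine-tuning then amounts to minimizing a cross-entropy loss over these logits with an optimizer such as AdamW, updating either the whole network or only the new head.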