Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
39th (2025)
Session ID : 3Win5-38

Word-Level Sign Language Recognition with Video Vision Transformer using Transfer Learning
*Kei ITO, Yimeng SUN, Takao NAKAGUCHI, Masaharu IMAI
Abstract

Real-time communication between individuals with hearing impairments and hearing individuals who have not mastered sign language remains challenging. Machine translation of sign language is therefore essential for promoting the social inclusion of people with hearing impairments. Since the introduction of Convolutional Neural Networks (CNNs), the accuracy of sign language translation has improved significantly, and alternative approaches leveraging Transformer models are also being explored. The Video Vision Transformer, an extension of the Transformer designed for video recognition, accepts video data directly as input, although preprocessing of the input is typically required to improve accuracy. In this study, we fine-tuned a Video Vision Transformer pretrained on the Kinetics-400 video dataset and evaluated its performance on word-level sign language recognition using two widely used sign language datasets (LSA64 and WLASL100). The model achieved accuracy comparable to previous studies without any data preprocessing.
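
The paper does not include code, but the fine-tuning setup it describes can be sketched as follows. This is a minimal illustration in Python, assuming the publicly released Hugging Face ViViT-B checkpoint pretrained on Kinetics-400 ("google/vivit-b-16x2-kinetics400"); the checkpoint name, label count, frame count, learning rate, and dummy input are illustrative assumptions, not the authors' exact configuration.

    # Minimal ViViT fine-tuning sketch (assumptions: HF checkpoint, WLASL100-sized head).
    import numpy as np
    import torch
    from transformers import VivitImageProcessor, VivitForVideoClassification

    checkpoint = "google/vivit-b-16x2-kinetics400"  # Kinetics-400 pretrained ViViT-B
    processor = VivitImageProcessor.from_pretrained(checkpoint)

    # Swap the Kinetics-400 classifier head for one sized to the sign vocabulary.
    model = VivitForVideoClassification.from_pretrained(
        checkpoint,
        num_labels=100,                # e.g. WLASL100; 64 for LSA64
        ignore_mismatched_sizes=True,  # re-initialize the classification head
    )

    # Dummy clip: 32 RGB frames of 224x224, the input size this checkpoint expects.
    frames = list(np.random.randint(0, 256, (32, 224, 224, 3), dtype=np.uint8))
    inputs = processor(frames, return_tensors="pt")

    # One illustrative training step; real fine-tuning would loop over a dataset.
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    outputs = model(**inputs, labels=torch.tensor([0]))
    outputs.loss.backward()
    optimizer.step()

Because the raw video frames are passed straight to the image processor and model, no sign-specific preprocessing (e.g., hand cropping or pose extraction) appears in the pipeline, matching the abstract's claim that preprocessing was not needed.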

© 2025 The Japanese Society for Artificial Intelligence