Host: The Japanese Society for Artificial Intelligence
Name: The 39th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 39
Location: [in Japanese]
Date: May 27, 2025 - May 30, 2025
Real-time communication between individuals with hearing impairments and hearing individuals who have not mastered sign language remains challenging. Machine translation of sign language is therefore essential for promoting the social inclusion of people with hearing impairments. Since the introduction of Convolutional Neural Networks (CNNs), the accuracy of sign language translation has improved significantly; however, alternative approaches leveraging Transformer models are also being explored. The Video Vision Transformer (ViViT), an extension of the Transformer designed for video recognition, accepts video data directly as input, but improving its accuracy has typically required preprocessing of the input data. In this study, we fine-tuned a Video Vision Transformer pretrained on the Kinetics-400 video dataset and evaluated its performance on word-level sign language recognition using two widely used sign language datasets, LSA64 and WLASL100. As a result, we achieved accuracy comparable to previous studies without the need for data preprocessing.
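The abstract's key idea, a Video Vision Transformer that consumes raw video clips directly and whose classification head is adapted to word-level sign classes, can be sketched as follows. This is a minimal toy illustration, not the authors' code: the dimensions, the 64-class head (matching the word count of LSA64), and the tubelet size are illustrative assumptions, and a real experiment would load weights pretrained on Kinetics-400 rather than train from random initialization.

```python
import torch
import torch.nn as nn

class TinyViViT(nn.Module):
    """Toy Video Vision Transformer: tubelet embedding + Transformer encoder.

    Illustrative sketch only. In the setting the abstract describes, one
    would instead fine-tune a ViViT checkpoint pretrained on Kinetics-400,
    replacing its classification head with one sized to the sign vocabulary.
    """

    def __init__(self, num_classes=64, dim=64, frames=8, size=32,
                 tubelet=(2, 16, 16)):
        super().__init__()
        # Tubelet embedding: non-overlapping 3-D patches of the video
        # become tokens, so raw clips are input without preprocessing.
        self.embed = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)
        n_tokens = (frames // tubelet[0]) * (size // tubelet[1]) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Head replaced/re-trained when fine-tuning on a sign dataset.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):             # video: (B, C, T, H, W)
        x = self.embed(video)             # (B, dim, T', H', W')
        x = x.flatten(2).transpose(1, 2)  # (B, tokens, dim)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        x = self.encoder(x + self.pos)
        return self.head(x[:, 0])         # classify from the [CLS] token

model = TinyViViT(num_classes=64)         # e.g. the 64 word classes of LSA64
logits = model(torch.randn(2, 3, 8, 32, 32))
print(tuple(logits.shape))                # (2, 64): one score per sign word
```

Fine-tuning then amounts to minimizing a cross-entropy loss over these logits with an optimizer such as AdamW, updating either the whole network or only the new head.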