2024, Vol. 26, No. 4, pp. 391-398
Despite recent successes in nonverbal human-computer interaction (HCI) facilitated by deep learning methods, sign language translation for HCI remains underexplored. In this paper, we analyze and develop a sign language translation system that recognizes continuous signs and converts their meanings into natural spoken sentences in an end-to-end manner. We believe this system will enhance the interaction between computers and deaf and hard-of-hearing individuals. In developing this sign language translation system, we introduced a high-quality sign embedding to extract informative spatio-temporal representations from continuous sign motions and applied label smoothing to the training criterion to mitigate overfitting. The proposed methods therefore help narrow the modality gap between vision (sign language) and language (spoken sentences). We conducted experiments with the proposed methods on the PHOENIX14T dataset, yielding significantly improved results (WER↓: 29.20→20.79, BLEU-4↑: 21.12→24.56).
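To illustrate the label-smoothing idea mentioned above, the following is a minimal sketch of a label-smoothed cross-entropy criterion. It assumes a PyTorch implementation (the paper does not state the framework), and the names `LabelSmoothingLoss`, `vocab_size`, and `smoothing` are illustrative, not the authors' code. The gold token receives probability mass `1 - smoothing`, and the remaining mass is spread uniformly over the rest of the vocabulary, which discourages the decoder from becoming overconfident and thereby mitigates overfitting.

```python
import torch
import torch.nn as nn


class LabelSmoothingLoss(nn.Module):
    """Cross-entropy with label smoothing: the gold token gets
    probability (1 - smoothing); the remaining `smoothing` mass is
    spread uniformly over the other vocabulary entries."""

    def __init__(self, vocab_size: int, smoothing: float = 0.1,
                 ignore_index: int = -100):
        super().__init__()
        self.vocab_size = vocab_size
        self.smoothing = smoothing
        self.ignore_index = ignore_index

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (batch, vocab_size); target: (batch,) of token indices.
        log_probs = logits.log_softmax(dim=-1)
        with torch.no_grad():
            # Smoothed target distribution.
            true_dist = torch.full_like(
                log_probs, self.smoothing / (self.vocab_size - 1))
            safe_target = target.clamp(min=0)  # avoid scatter on padding ids
            true_dist.scatter_(1, safe_target.unsqueeze(1), 1.0 - self.smoothing)
        loss = -(true_dist * log_probs).sum(dim=-1)
        mask = target.ne(self.ignore_index)  # drop padded positions
        return loss[mask].mean()


# Usage: a batch of 4 decoder steps over a hypothetical 1000-word vocabulary.
criterion = LabelSmoothingLoss(vocab_size=1000, smoothing=0.1)
logits = torch.randn(4, 1000)
target = torch.randint(0, 1000, (4,))
print(criterion(logits, target))
```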