Advancing Human-Computer Interaction: End-to-End Sign Language Translation.

Sihan Tan; Katsutoshi Itoyama; Kazuhiro Nakadai

doi:10.11184/his.26.4_391

Abstract

Despite recent successes in nonverbal human-computer interaction (HCI) facilitated by deep learning methods, sign language translation for HCI remains underexplored. In this paper, we analyze and develop a sign language translation system that recognizes continuous signs and converts their meaning into natural spoken-language sentences in an end-to-end manner. We believe this system will enhance interaction between computers and deaf and hard-of-hearing individuals. In developing the system, we introduced a high-quality sign embedding to extract informative spatial-temporal representations from continuous sign motions, and we applied label smoothing to the training criterion to mitigate overfitting. The proposed methods thereby help narrow the modality gap between vision (sign language) and language (spoken sentences). We evaluated the proposed methods on the PHOENIX14T dataset and obtained significantly improved results (WER↓: 29.20 → 20.79; BLEU-4↑: 21.12 → 24.56).
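The abstract names label smoothing as the regularizer applied to the training criterion but does not give its exact form. As a minimal sketch of the general technique, assuming the common uniform variant in which a fraction epsilon of the probability mass is spread evenly over all classes, a label-smoothed cross-entropy can be written as follows. The function names, the value epsilon = 0.1, and the choice of a plain cross-entropy criterion are illustrative assumptions, not details taken from the paper.

```python
import math


def smooth_targets(target_index, num_classes, epsilon=0.1):
    """Build a smoothed target distribution (assumed uniform variant).

    The true class receives 1 - epsilon + epsilon / num_classes and every
    other class receives epsilon / num_classes, so the entries sum to 1.
    """
    if not 0 <= target_index < num_classes:
        raise ValueError("target_index out of range")
    off_value = epsilon / num_classes
    dist = [off_value] * num_classes
    dist[target_index] += 1.0 - epsilon
    return dist


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothed_cross_entropy(logits, target_index, epsilon=0.1):
    """Cross-entropy between the smoothed targets and softmax(logits)."""
    log_probs = log_softmax(logits)
    targets = smooth_targets(target_index, len(logits), epsilon)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


if __name__ == "__main__":
    confident_logits = [10.0, 0.0, 0.0]  # model is very sure of class 0
    plain = label_smoothed_cross_entropy(confident_logits, 0, epsilon=0.0)
    smoothed = label_smoothed_cross_entropy(confident_logits, 0, epsilon=0.1)
    print(f"plain cross-entropy:    {plain:.4f}")
    print(f"smoothed cross-entropy: {smoothed:.4f}")
```

With epsilon set to 0 the function reduces to ordinary cross-entropy. With epsilon above 0, an over-confident prediction still incurs a non-trivial loss, which is the mechanism by which label smoothing is generally understood to discourage overfitting.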

