Organizer: The Japan Society of Mechanical Engineers
Conference: Proceedings of the IIP2023 Information, Intelligence and Precision Equipment Division Conference
Dates: 2023/03/06 - 2023/03/07
In recent years, there has been growing interest in using silent speech to generate speech and recognize speech content for applications in medicine, human interaction, and entertainment. Gaddy et al. developed a machine learning model that generates speech from muscle activity measured by electromyography (EMG), achieving 64 % word recognition accuracy for a single English speaker. This study examines whether this approach can be applied to Japanese speech synthesis by (1) fine-tuning a machine learning model pretrained on English data, (2) investigating useful EMG electrode locations that have not been measured in previous studies, (3) examining the feasibility of training on EMG data from non-speaking individuals paired with speech recorded by other people, and (4) evaluating the effect of phoneme-balanced sentences on word recognition accuracy. The results suggest that Japanese speech synthesis is possible with a limited vocabulary, and that fine-tuning with English data improves accuracy by 40 % relative to not performing fine-tuning. Adding EMG electrode locations, particularly on the neck (the styloglossus and hyoglossus muscle groups), improves word recognition accuracy by 60 % relative to not adding them. It was also observed that speech can be generated by training on EMG data paired with speech from other people, and that phoneme-balanced sentences are useful for data creation. Word recognition accuracy is expected to improve as more Japanese data becomes available, and this technology is ultimately expected to enable Japanese speech generation with unrestricted vocabulary.
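The fine-tuning idea above can be illustrated with a minimal, self-contained sketch: pretrain a tiny EMG-feature-to-audio-feature regressor on a larger source-language ("English") dataset, then briefly fine-tune it on a smaller target-language ("Japanese") dataset. All names, dimensions, and data here are illustrative assumptions, not the authors' actual model or data.

```python
import random

random.seed(0)

DIM = 4  # toy EMG feature dimension (illustrative, not the real feature size)

def make_data(n, true_w, noise):
    """Generate synthetic (EMG features, audio target) pairs from a linear map."""
    data = []
    for _ in range(n):
        x = [random.gauss(0, 1) for _ in range(DIM)]
        y = sum(wi * xi for wi, xi in zip(true_w, x)) + random.gauss(0, noise)
        data.append((x, y))
    return data

def mse(w, data):
    """Mean squared error of linear weights w on a dataset."""
    return sum(
        (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data
    ) / len(data)

def train(w, data, lr=0.02, epochs=50):
    """Plain per-sample gradient descent; returns updated weights."""
    w = list(w)
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Source and target mappings share structure, mimicking the assumption that
# English and Japanese articulation produce related EMG-to-audio relations.
w_en = [1.0, -0.5, 0.3, 0.8]
w_ja = [0.9, -0.4, 0.4, 0.7]

pretrain_set = make_data(200, w_en, 0.1)  # larger source-language set
finetune_set = make_data(20, w_ja, 0.1)   # small target-language set

w0 = [0.0] * DIM
w_pre = train(w0, pretrain_set)                  # pretraining stage
w_ft = train(w_pre, finetune_set, epochs=10)     # brief fine-tuning stage

loss_ft = mse(w_ft, finetune_set)
```

The point of the sketch is only the two-stage training structure: the fine-tuned weights start from the pretrained solution rather than from scratch, which is the mechanism the paper credits for the 40 % relative improvement.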