Proceedings of the JSME Information, Intelligence and Precision Equipment Division (IIP) Conference
Online ISSN : 2424-3140
Session ID: IIPC-5-11
Development of a Speech Synthesis System Based on Surface Electromyography of the Face and Neck during Silent Speech in Japanese
*小林 叶昌, 楓 和憲, 綿貫 啓一
Abstract

In recent years, there has been growing interest in using silent speech to generate speech and recognize speech content for applications in medicine, human interaction, and entertainment. Gaddy et al. developed a machine learning model that generates speech from muscle activity, measured with electromyography (EMG), achieving 64% word recognition accuracy for one English speaker. This study examines the feasibility of applying this approach to Japanese speech synthesis by fine-tuning a machine learning model with English data, investigating useful EMG electrode locations that have not been measured in previous studies, examining the feasibility of training on EMG data from non-speaking individuals paired with speech from other people, and evaluating the effect of phoneme-balanced sentences on word recognition accuracy. The results suggest that speech synthesis in Japanese is possible with a limited vocabulary, and that fine-tuning with English data improves accuracy by 40% relative to not fine-tuning. Adding EMG electrode locations, particularly on the neck (styloglossus and hyoglossus muscle groups), improves word recognition accuracy by 60% relative to not adding them. It was also observed that speech can be generated from EMG data paired with speech recorded from other people, and that using phoneme-balanced sentences for data creation is useful. Word recognition accuracy is expected to improve as more Japanese data becomes available, and this technology is eventually expected to generate Japanese speech with unrestricted vocabulary.
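The transfer-learning recipe outlined above (train on a large English EMG corpus first, then adapt to a small Japanese one) can be sketched schematically as follows. This is an illustrative toy, not the authors' model: a one-parameter linear map stands in for the real EMG-to-acoustic-feature network, and both datasets are synthetic stand-ins.

```python
# Toy sketch of pretraining followed by fine-tuning (NOT the authors' code).
# A scalar weight w stands in for the EMG-to-acoustic-feature model; both
# "English" and "Japanese" datasets are synthetic pairs (x, y) with y = 2*x.
import random

def sgd_fit(w, data, lr=0.01, epochs=200):
    """Minimize mean squared error of y ~ w * x with plain SGD."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2.0 * (w * x - y) * x  # gradient of (w*x - y)^2
    return w

random.seed(0)
# Large "English" pretraining set, small "Japanese" fine-tuning set.
pretrain = [(x, 2.0 * x) for x in (random.uniform(-1, 1) for _ in range(100))]
finetune = [(x, 2.0 * x) for x in (random.uniform(-1, 1) for _ in range(10))]

w = sgd_fit(0.0, pretrain)            # pretraining on the large corpus
w = sgd_fit(w, finetune, lr=0.005)    # fine-tuning on the small corpus
```

In the real system, fine-tuning starts from weights that already encode EMG-to-speech structure, so far less target-language data is needed than training from scratch, which is consistent with the 40% relative improvement the abstract reports.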

© 2023 The Japan Society of Mechanical Engineers