2025, Vol. 43, No. 9, pp. 919-922
This paper proposes a speech synthesis model driven by multimodal information, i.e., text and lip movements, in order to generate more natural speech that includes both voiced and unvoiced sections. The architecture consists of an image feature extractor based on an auto-encoder and an encoder-decoder model that outputs a mel-spectrogram. Speech synthesis that reflects the lip movements in the text-driven output was successfully achieved. Three ways of combining text and lip movements were compared and evaluated.
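As a rough illustration of the two-stage architecture the abstract describes (an auto-encoder for lip-image features feeding an encoder-decoder that predicts a mel-spectrogram), the following PyTorch sketch shows one plausible realization. All class names, layer sizes, the LSTM-based encoder-decoder, and the concatenation-based fusion of text and lip features are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn


class LipAutoEncoder(nn.Module):
    """Auto-encoder that compresses lip-region frames into feature vectors.
    Hypothetical sketch: input size (1x64x64) and layer widths are assumptions."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, feat_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64 * 16 * 16),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, frames):
        z = self.encoder(frames)   # lip features passed to the synthesizer
        recon = self.decoder(z)    # reconstruction loss trains the auto-encoder
        return z, recon


class MultimodalTTS(nn.Module):
    """Encoder-decoder that fuses text embeddings with lip features and
    predicts a mel-spectrogram. Fusion by concatenation is an assumption."""

    def __init__(self, vocab_size=256, text_dim=128, lip_dim=128,
                 hidden=256, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, text_dim)
        self.encoder = nn.LSTM(text_dim + lip_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.mel_head = nn.Linear(hidden, n_mels)

    def forward(self, text_ids, lip_feats):
        # text_ids: (B, T) token ids; lip_feats: (B, T, lip_dim),
        # assumed here to be pre-aligned to the text frames for simplicity.
        x = torch.cat([self.text_emb(text_ids), lip_feats], dim=-1)
        enc_out, _ = self.encoder(x)
        dec_out, _ = self.decoder(enc_out)
        return self.mel_head(dec_out)  # (B, T, n_mels) mel-spectrogram
```

In this sketch the lip features give the synthesizer a cue for when the mouth is open or closed, which is what lets the model place voiced and unvoiced sections more naturally than text alone; how the paper actually aligns and fuses the two modalities is not specified in the abstract.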