2025, Vol. 43, No. 9, pp. 919-922
This paper proposes a speech synthesis model driven by multimodal information, i.e., text and lip movements, in order to generate more natural speech that includes both voiced and unvoiced sections. The architecture consists of an image feature extractor based on an auto-encoder and an encoder-decoder model that outputs a mel-spectrogram. Speech synthesis that reflects the lip movements in the text-driven output was successfully achieved. Three ways of combining text and lip movements were compared and evaluated.
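As a rough illustration of the two-stage architecture the abstract describes (an auto-encoder for lip-image features feeding an encoder-decoder that predicts a mel-spectrogram), the following PyTorch sketch shows one plausible realization. All class names, layer sizes, the LSTM-based encoder-decoder, and the concatenation-based fusion of text and lip features are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn


class LipAutoEncoder(nn.Module):
    """Auto-encoder that compresses lip-region frames into feature vectors.
    Hypothetical sketch: input size (1x64x64) and layer widths are assumptions."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, feat_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64 * 16 * 16),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, frames):
        z = self.encoder(frames)   # lip features passed to the synthesizer
        recon = self.decoder(z)    # reconstruction loss trains the auto-encoder
        return z, recon


class MultimodalTTS(nn.Module):
    """Encoder-decoder that fuses text embeddings with lip features and
    predicts a mel-spectrogram. Fusion by concatenation is an assumption."""

    def __init__(self, vocab_size=256, text_dim=128, lip_dim=128,
                 hidden=256, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, text_dim)
        self.encoder = nn.LSTM(text_dim + lip_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.mel_head = nn.Linear(hidden, n_mels)

    def forward(self, text_ids, lip_feats):
        # text_ids: (B, T) token ids; lip_feats: (B, T, lip_dim),
        # assumed here to be pre-aligned to the text frames for simplicity.
        x = torch.cat([self.text_emb(text_ids), lip_feats], dim=-1)
        enc_out, _ = self.encoder(x)
        dec_out, _ = self.decoder(enc_out)
        return self.mel_head(dec_out)  # (B, T, n_mels) mel-spectrogram
```

In this sketch the lip features give the synthesizer a cue for when the mouth is open or closed, which is what lets the model place voiced and unvoiced sections more naturally than text alone; how the paper actually aligns and fuses the two modalities is not specified in the abstract.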