Paper ID: e25.75
This paper presents a novel approach to speech synthesis from articulatory movements captured by real-time magnetic resonance imaging (rtMRI), with a focus on how fundamental frequency (F0) is estimated. Although recent rtMRI-based methods have achieved promising results, it remains unclear how they reproduce F0 information, given that rtMRI has limited ability to capture vocal fold vibration. To address this gap, we propose a speech synthesis method that processes only four consecutive rtMRI frames (~150 ms), which prevents the model from relying on extended linguistic context to infer F0. The method uses an EfficientNetV2-BiLSTM network to extract F0-related features and estimate mel-spectrograms, followed by a HiFi-GAN vocoder for high-fidelity waveform generation. Evaluations on the ATR 503 sentences rtMRI database demonstrate intelligible speech synthesis with accurate F0 reproduction. Building on these results, we further estimate F0 from single MRI frames, confirming that F0 can be derived without temporal context. To explore the underlying basis, we apply optical flow analysis to visualize the subtle articulatory differences associated with F0 control; the analysis primarily reveals upward and forward shifts of the larynx and tongue as F0 increases. We additionally observe distinct patterns in male speakers in the low F0 range. These findings empirically validate the relationship between articulatory configuration and F0 control and demonstrate that reproducing F0 is feasible in rtMRI-based speech synthesis.
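
The four-frame input restriction described above can be illustrated with a minimal sketch of how a frame sequence might be cut into short windows before being passed to a network. This is not the authors' code: the function names, the hop size of one frame, and the frame rate used in the usage note are hypothetical, and the abstract does not state the acquisition frame rate.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sliding_windows(
    frames: Sequence[Frame], window: int = 4, hop: int = 1
) -> List[List[Frame]]:
    """Return every run of `window` consecutive frames, advancing by `hop`.

    The default window of 4 follows the abstract; the hop of 1 is an
    assumption made for this illustration.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    last_start = len(frames) - window
    # If the sequence is shorter than one window, the range is empty.
    return [list(frames[i:i + window]) for i in range(0, last_start + 1, hop)]


def window_duration_ms(window: int, fps: float) -> float:
    """Time covered by `window` frames, counting one frame interval per frame.

    `fps` is a caller-supplied value, not a figure taken from the paper.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 * window / fps
```

For example, `sliding_windows(list(range(6)))` yields three overlapping windows of four frame indices each, and `window_duration_ms(4, 25.0)` gives 160.0 ms for a hypothetical rate of 25 frames per second. Keeping each window this short is what limits how much linguistic context a model could exploit when predicting F0.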