自然言語処理 (Journal of Natural Language Processing)
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
Vowel Articulation and Tongue Position in Vision Language Models
Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
JOURNAL FREE ACCESS

2025, Volume 32, Issue 3, pp. 859-885

Abstract

How human vocalizations are articulated can be described by analyzing tongue position. Researchers have established this through lived experience and detailed observation, including MRI imaging. Drawing on this knowledge and their own experience, teachers can understand and explain the relationship between tongue positions and vowels, helping language learners acquire pronunciation. Our preliminary studies suggest that language models (LMs), trained on extensive data from the linguistic and medical fields, can explain the mechanisms of vowel pronunciation. However, it is unclear whether multimodal LMs, such as Large-scale Vision Language Models (LVLMs), sufficiently align textual information with visual information. This raises the research question: do LVLMs associate real tongue positions with vowel articulation? To investigate whether visual information can help LVLMs understand vowel articulation based on tongue positions, this study created video and image datasets from real-time MRI samples. We discuss how LVLMs predict vowels by analyzing several experimental results. Our findings suggest that LVLMs can potentially identify the interrelationship between vowels and tongue positions when reference examples are provided, but struggle without them. LVLMs also perform better when inferring directly from visual information than when given text descriptions before inferring.

© 2025 The Association for Natural Language Processing