This paper describes the Asian English Speech cOrpus Project (AESOP): its aims, data collection platform, and auto-annotation method. It also examines phonemic variation in Japanese speakers' readings of "The North Wind and the Sun" in the corpus, in relation to the speakers' English fluency levels. The results showed that the main segmental difference in pronunciation between Japanese speakers and model native English speakers lay in the vowels: Japanese speakers produced more variants of vowel phonemes, and vowel reduction in unstressed syllables did not occur in most speakers' utterances. Japanese syllable structure also influenced the speakers' English utterances; there were many instances of vowel epenthesis to break up consonant clusters, but very few examples of vowel deletion or consonant insertion.
This paper describes how speech and text corpora were used in developing OJAD (Online Japanese Accent Dictionary), an online system for teaching and learning Japanese prosody. Current problems in teaching Japanese prosody are summarized, and their relationship to the system's development is explained. A corpus of spoken verbs along with their conjugations was used to build a verb accent search module. A text corpus of sentences labeled with both accentual phrase boundaries and accent nuclei was used to train a boundary detector and an accent nucleus detector, which were then combined into a prosodic reading tutor. Subjective assessment of both the verb accent search module and the prosodic reading tutor was conducted with 80 teachers of Japanese. All the teachers rated the search module as "very effective" or "effective to some degree," and the reading tutor was rated as "very effective" or "effective to some degree" by 73 teachers. These results indicate the high effectiveness of the two modules. Through the development of OJAD, the author has become aware that gaps remain in communication between Japanese teachers and speech engineers regarding the needs of the former and the technology made available by the latter; these gaps are pointed out in the discussion of future directions.
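To make concrete how a boundary detector can be trained from sentences labeled with accentual phrase boundaries, the following is a minimal, hypothetical sketch: each word juncture is turned into a feature dictionary and classified as boundary or non-boundary. The feature set (part-of-speech pair, relative position) and the toy data are invented for illustration and are not OJAD's actual features or model.

```python
# Hypothetical sketch: training a binary accentual-phrase boundary detector
# from word junctures labeled 1 (boundary) or 0 (no boundary).
# The features below are invented for illustration, not those used in OJAD.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled corpus: one feature dict per word juncture in the sentences.
train_features = [
    {"left_pos": "noun", "right_pos": "particle", "rel_pos": 0.2},
    {"left_pos": "particle", "right_pos": "noun", "rel_pos": 0.4},
    {"left_pos": "noun", "right_pos": "verb", "rel_pos": 0.6},
    {"left_pos": "verb", "right_pos": "aux", "rel_pos": 0.8},
]
train_labels = [0, 1, 1, 0]  # 1 = accentual-phrase boundary at this juncture

detector = make_pipeline(DictVectorizer(), LogisticRegression())
detector.fit(train_features, train_labels)

# Predict whether a juncture in a new sentence is a phrase boundary.
pred = detector.predict([{"left_pos": "particle", "right_pos": "noun",
                          "rel_pos": 0.3}])
```

An accent nucleus detector could be trained the same way, with mora-level labels in place of juncture labels.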
The extent to which language learners hear non-native sounds in terms of native categories depends in part on acoustic and auditory similarities between the two sets of sounds. One unresolved issue is the choice of the parameter space in which similarity should be measured. The current paper demonstrates the application of an unsupervised, corpus-based, data-driven mapping technique that permits the use of rich, high-dimensional data representations, obviating the need for prior commitment to specific low-order speech parameters such as formant frequencies. The approach, known as generative topographic mapping, preserves the structure of the high-dimensional space while mapping it to a lower-dimensional space. We show how this low-dimensional latent space can be used for tasks such as visualising the location of L2 consonants in an existing L1 space and measuring the effect of L2 exposure on the representation of both L2 and L1 consonants, by comparison with data from a behavioural study in which Chinese listeners underwent an intensive training regime on Spanish consonants.
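The core of generative topographic mapping is a grid of latent points mapped through radial basis functions into the data space, with weights and noise variance fitted by EM. The following NumPy sketch shows the mechanics on synthetic data; it uses a 1-D latent grid for brevity (the paper's latent space is 2-D), and all sizes, widths, and the regulariser are arbitrary choices, not the paper's settings.

```python
# Minimal GTM sketch: 1-D latent grid, RBF basis functions, EM updates.
# Synthetic data stands in for the high-dimensional speech representations.
import numpy as np

rng = np.random.default_rng(0)

# Toy "high-dimensional" data: two clusters in 10 dimensions.
X = np.vstack([rng.normal(0.0, 0.3, (40, 10)),
               rng.normal(2.0, 0.3, (40, 10))])
N, D = X.shape

K, M = 20, 5                                 # latent grid points, basis functions
Z = np.linspace(-1, 1, K)[:, None]           # 1-D latent grid
mu = np.linspace(-1, 1, M)[:, None]          # basis centres
Phi = np.exp(-((Z - mu.T) ** 2) / (2 * 0.3 ** 2))   # (K, M) basis matrix

W = rng.normal(0.0, 0.1, (M, D))             # mapping weights
beta = 1.0                                   # inverse noise variance

for _ in range(30):                          # EM iterations
    Y = Phi @ W                              # (K, D) mixture centres in data space
    d2 = ((X[None, :, :] - Y[:, None, :]) ** 2).sum(-1)   # (K, N) sq. distances
    logR = -0.5 * beta * d2
    logR -= logR.max(0)                      # stabilise the softmax
    R = np.exp(logR)
    R /= R.sum(0)                            # responsibilities; columns sum to 1
    G = np.diag(R.sum(1))
    W = np.linalg.solve(Phi.T @ G @ Phi + 1e-3 * np.eye(M), Phi.T @ R @ X)
    beta = N * D / (R * d2).sum()            # update inverse noise variance

# Posterior-mean latent position of each data point (used for visualisation).
latent = (R * Z).sum(0)                      # (N,)
```

Each data point's posterior-mean latent coordinate stays within the grid, so consonant tokens can be plotted directly in the latent space, as done in the paper for L1 and L2 consonants.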
This paper introduces an automatic Mandarin pronunciation evaluation method, which aims at building a computer-based system to partly replace human examiners in the Putonghua Shuiping Ceshi (PSC) in China. The method learns the mapping between recorded speech waveforms and pronunciation proficiency scores by a statistical modeling approach composed of three main modules: a frontend module, an evaluation feature extraction module, and a mapping module. In the frontend module, hidden Markov model (HMM)-based acoustic models are constructed to describe the distribution of acoustic features for standard pronunciation. In the evaluation feature extraction module, posterior probabilities are calculated for segmental and tonal acoustic features of each examinee's speech using the trained acoustic models; these posterior probabilities, together with a duration feature, compose the feature vector for predicting pronunciation scores. Finally, in the mapping module, piecewise linear regression maps the evaluation feature vector to a pronunciation score for each examinee; in our implementation, the piecewise linear regression is realized by cascading an SVM classifier with a linear regression for each class. An experiment on real PSC test data from 5,420 speakers shows that a system built with the proposed method achieved a correlation of 0.901 between its predicted scores and the scores given by human examiners for the first three sections of the PSC test. A second experiment, comparing our system with 20 human examiners, shows that our system ranked second and outperformed most of the human examiners in terms of evaluation accuracy.
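The cascaded mapping module can be illustrated with a small sketch: an SVM classifier first assigns each evaluation-feature vector to a class, and a separate linear regression per class then maps the vector to a score. The 2-D features, three classes, and synthetic "human scores" below are invented for illustration; they stand in for the paper's posterior-probability and duration features.

```python
# Hedged sketch of piecewise linear regression via an SVM classifier
# cascaded with one linear regression per class.  All data is synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (300, 2))              # stand-in evaluation features
# Synthetic "human score": a different linear rule in each of three bands.
band = np.digitize(X[:, 0], [1 / 3, 2 / 3])  # class label 0, 1, or 2
slopes = np.array([60.0, 80.0, 100.0])
score = slopes[band] * X[:, 0] + 5.0 * X[:, 1]

clf = SVC().fit(X, band)                     # stage 1: predict the class
regs = {b: LinearRegression().fit(X[band == b], score[band == b])
        for b in np.unique(band)}            # stage 2: one regression per class

def predict_score(x):
    b = clf.predict(x.reshape(1, -1))[0]
    return regs[b].predict(x.reshape(1, -1))[0]

preds = np.array([predict_score(x) for x in X])
corr = np.corrcoef(preds, score)[0, 1]       # agreement with the "human" scores
```

Splitting the regression by class lets each segment of the score range get its own linear fit, which is the motivation for the piecewise design.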
To develop an automatic emotion estimation system based on speaker information collected during face-to-face conversation, an extensive exploration of the multimodal features of speakers is required. To satisfy this requirement, a multimodal Japanese dialog corpus with dynamic emotional states was created by recording the vocal and facial expressions and physiological reactions of various speakers. Estimation experiments based on a mixed-effect model and multiple regression analysis were conducted to elucidate the relevant features for speaker-independent and speaker-specific emotion estimation. The results revealed that vocal features were most relevant for speaker-independent emotion estimation, whereas facial features were most relevant for speaker-specific emotion estimation.
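As a minimal illustration of relating multimodal features to an emotion rating by multiple regression, the sketch below fits a linear model on synthetic data in which the vocal feature carries the most weight, mimicking the speaker-independent finding. The feature names and data are invented; the paper's corpus and mixed-effect model are not reproduced here.

```python
# Illustrative multiple regression of a continuous emotion rating on
# synthetic vocal, facial, and physiological features.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 200
vocal = rng.normal(size=n)                   # e.g. an F0-based feature
facial = rng.normal(size=n)                  # e.g. a facial-expression feature
physio = rng.normal(size=n)                  # e.g. a physiological reaction

# Synthetic rating: the vocal feature is given the largest true weight.
rating = 0.8 * vocal + 0.3 * facial + 0.1 * physio + rng.normal(0, 0.1, n)

X = np.column_stack([vocal, facial, physio])
model = LinearRegression().fit(X, rating)
weights = dict(zip(["vocal", "facial", "physio"], model.coef_))
```

Comparing the fitted coefficients (or their standardized versions) across feature groups is one simple way to judge which modality is most relevant, analogous to the comparison the paper draws between vocal and facial features.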