Functions have been implemented in various robots to enable them to follow a conversation protocol. The paralinguistic information carried by prosody and posture expression is used to improve the transparency of the conversational states, especially the protocol, thereby effectively contributing to natural and efficient communication. Information is communicated incrementally to enable error handling. Various rules for selecting conversation participants, forming a communication group, and turn-taking are followed. Since all the actions of a conversational robot are explicitly controlled, such robots should be useful for revealing important, heretofore unknown conversational functions.
Speech conveys not only linguistic information but also supplemental information that is not inferable from written language, such as attitude, speaking style, intention, emotion, and mental state; this is called paralinguistic or non-linguistic information. Such information plays an important role in smooth and natural communication through spoken language. This paper reviews recognition and synthesis techniques for speech communication, focusing on emotion and emphasis, as well as corpora that are indispensable to the development of current speech technologies.
Beamforming with a microphone array is an ideal candidate for distant-talking speech recognition. An adaptive beamformer can achieve beamforming with a small microphone array, but it has difficulty extracting the speech of a distant, moving talker and reducing moving noise sources, because it must rapidly train long multichannel adaptive filters using the noises observed with the microphone array. However, if the positions of both talkers and noise sources can be estimated, the adaptive filters may not need to be trained in real noisy environments. We therefore propose a multiple-null-steering beamformer based on both talker and noise localization that does not require adaptive training with observed noises. Finally, we confirmed the validity and effectiveness of the proposed method through computer simulations and evaluation experiments in real noisy environments.
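The null-steering idea described above can be sketched for a single narrowband frequency: given localized directions for the talker and each noise source, solve for weights with unit gain toward the talker and a null toward each noise, with no adaptive training on observed noise. This is a minimal illustration, not the authors' implementation; the uniform linear array geometry and all parameter values are assumptions.

```python
import numpy as np

def steering_vector(theta, n_mics, d, freq, c=343.0):
    """Far-field steering vector of a uniform linear array.
    theta: direction of arrival (rad); d: microphone spacing (m)."""
    k = 2.0 * np.pi * freq / c
    positions = np.arange(n_mics) * d
    return np.exp(-1j * k * positions * np.sin(theta))

def null_steering_weights(theta_target, theta_nulls, n_mics, d, freq):
    """Weights giving unit gain toward the localized talker and a null
    toward each localized noise direction (minimum-norm solution)."""
    directions = [theta_target] + list(theta_nulls)
    A = np.stack([steering_vector(t, n_mics, d, freq).conj()
                  for t in directions])
    b = np.zeros(len(directions), dtype=complex)
    b[0] = 1.0  # distortionless response toward the talker
    # Underdetermined system: lstsq returns the minimum-norm weights
    # satisfying all look-direction and null constraints exactly.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

# Four mics at 5 cm spacing, 1 kHz: pass the talker at broadside (0 rad),
# null a noise source at 45 degrees.
w = null_steering_weights(0.0, [np.pi / 4], 4, 0.05, 1000.0)
```

Because the constraints come from localization alone, a change in noise position only requires re-solving this small system, rather than retraining long adaptive filters.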
A case study on the correlation between phonation type and paralinguistic information in Japanese was carried out using a high-speed digital video imaging system. The results showed that ``breathy'' and ``creaky'' phonations corresponded to ``disappointment''- and ``suspicion''-related utterances, respectively. The influence of paralinguistic information extends over segments, including voiceless consonants. This means that the alteration due to paralinguistic information is not limited to voice quality but involves the whole setting of the larynx. These findings are in accord with those of our articulatory study. They suggest that the domain of the phonatory and articulatory settings due to paralinguistic information is the whole utterance, rather than individual segments.
Within the context of English language taught solely using English at Japan's secondary schools, no research quantifies the differences between native instructors (first language English, who may or may not speak Japanese) and non-native instructors (first language Japanese, second language English). We developed a video corpus of an English language classroom and examined the speech of three native instructors and one non-native instructor. The corpus contains 49 English lessons of 45 minutes each in a Japanese public high school with monolingual learners of English as a foreign language. The native and non-native instructors occasionally taught together. Almost all speech in the lessons was in English. We compared the lexical tokens and types found in our transcriptions with a collection of typical classroom English dialogues and with a wordlist created from large bodies of written and spoken English. We obtained the distributions of words, as well as the words preferred by either native or non-native instructors. Results suggest that (a) native and non-native instructors share a core vocabulary of classroom English, (b) native instructors teach vocabulary depth via open-ended conversations, (c) non-native instructors teach vocabulary breadth via textbook explanations, and (d) native and non-native instructors differ in teaching roles but not in language ability.
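The token/type counting and speaker-preference comparison used in such corpus studies can be sketched as follows. This is a toy illustration with hypothetical utterances; the frequency-ratio thresholds are assumptions, not the study's actual methodology.

```python
from collections import Counter

def lexical_profile(utterances):
    """Token and type counts from whitespace-tokenized, lowercased
    transcripts: tokens = running words, types = distinct words."""
    tokens = [w for u in utterances for w in u.lower().split()]
    counts = Counter(tokens)
    return {"tokens": len(tokens), "types": len(counts), "counts": counts}

def preferred_words(counts_a, counts_b, min_ratio=3.0, min_count=2):
    """Words whose relative frequency for speaker A is at least min_ratio
    times that for speaker B -- a crude preference measure."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    prefs = []
    for word, c in counts_a.items():
        if c < min_count:
            continue
        rate_a = c / total_a
        rate_b = counts_b.get(word, 0) / total_b
        if rate_b == 0 or rate_a / rate_b >= min_ratio:
            prefs.append(word)
    return sorted(prefs)

# Hypothetical transcript snippets for illustration only.
native = lexical_profile(["Open your books please",
                          "please repeat after me"])
nonnative = lexical_profile(["please open the textbook",
                             "the textbook explains this"])
shared_core = set(native["counts"]) & set(nonnative["counts"])
```

Words in `shared_core` correspond to finding (a), a shared classroom vocabulary, while `preferred_words` surfaces the words characteristic of one instructor group.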
This paper presents a set of low-complexity tools, based on linear prediction, used in lossless coding of the G.711 bitstream. One is an algorithm for quantizing the PARCOR/reflection coefficients, and the other is an estimation method for the optimal prediction order. Both tools are based on a criterion that minimizes the entropy of the prediction residual signal, and both can be implemented in fixed-point arithmetic at very low complexity. Since the proposed methods show efficient performance in terms of compression and complexity, they have been adopted in Recommendation ITU-T G.711.0, a new standard for lossless compression of G.711 (A-law/μ-law logarithmic PCM) payload.
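The order-selection idea, choosing the prediction order that minimizes an entropy-based code-length estimate for the residual, can be sketched in floating point as follows. The Levinson-Durbin recursion yields the residual power at every order in one pass; a Gaussian entropy estimate plus a per-coefficient side-information cost then gives a total code length to minimize. This is an illustrative sketch, not the standard's fixed-point algorithm; `coeff_bits` is an assumed side-information cost.

```python
import numpy as np

def levinson(r, max_order):
    """Levinson-Durbin recursion: returns prediction-error power err[p]
    for each order p = 0..max_order (r: autocorrelation sequence)."""
    err = [r[0]]
    a = np.zeros(max_order + 1)
    for p in range(1, max_order + 1):
        acc = r[p] - np.dot(a[1:p], r[p-1:0:-1])
        k = acc / err[-1]                # reflection (PARCOR) coefficient
        a_new = a.copy()
        a_new[p] = k
        a_new[1:p] = a[1:p] - k * a[p-1:0:-1]
        a = a_new
        err.append(err[-1] * (1.0 - k * k))
    return err

def best_order(x, max_order, coeff_bits=4.0):
    """Pick the order minimizing ~0.5*log2(residual power) bits per sample
    (Gaussian entropy estimate) plus coeff_bits per transmitted coefficient."""
    n = len(x)
    r = np.array([np.dot(x[:n - lag], x[lag:]) for lag in range(max_order + 1)]) / n
    errs = levinson(r, max_order)
    costs = [n * 0.5 * np.log2(max(e, 1e-12)) + coeff_bits * p
             for p, e in enumerate(errs)]
    return int(np.argmin(costs))
```

On a first-order autoregressive signal, the criterion stops at a low order because further coefficients cost more side information than the entropy they remove.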
A method of computing the acoustic characteristics of a simplified three-dimensional vocal-tract model with wall impedance is presented. The acoustic field is represented in terms of both plane waves and higher order modes in tubes. This model is constructed using an asymmetrically connected structure of rectangular acoustic tubes, and can parametrically represent acoustic characteristics at higher frequencies where the assumption of plane wave propagation does not hold. The propagation constants of the higher order modes are calculated taking account of wall impedance. The resonance characteristics of the vocal-tract model are evaluated using the radiated acoustic power. Computational results show an increase in bandwidth and a small upward shift of peaks, particularly at lower frequencies, as already suggested by the one-dimensional model. It is also shown that the sharp peaks at higher frequencies are less sensitive to the values of wall impedance even though the attenuation of the higher order modes is larger than that of plane waves.
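The frequency limit beyond which the plane-wave assumption fails can be made concrete: in a rigid-walled rectangular tube of cross-section a × b, the (m, n) higher-order mode propagates only above its cut-off frequency f_mn = (c/2)·sqrt((m/a)² + (n/b)²). A minimal sketch, where the sound speed and tube dimensions are illustrative values, not taken from the paper:

```python
import numpy as np

def cutoff_frequency(m, n, a, b, c=354.0):
    """Cut-off frequency (Hz) of the (m, n) mode of a rigid-walled
    rectangular tube with cross-section a x b (metres); c is an assumed
    sound speed, roughly that of warm, moist vocal-tract air."""
    return (c / 2.0) * np.sqrt((m / a) ** 2 + (n / b) ** 2)

# For an illustrative 3 cm x 2 cm cross-section, the first higher-order
# mode cuts on near 5.9 kHz; one-dimensional plane-wave models are
# reliable only below this frequency.
first_cutoff = cutoff_frequency(1, 0, 0.03, 0.02)
```

Above the first cut-off, the asymmetric tube connections excite these modes, which is why the three-dimensional model is needed to represent the high-frequency resonance characteristics.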
A minimum generation error (MGE) criterion has been proposed for model training in hidden Markov model (HMM)-based speech synthesis to minimize the error between generated and original static parameter sequences of speech. However, dynamic properties of speech parameters are ignored in the generation error definition. In this study, we incorporate these dynamic properties into MGE training by introducing the error component of dynamic features (i.e., delta and delta-delta parameters) into the generation error function. We propose two methods for setting the weight associated with the additional error component. In the fixed weighting approach, this weight is kept constant over the course of speech. In the adaptive weighting approach, it is adjusted according to the degree of dynamicity of speech segments. An objective evaluation shows that the newly derived MGE criterion with the adaptive weighting method results in comparable performance for the static feature and better performance for the delta feature compared with the baseline MGE criterion. Subjective listening tests exhibit a small but statistically significant improvement in the quality of speech synthesized by the proposed technique. The newly derived criterion improves the capability of HMMs in capturing dynamic properties of speech without increasing the computational complexity of the training process compared with the baseline criterion.
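The weighted generation error can be sketched on a toy scalar trajectory as follows. The delta window, the normalized-delta weighting scheme, and all function names are illustrative assumptions, not the paper's exact formulation; the point is only the structure of the criterion: a static error term plus a weighted dynamic-feature error term, with the weight either fixed or tied to local dynamicity.

```python
import numpy as np

def delta(c):
    """Simple first-order dynamic feature: half the difference of
    neighbouring frames, with edge padding."""
    padded = np.pad(c, (1, 1), mode="edge")
    return 0.5 * (padded[2:] - padded[:-2])

def generation_error(c_gen, c_org, w):
    """Squared generation error with a weighted dynamic-feature term.
    w may be a scalar (fixed weighting) or a per-frame array
    (adaptive weighting)."""
    e_static = (c_gen - c_org) ** 2
    e_delta = (delta(c_gen) - delta(c_org)) ** 2
    return np.sum(e_static + w * e_delta)

def adaptive_weight(c_org, w_max=1.0, eps=1e-8):
    """Hypothetical adaptive scheme: weight each frame by its normalized
    delta magnitude, so highly dynamic segments count more."""
    d = np.abs(delta(c_org))
    return w_max * d / (d.max() + eps)
```

With `w = 0` this reduces to the baseline static-only MGE criterion; a constant `w` gives the fixed weighting approach, and `adaptive_weight(c_org)` gives a per-frame weight that emphasizes transitions over steady segments.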
In this paper, we propose a new noise suppression method that is best used as a preprocessor for time-lag speech recognition. Assuming that a time lag of a few seconds is acceptable in various speech recognition applications, the proposed method is realized as a combination of forward and backward estimation flows over time. Each estimation flow is based on the optimally modified log-spectral amplitude (OM-LSA) speech estimator, but a look-ahead estimation mechanism is additionally incorporated to make the estimation more robust. Evaluation experiments using various databases confirm that speech recognition accuracy can be greatly improved by adding the proposed method to an existing system.
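The forward/backward combination idea can be sketched as follows. For simplicity the OM-LSA noise tracker is replaced here by a plain recursive smoother over per-frame power, so this is a stand-in illustrating only the two-flow structure; the combination rule (element-wise minimum) is also an assumption.

```python
import numpy as np

def smoothed_noise_power(frames_power, alpha=0.9):
    """Causal recursive noise-power estimate over per-frame power values
    (a simple stand-in for the full OM-LSA noise tracker)."""
    est = np.empty_like(frames_power)
    est[0] = frames_power[0]
    for t in range(1, len(frames_power)):
        est[t] = alpha * est[t - 1] + (1 - alpha) * frames_power[t]
    return est

def forward_backward_estimate(frames_power, alpha=0.9):
    """Combine a forward pass with a pass over time-reversed frames,
    as permitted when a few seconds of time lag is acceptable."""
    fwd = smoothed_noise_power(frames_power, alpha)
    bwd = smoothed_noise_power(frames_power[::-1], alpha)[::-1]
    # Take the element-wise minimum so a change in the noise is tracked
    # from whichever direction adapts to it sooner.
    return np.minimum(fwd, bwd)
```

When the noise level drops abruptly, the causal forward flow lags behind for many frames, while the backward flow (which has effectively seen the future) is already correct, so the combined estimate adapts immediately.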
We first compared a speech signal convolved with two reverberations, a normal reverberation and its time-reversed version, which have the same modulation transfer function. Results showed that the intelligibility of speech with the time-reversed reverberation was significantly lower than that with the normal reverberation. We then compared the results of human speech recognition (HSR) with those of automatic speech recognition (ASR) to see whether a similar tendency could be observed in both cases. Results showed a similar asymmetry in ASR, but we found that HSR was more tolerant as the reverberation became longer. Finally, we discussed the asymmetric temporal properties of speech production and perception that current speech recognizers do not have.
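Why a reverberation and its time-reversed version share the same modulation transfer function can be checked numerically: the MTF depends on the magnitude of the Fourier transform of the squared impulse-response envelope, and time reversal leaves that magnitude unchanged. The sketch below uses an idealized exponentially decaying noise impulse response with illustrative parameters, not the stimuli from the experiment.

```python
import numpy as np

def exp_decay_reverb(t60, fs, length, seed=0):
    """Noise-like room impulse response with exponential decay
    (reverberation time t60 seconds, sampling rate fs, length seconds)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(length * fs)) / fs
    # Amplitude envelope 10**(-3 t / t60) gives a 60 dB power decay at t60.
    return rng.standard_normal(len(t)) * np.exp(-3.0 * np.log(10) * t / t60)

def mtf(h, fs, fm):
    """Modulation transfer function of impulse response h at modulation
    frequency fm (Houtgast & Steeneken energy-envelope definition)."""
    t = np.arange(len(h)) / fs
    num = np.abs(np.sum(h ** 2 * np.exp(-2j * np.pi * fm * t)))
    return num / np.sum(h ** 2)

h = exp_decay_reverb(0.5, 8000, 0.6)   # 0.5 s reverberation time
m_normal = mtf(h, 8000, 4.0)
m_reversed = mtf(h[::-1], 8000, 4.0)   # identical MTF, different percept
```

The two conditions are therefore indistinguishable to any MTF-based intelligibility predictor, which is exactly what makes the observed perceptual asymmetry informative.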