For Japanese speech processing, the ability to automatically distinguish between geminate and singleton consonants has many benefits. Standard recognition methods use hidden Markov models (HMMs). However, HMMs are poor at differentiating items that are distinguished primarily by temporal rather than spectral differences. In addition, gemination depends on the length of the sounds surrounding the consonant. We therefore propose a method that automatically distinguishes geminates from singletons while taking these factors into account. Doing so requires determining which surrounding sounds serve as cues and how human recognition works. To this end, we conduct perceptual experiments to examine the relationship between surrounding sounds and primary cues. Using these results, we then design a method that can automatically recognize gemination. We test this method on two datasets, including a speaking-rate database. The results clearly outperform the HMM-based method, generally outperform recognition based on the primary cue alone, and show greater robustness against speaking rate.
Most previous studies using the dimensional approach have focused on the direct relationship between acoustic features and emotion dimensions (valence, activation, and dominance). However, few acoustic features correlate with the valence dimension, and those correlations are weak; as a result, valence has been particularly difficult to predict. The purpose of this research is to construct a speech emotion recognition system that can precisely estimate the values of emotion dimensions, especially valence. This paper proposes a three-layer model to improve the estimation of emotion dimension values from acoustic features. The proposed model consists of three layers: emotion dimensions in the top layer, semantic primitives in the middle layer, and acoustic features in the bottom layer. First, a top-down acoustic feature selection method based on this model was used to select the most relevant acoustic features for each emotion dimension. Then, a bottom-up method estimated the values of the emotion dimensions from acoustic features: a fuzzy inference system (FIS) first estimates the degree of each semantic primitive from the acoustic features, and a second FIS then estimates the values of the emotion dimensions from the estimated degrees of the semantic primitives. The experimental results reveal that the emotion recognition system based on the proposed three-layer model outperforms the conventional system.
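The two-stage, bottom-up estimation described above can be sketched as a layered pipeline. The following is a minimal illustration only: simple fixed linear mappings stand in for the paper's two trained fuzzy inference systems, and the layer sizes (10 acoustic features, 4 semantic primitives, 3 emotion dimensions) are hypothetical choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: acoustic features (bottom), semantic
# primitives (middle), emotion dimensions (top: valence, activation,
# dominance).
N_ACOUSTIC, N_PRIMITIVES, N_DIMENSIONS = 10, 4, 3

# Random weights stand in for the two trained FIS mappings; in the
# actual system each stage is a fuzzy inference system.
W_low = rng.standard_normal((N_PRIMITIVES, N_ACOUSTIC))
W_high = rng.standard_normal((N_DIMENSIONS, N_PRIMITIVES))

def estimate_primitives(acoustic):
    """Bottom -> middle layer: acoustic features to primitive degrees."""
    return np.tanh(W_low @ acoustic)  # squash to a bounded "degree"

def estimate_dimensions(acoustic):
    """Middle -> top layer: primitive degrees to emotion dimensions."""
    return W_high @ estimate_primitives(acoustic)

features = rng.standard_normal(N_ACOUSTIC)
valence, activation, dominance = estimate_dimensions(features)
```

The key design point carried over from the abstract is that acoustic features never map directly to emotion dimensions; they pass through the intermediate semantic-primitive layer.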
We propose a sound field simulation method for a circular loudspeaker array. For a linear array, when the distance between the loudspeaker units equals the interval between observation points on lines parallel to the array, the sound pressures on an observation line can be calculated by the spatial convolution of a set of transfer functions with the loudspeakers' driving signals. To apply this idea to a circular array, we developed a simulation method with equiangular observation points on the circle. With this method, spatial circular convolution can be used without the zero padding that is necessary for a linear array. By performing the convolution with the fast Fourier transform (FFT), the computational complexity is greatly reduced. Moreover, by treating some loudspeakers in the array as non-active, the proposed method can also be applied to arrays with unequal spacing. For example, when the number of observation points is set to 128 and the number of loudspeakers is set to 32, circular convolution with the FFT reduces the computational complexity to 75% of that of the conventional method. In addition, we argue that this method can be applied to a room in which the first reflections arrive from the floor. The proposed method is useful for simulating the sound field of a circular array when suitable spatial sampling in the circumferential and radial directions is chosen.
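The core computational step above, circular convolution evaluated via the FFT, can be illustrated as follows. This is a generic sketch, not the paper's implementation: the transfer functions are random placeholders, and an unequal-interval array is emulated by zeroing the driving signals of non-active positions (128 observation points, 32 active loudspeakers, as in the example above).

```python
import numpy as np

rng = np.random.default_rng(1)

N = 128                      # equiangular observation points
h = rng.standard_normal(N)   # placeholder transfer functions, one per angle

# Driving signals: 32 active loudspeakers among N positions; the
# remaining positions are non-active (zero), emulating unequal spacing.
d = np.zeros(N)
active = rng.choice(N, size=32, replace=False)
d[active] = rng.standard_normal(32)

def circular_convolve_direct(h, d):
    """O(N^2) reference: p[n] = sum_m h[(n - m) mod N] * d[m]."""
    N = len(h)
    return np.array([sum(h[(n - m) % N] * d[m] for m in range(N))
                     for n in range(N)])

def circular_convolve_fft(h, d):
    """O(N log N): multiply the spectra, then invert (DFT convolution theorem)."""
    return np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(d)))

p = circular_convolve_fft(h, d)  # sound pressures at the observation points
```

Because the DFT of a product of spectra is exactly the circular convolution of the sequences, no zero padding is needed on a closed circle, which is the property the method exploits.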
With the recent progress in computer performance and simulation techniques, it has become feasible to apply full three-dimensional wave-based numerical simulation to large-scale problems of real-life outdoor sound propagation. This paper presents a technique for reconstructing real-life urban geometries, with full reproduction of roof shapes and ground profiles, from digital geographic information, together with a technique for generating the uniform rectilinear grid used in finite-difference time-domain (FDTD) simulations. The geographic datasets used for the reconstruction are a digital surface model and a two-dimensional building outline map. For comparison, another geometry with flat building roofs, the type of geometry used in earlier noise-mapping studies based on empirical models, was created. A comparison of FDTD acoustic simulations performed over the two geometries shows sound pressure level differences above and behind buildings. The maximum level difference, 10 dB in magnitude, indicates the necessity of properly reconstructing roof shapes in real-life urban acoustic simulations.
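The FDTD scheme referred to above advances pressure and particle velocity on a staggered rectilinear grid. A minimal one-dimensional illustration follows; this is not the paper's full 3D solver, and the grid size, time step, and impulse source are arbitrary choices for the sketch.

```python
import numpy as np

c = 343.0          # speed of sound [m/s]
dx = 0.1           # grid spacing [m]
dt = 0.5 * dx / c  # time step satisfying the 1D CFL stability condition
nx, nt = 200, 300

p = np.zeros(nx)       # pressure at integer grid points
u = np.zeros(nx + 1)   # particle velocity at staggered half-points
coef = c * dt / dx     # normalized update coefficient (density folded in)

p[nx // 2] = 1.0       # initial pressure impulse at the domain center

for _ in range(nt):
    # Velocity update from the pressure gradient (interior points only;
    # u[0] and u[-1] stay zero, i.e. rigid boundaries).
    u[1:-1] -= coef * (p[1:] - p[:-1])
    # Pressure update from the velocity divergence.
    p -= coef * (u[1:] - u[:-1])
```

The 3D case uses the same leapfrog pattern with three velocity components on a uniform rectilinear grid, which is why the grid-generation technique described above matters: the reconstructed roof and ground geometry must be rasterized onto that grid.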