Acoustic event and scene analysis has seen extensive development because it is valuable in applications such as monitoring of elderly people and infants, surveillance, life-logging, and advanced multimedia retrieval. This article reviews the basics of acoustic event and scene analysis, including its terminology and problem definitions, available public datasets, challenges, and recent research trends.
In statistical signal processing and machine learning, an open issue has been how to obtain a generative model that can produce samples from high-dimensional data distributions such as images and speech. Generative adversarial networks (GANs) have emerged as a powerful framework that provides clues to solving this problem. A GAN is composed of two networks: a generator that transforms noise variables into data space and a discriminator that discriminates between real and generated data. These two networks are optimized through a min-max game: the generator attempts to deceive the discriminator by generating data indistinguishable from the real data, while the discriminator attempts to avoid being deceived by learning to best discriminate between real and generated data. This novel framework enables the implicit estimation of a data distribution and allows the generator to produce high-fidelity data that are almost indistinguishable from real data. This beneficial and powerful property has attracted a great deal of attention, and a wide range of research, from basic research to practical applications, has recently been conducted. In this paper, I summarize these studies and explain the foundations and applications of GANs. Specifically, I first clarify the relation between GANs and other deep generative models, and then present the theory of GANs with mathematical formulations. Next, I introduce recent advances in GANs and describe impressive applications that are highly relevant to acoustic and speech signal processing. Finally, I conclude this paper by mentioning future directions.
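The min-max game described above is conventionally written as a single value function (the standard formulation due to Goodfellow et al.), with generator $G$, discriminator $D$, data distribution $p_\mathrm{data}$, and noise prior $p_z$:

\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_\mathrm{data}}\bigl[\log D(x)\bigr] + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
\]

The discriminator maximizes $V$ by assigning high probability to real samples and low probability to generated ones, while the generator minimizes $V$ by making $D(G(z))$ approach 1; at the optimum, the generated distribution implicitly matches $p_\mathrm{data}$.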
The human whistle is a typical aeroacoustic sound. Downstream of a small orifice made by the lips, a jet is formed by airflow with a high Reynolds number. A sequence of vortex rings is then produced, and periodic air pressure changes result in a characteristic whistling sound. Although the vocal tract has been reported to act as an acoustic resonator determining the blowing pitch, the precise shape of the vocal tract and its resonance properties during whistling remain unclear. In the current study, the morphological and acoustic properties of the vocal tract were examined during the act of whistling in a single participant. The vocal tract was scanned in three dimensions using magnetic resonance imaging while four musical notes were produced. The data revealed that the tongue constricted the vocal tract in different ways depending on the note, and the location of the constriction moved forward when the blowing pitch increased. Acoustic analysis of the vocal tract showed that the second peak of the lip input impedance was largely in accord with the whistling pitch. In addition, specific regions in the vocal tract were highly acoustically sensitive to small deformations.
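A rough first-order illustration of the tract-as-resonator idea (not the paper's lip input impedance analysis) is to treat the constricted tract as a Helmholtz resonator: shrinking the cavity downstream of the tongue constriction raises the resonance frequency, consistent with the constriction moving forward as the blowing pitch increases. All geometry values below are assumed for illustration only, not measured MRI data.

```python
import math

def helmholtz_hz(c, area, neck_len, volume):
    # Helmholtz resonance: f = (c / 2*pi) * sqrt(A / (V * L))
    return c / (2 * math.pi) * math.sqrt(area / (volume * neck_len))

c = 350.0              # speed of sound in warm, humid air (m/s), assumed
r = 0.004              # lip-orifice radius (m), assumed
area = math.pi * r**2  # orifice area (m^2)
neck = 0.01            # effective orifice length (m), assumed

# shrinking the front-cavity volume (m^3) raises the resonance (pitch)
freqs = [helmholtz_hz(c, area, neck, V) for V in (60e-6, 30e-6, 15e-6)]
```

Halving the cavity volume raises the resonance by a factor of sqrt(2), which is why a sketch this crude still reproduces the qualitative trend reported in the MRI data.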
This study provides reference data on vowel devoicing in Japanese spontaneous speech for forensic and other speech investigations. We analysed the running speech of 226 speakers to examine how the places of origin of the speakers and of their parents influence the occurrence frequency of vowel devoicing. On the basis of the dialect distribution map of vowel devoicing, we classified the speakers into two dialect groups: one in which vowel devoicing occurs frequently (DF) and one in which it occurs infrequently (DIF). The results showed that DF speakers with DF parents devoiced vowels the most frequently, while DIF speakers with DIF parents devoiced them the least; the devoicing of speakers whose parents were of a different dialect fell in between. Some speakers, whether DF or DIF, showed a percentage of vowel devoicing that contradicted their dialect while retaining its accentuation and intonation. We further examined within-speaker variability in vowel devoicing and found that speakers who devoice vowels frequently did so consistently, whereas those who devoice them infrequently did so occasionally and inconsistently. Forensically, we should not simply judge a speaker's dialect from the occurrence frequency of vowel devoicing; we should also consider its reproducibility and other dialect-dependent features.
In marine seismic surveys for exploring seafloor resources, the structure below the seafloor is estimated from recorded sound waves that are emitted by a marine seismic sound source and reflected or refracted at the layer boundaries below the seafloor. Estimating this structure from the returned waves requires information on the sound-source position and the sound speed. Marine seismic vibrators, one type of marine seismic sound source, have advantages such as high controllability of the frequency and phase of the emitted sound and the ability to operate at great depths. However, when the sound source is far from the sea surface, it becomes difficult to determine its exact position. In this paper, we propose a method to estimate the position of a marine seismic vibrator and the sound speed from the obtained seismic data by formulating an optimization problem via the hyperbolic Radon transform. Numerical simulations confirmed that the proposed method almost achieves the theoretical lower bounds for the variances of the estimates.
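The estimation step can be illustrated with a toy version: arrival times following a hyperbolic moveout t(x) = sqrt(t0^2 + (x/v)^2) are fit by a grid search over (t0, v). All values below are synthetic, and the brute-force grid search is only a stand-in for the Radon-transform-based optimization the paper actually formulates.

```python
import math

def traveltime(t0, v, x):
    # hyperbolic moveout: t(x) = sqrt(t0^2 + (x / v)^2)
    return math.sqrt(t0 ** 2 + (x / v) ** 2)

# synthetic arrival picks; t0 (s) and v (m/s) are illustrative, not survey data
true_t0, true_v = 2.0, 1500.0
offsets = [100.0 * i for i in range(1, 21)]            # receiver offsets (m)
picks = [traveltime(true_t0, true_v, x) for x in offsets]

# grid search minimising the squared misfit between modelled and picked times
candidates = []
for k in range(9):
    for m in range(9):
        t0 = 1.8 + 0.05 * k          # candidate intercept times (s)
        v = 1400.0 + 25.0 * m        # candidate sound speeds (m/s)
        misfit = sum((traveltime(t0, v, x) - t) ** 2
                     for x, t in zip(offsets, picks))
        candidates.append((misfit, t0, v))
est_misfit, est_t0, est_v = min(candidates)  # best-fitting (t0, v) pair
```

With noise-free picks the search recovers the generating parameters exactly; the paper's contribution is the principled formulation whose estimator variance approaches the theoretical lower bound, which this sketch does not attempt.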
A simple oscillator model is proposed to investigate the frequency dependence of cochlear two-tone suppression (2TS), the reduction in the cochlear response to a tone when a second tone is presented simultaneously. The frequency dependence of 2TS exhibits two characteristics. First, the shapes of the input–output (IO) curves as a function of the input levels of the two tones depend on the frequency ratio of the two tones. Second, the temporal features of the suppressed responses vary with the frequency ratio. The saturation function is widely used to account for 2TS; however, by itself it cannot explain the frequency dependence of cochlear 2TS. Transmission line models can reproduce this frequency dependence, and it has been suggested that complicated cochlear mechanics generate 2TS in such models. The model proposed in this study comprises a one-degree-of-freedom oscillator with feedback through a saturation function, which endows the model with basic cochlear properties such as amplification, frequency selectivity, and compressive nonlinearity. Simulations show that both the transmission line model and the proposed model reproduce the frequency dependence of 2TS in the IO functions and temporal responses, indicating that this frequency dependence requires the basic cochlear properties of amplification, frequency selectivity, and compressive nonlinearity, which arise from the saturating feedback.
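The mechanism can be sketched numerically. The following is a minimal illustration with assumed parameters, not the paper's actual model: a one-degree-of-freedom oscillator driven at resonance, with velocity feedback through a tanh saturation. At low input levels the feedback cancels most of the damping (amplification); at high levels the tanh saturates, so the response grows far less than the 100-fold input change (compressive nonlinearity).

```python
import math

def simulate(drive_amp, t_end=60.0, dt=1e-3):
    # one-degree-of-freedom oscillator with saturating active feedback:
    #   x'' = -w0^2 x - gamma x' + g*tanh(x') + A sin(w0 t)
    # all parameter values are illustrative, chosen only so that gamma > g
    w0, gamma, g = 2.0 * math.pi, 0.6, 0.5
    x, v, peak = 0.0, 0.0, 0.0
    for n in range(int(t_end / dt)):
        t = n * dt
        a = (-w0 * w0 * x - gamma * v + g * math.tanh(v)
             + drive_amp * math.sin(w0 * t))
        v += dt * a           # semi-implicit Euler keeps the oscillator stable
        x += dt * v
        if t > t_end - 10.0:  # track response amplitude after ring-up
            peak = max(peak, abs(x))
    return peak

amp_soft = simulate(0.01)  # low-level tone: feedback is nearly linear (amplified)
amp_loud = simulate(1.0)   # high-level tone: tanh saturates (compressed)
growth = amp_loud / amp_soft  # much less than the 100x input change
```

The compressed growth of the output is the IO-curve signature the abstract refers to; reproducing the frequency dependence of 2TS additionally requires the two-tone interaction and frequency selectivity studied in the paper.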
This paper investigates the importance of temporal cues in the perception of speaker individuality and vocal emotion. Experiments on speaker and vocal-emotion recognition were carried out using an analysis/synthesis method for noise-vocoded speech (NVS). The temporal resolution of the NVS was controlled by varying the upper limit of the modulation frequency (0, 0.5, 1, 2, 4, 8, 16, 32, and 64 Hz). In addition, the role of temporal cues under different spectral-resolution conditions was investigated by varying the number of channels (4, 8, and 16). The results demonstrated that temporal resolution contributes to the recognition of both speaker and vocal emotion. Temporal cues are therefore important for the perception of not only linguistic information but also speaker individuality and vocal emotion. On the other hand, speaker-recognition performance was less sensitive to spectral resolution, at least for the limited set of stimuli in the present study. For vocal-emotion recognition, spectral resolution was shown to be important for recognizing only neutral, joy, and cold anger, but not sadness or hot anger. The modulation frequency band important for the perception of nonlinguistic information is suggested to be higher than that for linguistic information.
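The envelope-limiting step can be sketched as follows. This is a generic single-channel noise-vocoder fragment with assumed parameters (amplitude-modulated noise standing in for a bandpass speech channel, and a one-pole lowpass as the modulation-frequency limiter); it is not the authors' analysis/synthesis system.

```python
import math
import random

def lowpass(signal, cutoff_hz, fs):
    # one-pole lowpass used to limit the envelope's modulation frequency
    a = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    y, out = 0.0, []
    for s in signal:
        y = a * y + (1.0 - a) * s
        out.append(y)
    return out

fs = 8000
rng = random.Random(0)
# stand-in band signal: 10-Hz amplitude-modulated noise (a real vocoder
# would use a bandpass-filtered speech channel here)
band = [(1.0 + math.sin(2.0 * math.pi * 10.0 * n / fs)) * rng.uniform(-1, 1)
        for n in range(fs)]

env_64 = lowpass([abs(s) for s in band], cutoff_hz=64.0, fs=fs)   # 64-Hz limit
env_05 = lowpass([abs(s) for s in band], cutoff_hz=0.5, fs=fs)    # 0.5-Hz limit
carrier = [rng.uniform(-1, 1) for _ in range(fs)]
nvs = [e * c for e, c in zip(env_64, carrier)]  # envelope re-imposed on noise

# mean absolute successive difference: a crude measure of envelope detail
flux = lambda e: sum(abs(e[i + 1] - e[i]) for i in range(len(e) - 1)) / len(e)
```

Lowering the modulation cutoff visibly smooths the envelope, which is exactly the temporal-cue degradation the listening experiments manipulate.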
Directivity control using a loudspeaker array has been widely studied for various applications. Suppressing sidelobe levels is important for applications such as personal audio systems. In this paper, we propose a filter design method that uses a window-function shape as the desired directivity pattern to reduce sidelobe levels. The proposed method consists of three steps. The first step defines a cost function with a criterion for the directivity pattern. Next, filter coefficients for each loudspeaker are calculated and stored while the window-function shape of the desired directivity pattern is varied. Finally, we determine the optimum filter coefficients, those giving the best value of the cost function, by a full search at each frequency. To confirm the performance of the proposed method, we conducted experiments with a real six-element circular loudspeaker array with a radius of 0.055 m and evaluated its directivity. Compared with a conventional method, the maximum sidelobe level improved by about 2 dB, albeit with a wider mainlobe. We verified that using the window-function shape as the desired directivity pattern is more effective for sidelobe suppression than the conventional method.
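As a rough illustration of why window-function shapes help suppress sidelobes, the sketch below compares uniform and Hann weighting on an assumed 8-element linear array with half-wavelength spacing. This shows only the general taper-vs-sidelobe trade-off (including the mainlobe widening the abstract mentions); it does not reproduce the paper's circular array, cost function, or full-search design.

```python
import cmath
import math

N = 8  # assumed linear array, half-wavelength element spacing
uniform = [1.0] * N
hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def pattern(weights, u):
    # normalized array factor at u = sin(theta) for spacing d = lambda/2
    s = sum(w * cmath.exp(1j * math.pi * n * u) for n, w in enumerate(weights))
    return abs(s) / sum(weights)

def max_sidelobe(weights, u_min=0.5):
    # peak response outside the mainlobe region |sin(theta)| < u_min;
    # real weights give a magnitude symmetric in u, so scanning u >= 0 suffices
    return max(pattern(weights, u / 200.0)
               for u in range(int(u_min * 200), 201))

sl_uniform = max_sidelobe(uniform)  # uniform weights: high sidelobes
sl_hann = max_sidelobe(hann)        # Hann weights: lower sidelobes, wider beam
```

The Hann taper trades mainlobe width for sidelobe level, which is the same trade-off observed in the paper's measured 2 dB sidelobe improvement with a wider beam.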