Nonnegative matrix factorization (NMF) is a powerful technique for extracting meaningful patterns from an observed matrix and has been used in many audio signal processing applications. This article reviews the principle of NMF and some extensions based on a complex generative model, and presents their application to audio source separation.
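The basic idea of NMF can be sketched in a few lines. The example below is a minimal illustration, assuming the Euclidean-distance cost and the well-known multiplicative update rules; the abstract does not state which cost function or update scheme the article reviews, and the function name `nmf_multiplicative`, the toy data, and all parameter values are hypothetical.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-12, seed=0):
    """Approximate a nonnegative matrix V (F x T) as W @ H with W, H >= 0.

    Assumption: squared Euclidean cost with multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps   # basis patterns (e.g. spectra)
    H = rng.random((rank, n_cols)) + eps   # activations over time
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H) ** 2)
    return W, H, errors

# Toy "spectrogram": an exactly rank-3 nonnegative matrix.
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 50))
W, H, errors = nmf_multiplicative(V, rank=3)
```

In an audio setting, `V` would typically be an amplitude or power spectrogram, the columns of `W` spectral patterns, and the rows of `H` their time-varying gains.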
Recent work has shown that phase information is useful for further improving the performance of speech enhancement, source separation, and speech synthesis. In the speech enhancement field, combining amplitude and phase estimation improves the perceived quality more than amplitude estimation alone. In this paper, we review two harmonic-structure-based phase estimation methods that impose temporal and frequency constraints on the harmonic speech phase. In addition, we describe important parameters for phase estimation, such as the frame shift length and the window function of the short-time Fourier transform. Subjective experiments using listening tests and future work on phase processing are briefly described.
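To illustrate why the frame shift length matters, the sketch below uses the standard stationary sinusoidal model, in which the phase of the h-th harmonic advances by 2π·h·fo·R/fs over a frame shift of R samples. Whether the reviewed methods use exactly this form of temporal constraint is an assumption; the function name `predict_harmonic_phase` and all parameter values are hypothetical.

```python
import numpy as np

def predict_harmonic_phase(phase_prev, f0, harmonic, hop, fs):
    """Predict the phase of one harmonic a frame later, wrapped to (-pi, pi].

    Assumption: a stationary harmonic at frequency harmonic * f0 (Hz),
    frame shift `hop` in samples, sampling rate `fs` in Hz.
    """
    advance = 2.0 * np.pi * harmonic * f0 * hop / fs
    return np.angle(np.exp(1j * (phase_prev + advance)))

# Check against a synthetic complex sinusoid at the 3rd harmonic of 120 Hz.
fs, f0, harmonic, hop = 16000, 120.0, 3, 64
n = np.arange(2 * hop)
s = np.exp(1j * (2.0 * np.pi * harmonic * f0 * n / fs + 0.4))
phase_now = np.angle(s[0])
phase_next_true = np.angle(s[hop])
phase_next_pred = predict_harmonic_phase(phase_now, f0, harmonic, hop, fs)
```

A longer frame shift means a larger phase advance per frame, so any error in the fo estimate is amplified in the predicted phase.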
As the importance of the phase of the complex spectrogram has become widely recognized, many techniques have been proposed for handling it. However, several definitions and terminologies for the same concept can be found in the literature, which has confused beginners. In this paper, two major definitions of the short-time Fourier transform and their phase conventions are summarized to alleviate this confusion. A phase-aware signal-processing scheme based on phase conversion is also introduced with a set of executable MATLAB functions (https://doi.org/10/c3qb).
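The following sketch shows how two common STFT definitions can differ only in their phase reference. It assumes that one definition measures phase relative to the start of each window and the other relative to the signal's time origin; the abstract does not spell out the definitions, so this pairing, the function names, and the toy signal are assumptions. The code is plain NumPy and is not a port of the paper's MATLAB functions.

```python
import numpy as np

def stft_window_referenced(x, w, hop):
    """STFT whose phase is measured from the start of each window."""
    N = len(w)
    starts = np.arange(0, len(x) - N + 1, hop)
    frames = np.stack([x[s:s + N] * w for s in starts], axis=1)  # (N, frames)
    return np.fft.fft(frames, axis=0), starts

def stft_signal_referenced(x, w, hop):
    """STFT whose phase is measured from the time origin of the signal."""
    N = len(w)
    starts = np.arange(0, len(x) - N + 1, hop)
    k = np.arange(N)[:, None]
    X = np.empty((N, len(starts)), dtype=complex)
    for m, s in enumerate(starts):
        t = np.arange(s, s + N)                    # absolute sample indices
        kernel = np.exp(-2j * np.pi * k * t[None, :] / N)
        X[:, m] = kernel @ (x[t] * w)
    return X, starts

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
w = np.hanning(32)
X_win, starts = stft_window_referenced(x, w, hop=8)
X_sig, _ = stft_signal_referenced(x, w, hop=8)

# Phase conversion: a linear phase term in (frequency bin) x (frame position).
conversion = np.exp(-2j * np.pi * np.outer(np.arange(len(w)), starts) / len(w))
X_converted = X_win * conversion
```

Under these assumptions, the two spectrograms share the same magnitude and differ only by a deterministic phase factor, which is why mixing conventions silently breaks phase-based processing.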
The angklung is a traditional Indonesian musical instrument made entirely of bamboo. It usually consists of two or three rattle tubes that generate sound when they vibrate. The generated sound is resonated by a rattle resonance tube, which makes it louder. The rattle tube is carved in a traditional way from a piece of bamboo, with a length and diameter that are passed from generation to generation to produce the desired tone. In this investigation, we develop a mathematical model of sound generation by a rattle tube and formulate an equation for the frequency of the vibrating rattle tube in terms of its physical and geometrical parameters. Since the rattle tube is not perfectly cylindrical, its vibration frequency is derived from the frequency equation for a perfectly cylindrical tube, with the geometrical parameters modified to suit the shape of the rattle tube. This equation can determine the tone frequency for given geometrical parameters of the tube and explain the relationship between the generated tone frequency and the resonant frequency. The model also shows that the discrepancy between the calculated and generated frequencies of the rattle tube is within the response of human ears.
For musical instrument sounds containing partials, which are referred to as modes, the decay processes of the modes significantly affect the timbre of the instruments and characterize their sounds. However, accurately decomposing the modes around the onset is not an easy task, especially when the sounds have sharp onsets and contain non-modal percussive components such as the attack. This is because the sharp onsets of modes have peaky but broad spectra, which makes it difficult to remove the attack component. In this paper, an optimization-based method of modal decomposition is proposed to overcome this difficulty. The proposed method is formulated as a constrained optimization problem that enforces the perfect reconstruction property, which is important for accurate decomposition, and the causality of modes. Three numerical simulations and an application to real piano sounds confirm the performance of the proposed method.
This study investigated the cognitive biases related to the impression of voice pitch caused by changes in tonal quality. According to the vocal tube model, changing the vocal-tract length (VTL) systematically alters the tonal quality. In one experiment, the fundamental frequency (fo) of the speech samples was raised and lowered on a mel-scale axis. Then the spectral-frequency scale was expanded and contracted to simulate reducing and increasing the VTL. In a second experiment, the width of the fo range was changed in addition to changing the fo height and VTL scaling. Noise-vocoded speech samples were generated to measure the independent effects of the VTL scaling. The participants rated their impressions of the pitch using paired comparison. The results revealed a reversal of the relationship between impression of voice pitch and height of fo when the effects of fo height and VTL scaling on pitch impression were opposite to each other and when the range of the fo contour was equivalent to that of natural speech. VTL scaling played a dominant role in this reversal. However, as the fo contour became flat, this reversal phenomenon disappeared, and the fo height factor came to play the dominant role.
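Raising or lowering fo on a mel-scale axis can be sketched as follows. The abstract does not say which mel formula was used; this example assumes the widely used form m = 2595·log10(1 + f/700), and the function names, the shift size, and the toy contour are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert Hz to mel (assumed formula: 2595 * log10(1 + f / 700))."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def shift_fo_in_mel(fo_hz, shift_mel):
    """Shift voiced fo values by a constant number of mels.

    Frames with fo == 0 are treated as unvoiced and left at 0.
    """
    fo_hz = np.asarray(fo_hz, dtype=float)
    shifted = mel_to_hz(hz_to_mel(fo_hz) + shift_mel)
    return np.where(fo_hz > 0, shifted, 0.0)

# Toy fo contour in Hz; 0 marks an unvoiced frame.
contour = np.array([0.0, 100.0, 200.0])
raised = shift_fo_in_mel(contour, 20.0)
lowered = shift_fo_in_mel(contour, -20.0)
```

A constant shift in mel corresponds to a frequency-dependent shift in Hz, so higher fo values move by more hertz than lower ones.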
By making use of the extra particle velocity information, an array of vector sensors can achieve better direction-of-arrival (DOA) estimation performance than a conventional array of pressure sensors. However, most previous work on DOA estimation with vector-sensor arrays uses only the time-space statistical information available in the array signals and does not exploit the differences in the time-frequency signatures of the sources. In this paper, we develop a new approach that exploits the inherent time-frequency-space characteristics of the underlying vector-sensor array signal to achieve better DOA estimation performance even in a noisy and coherent environment with few snapshots. Our approach is based on spatial time-frequency distribution (STFD) information and efficiently combines all of the relevant STFD points by a joint approximate diagonalization approach, such as Jacobi rotation, to reduce the effect of noise and achieve the desired angular resolution. Computer simulations of several frequently encountered scenarios, such as multiple closely spaced coherent sources, indicate the superior DOA estimation resolution of the proposed approach compared with existing techniques. In addition, from a statistical point of view, the performance of the proposed approach is investigated more closely by considering the root mean square error (RMSE) versus SNR, number of snapshots, and number of sensors, and its higher DOA estimation accuracy is demonstrated.