The sound absorption characteristics obtained with horizontally arranged sound-absorptive strips on walls were evaluated objectively using acoustical indices determined from impulse responses calculated by finite-difference time-domain (FDTD) simulation. The subjective effect of the horizontal sound-absorptive strips (HSSs) was also investigated using Scheffé's paired comparison method. The results of the numerical case study confirmed that the frequency characteristics of the acoustic indices of rooms with HSSs varied significantly with the relative positions of the source and receiving points and with the arrangement height of the strips. A subjective evaluation experiment also clarified the differences in the absorptive effect of various strip arrangements on the reverberation inside rooms.
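As a rough illustration of the simulation approach above, the following is a minimal one-dimensional FDTD sketch that records an impulse response at a receiving point. The grid size, time step, source and receiver positions, and (rigid) boundary treatment are illustrative assumptions; the study's room model with absorptive wall strips is far more elaborate.

```python
import numpy as np

# Minimal 1-D FDTD sketch: propagate an acoustic impulse and record the
# pressure at a "receiving point" (illustrative assumptions throughout).
c = 343.0            # speed of sound [m/s]
dx = 0.05            # spatial step [m]
dt = dx / (2 * c)    # time step chosen to satisfy the CFL stability condition
n = 200              # number of grid points
steps = 400          # number of time steps

p = np.zeros(n)      # pressure at the current time step
p_prev = np.zeros(n) # pressure at the previous time step
p[n // 2] = 1.0      # impulse source at the centre of the domain

coeff = (c * dt / dx) ** 2
receiver = []        # impulse response sampled at the receiving point
for _ in range(steps):
    p_next = np.zeros(n)  # rigid (zero-pressure-update) boundaries at both ends
    p_next[1:-1] = (2 * p[1:-1] - p_prev[1:-1]
                    + coeff * (p[2:] - 2 * p[1:-1] + p[:-2]))
    p_prev, p = p, p_next
    receiver.append(p[n // 4])  # sample the field at grid index n // 4
```

From such an impulse response, acoustical indices (e.g. reverberation-related measures) can be derived, which is the objective-evaluation route the abstract describes.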
Auditory feedback plays a crucial role in the stable control of speaking and singing. Formant-transformed auditory feedback (TAF) is used to investigate the relationship between perturbations to formant frequencies and the compensatory response, in order to clarify the mechanism of auditory-speech motor control. Although previous studies on formant TAF applied linear predictive coding (LPC) to estimate formant frequencies, LPC estimates false formants for high-pitched voices. In this paper, we investigate how different vocal-tract spectrum estimation methods in real-time formant TAF affect the compensatory response of formant frequencies to perturbations. A phase equalization-based autoregressive exogenous model (PEAR) is applied to the TAF system as a formant estimation method that can estimate formant frequencies more accurately and robustly than LPC. Fifteen native Japanese speakers were asked to repeat the Japanese syllables /he/ or /hi/ while receiving feedback sounds whose formants F1 and F2 were transformed. Under the /he/ condition, the F1 compensatory response for PEAR was significantly larger than that for LPC, and the compensation error in the F1–F2 plane for PEAR was smaller than that for LPC. Our results suggest that PEAR can increase both the accuracy of formant frequency estimation and the naturalness of the transformed speech sound.
The intrusion of road traffic noise into scenic areas is a key issue in managing acoustic quality. Several studies have focused on acceptable sound levels for road traffic noise in such areas; however, most estimated acceptable levels from the dose-response relationship between sound levels and annoyance or ratings of acoustic comfort, and few investigated acceptable sound levels directly. We directly investigated the acceptable sound levels for road traffic noise in scenic areas in Japan through psychoacoustic experiments with a group of participants. Two simulated road traffic noises were used as target sounds, and four audio and video recordings were used as background conditions. Using the method of adjustment, the participants adjusted the playback level of each target to the maximum acceptable level while comparing it with the background sound level. The results showed that the acceptable sound levels cannot be explained by a single value or a simple signal-to-noise ratio (SNR). There was a clear tendency for a higher SNR, meaning that the road traffic noise could be heard more clearly, to be acceptable in quieter areas. The acceptable sound levels in scenic areas depend largely on the evaluators and the features of the areas.
Direct aeroacoustic simulations of the flow and sound around an instrument with an oscillating reed were performed on the basis of the compressible Navier–Stokes equations, along with experiments using an artificial blowing device. The measured reed displacement was used as a forced vibration in the computations. The predicted sound pressure spectrum shows that the level of the fundamental tone agrees well with the measured result. The numerical results showed that the lowest acoustic mode of clarinet-type reed instruments (the one-quarter-wavelength mode) was reproduced. Moreover, the sound generation mechanism was discussed in detail using the predicted gradient of the mass flow rate in the instrument. Compression and expansion were found to occur inside the mouthpiece, where flow separation occurs after the air jet from the reed channel exit spreads along the inner wall of the mouthpiece. In addition, vortex ring shedding attributable to the acoustic particle velocity around the open end of the instrument was found to occur, causing an expansion wave to radiate from the instrument.
The perception of segmental duration is crucial for distinguishing Japanese length contrasts. However, perceived duration may change under long reverberation, which adds a ``tail'' to sounds, making them be perceived as longer. In addition, since lengthened sounds overlap the following sounds, phoneme boundaries become blurred. In the current study, we investigated whether reverberation distorts the distinction of Japanese length contrasts for native Japanese and English listeners. The stimuli were nonword pairs (/baba/–/babaa/, /ata/–/atta/, and /ama/–/amma/) varying in duration along a continuum. A logistic function was used to model the perception. In distinguishing the vowel length contrast in word-final position, even native listeners identified the stimulus with the shortest vowel duration as a long-vowel word under reverberation. Regarding the perception of the geminate nasal, ``geminate'' responses increased with reverberation for native listeners, whereas ``singleton'' responses increased with reverberation for nonnative listeners. This difference may be attributed to the different prototypes of Japanese categories held by native and nonnative listeners. In addition, the results for nonnative listeners might be attributed to the difference in prosody between English and Japanese.
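For concreteness, the logistic modeling step can be sketched as follows. The duration values and response proportions are invented for illustration and are not the study's data, and a simple grid search stands in for a proper maximum-likelihood fit.

```python
import numpy as np

def logistic(d, d50, slope):
    """Proportion of 'long' responses as a function of segment duration d [ms]."""
    return 1.0 / (1.0 + np.exp(-slope * (d - d50)))

# Illustrative (invented) duration continuum and 'long' response proportions
durations = np.array([80, 100, 120, 140, 160, 180, 200], dtype=float)  # ms
p_long = np.array([0.05, 0.10, 0.30, 0.55, 0.80, 0.95, 0.98])

# Least-squares grid search for the category boundary (d50) and slope
best = None
for d50 in np.arange(100.0, 180.0, 1.0):
    for slope in np.arange(0.01, 0.20, 0.005):
        err = np.sum((logistic(durations, d50, slope) - p_long) ** 2)
        if best is None or err < best[0]:
            best = (err, d50, slope)
_, d50_hat, slope_hat = best
```

The fitted boundary d50 is the duration at which listeners are equally likely to respond ``short'' or ``long''; a reverberation-induced shift of this boundary is exactly the kind of effect the study quantifies.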
In this paper, we develop two corpora for speech synthesis research. Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we aim to develop Japanese voice corpora that are readily accessible not only to academic institutions but also to commercial companies. We construct the JSUT and JVS corpora, which are designed mainly for text-to-speech synthesis and voice conversion, respectively. The JSUT corpus contains 10 hours of reading-style speech uttered by a single speaker, and the JVS corpus contains 30 hours of speech in three styles uttered by 100 speakers. This paper describes how we designed the corpora and summarizes their specifications. The corpora are available on our project pages.
In recent single-channel speech enhancement, deep neural networks (DNNs) have played an important role in achieving high performance. One standard use of a DNN is to construct a mask-generating function for time-frequency (T-F) masking. To apply a mask in the T-F domain, the short-time Fourier transform (STFT) is usually used because of its well-understood and invertible nature. While the mask-generating regression function has been studied extensively, there is less research on the T-F transform from the viewpoint of speech enhancement. Since the performance of speech enhancement depends on both the T-F mask estimator and the T-F transform, investigating the T-F transform should be beneficial for designing a better enhancement system. In this paper, as a step toward a T-F transform that is optimal for speech enhancement, we experimentally investigated the effect of the STFT parameter settings on a DNN-based mask estimator. We conducted experiments using three types of DNN architecture with three types of loss function, and the results suggested that U-Net is robust to the parameter settings, whereas fully connected and BLSTM networks are not.
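To make the T-F masking pipeline concrete, here is a minimal numpy-only sketch in which an oracle ratio mask stands in for the DNN estimator (which is beyond an abstract-sized example). The FFT length and hop size are exactly the kind of STFT parameters whose settings the paper investigates; the specific values below are illustrative assumptions.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT, returning a (frames x bins) complex array."""
    win = np.hanning(n_fft)
    frames = [win * x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(X, n_fft=512, hop=128, length=None):
    """Inverse STFT via windowed overlap-add with squared-window normalization."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(X) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, spec in enumerate(X):
        out[i * hop:i * hop + n_fft] += win * np.fft.irfft(spec, n_fft)
        norm[i * hop:i * hop + n_fft] += win ** 2
    out = out / np.maximum(norm, 1e-8)
    return out[:length] if length is not None else out

fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)                       # toy "speech"
noise = 0.3 * np.random.default_rng(0).standard_normal(fs)
noisy = clean + noise

S, N = stft(clean), stft(noise)
mask = np.abs(S) / (np.abs(S) + np.abs(N) + 1e-8)          # oracle ratio mask
enhanced = istft(mask * stft(noisy), length=fs)            # mask, then invert
```

In an actual system the mask would be predicted by a DNN from the noisy spectrogram alone; the point here is that both the mask and the STFT parameters (n_fft, hop) shape the final enhanced signal.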