Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
Advance online publication
  • Ryohei Suzuki, Kanae Amino, Takayuki Arai
    Article ID: e24.07
    Published: 2024
    Advance online publication: April 26, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    Human speaker recognition performance can be degraded by various factors. Understanding these factors and the errors they cause is crucial for forensic applications. To study the effects of noisy environments on human speaker recognition, we conducted a listening experiment using speech samples of two words spoken by five male speakers and two noise types (speech-like noise and environmental noise in a boiler room) at three signal-to-noise ratios (∞, 0 dB, or −10 dB). The results suggested that the listeners tended to judge different speakers to be the same speaker rather than vice versa, and this tendency was also affected by the sex of the listener.

    Download PDF (963K)
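
The three signal-to-noise ratios named in the abstract above (∞, 0 dB, −10 dB) can be illustrated with a minimal sketch of noise mixing. This is not the authors' procedure: the helper names `mean_power`, `mix_at_snr`, and `snr_db_of` are hypothetical, and the sketch assumes SNR is defined as the ratio of mean signal power to mean noise power over the whole sample.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at the requested SNR in dB (hypothetical helper).

    Passing math.inf returns the clean speech unchanged.
    """
    if math.isinf(snr_db):
        return list(speech)
    # Choose gain g so that 10*log10(P_speech / (g^2 * P_noise)) == snr_db.
    gain = math.sqrt(
        mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10))
    )
    return [s + gain * n for s, n in zip(speech, noise)]


def snr_db_of(speech, noise):
    """SNR in dB between a speech sequence and a noise sequence."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


# Toy sequences standing in for real recordings (assumed data).
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, noise, -10)
```

Subtracting `speech` from `mixed` recovers the scaled noise, whose power relative to the speech gives the requested −10 dB.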
  • Irwansyah, Sho Otsuka, Seiji Nakagawa
    Article ID: e24.10
    Published: 2024
    Advance online publication: April 19, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    This study explores the impact of pinna hardness and vibrator placement on the efficacy of bone conduction through the pinna. Hearing thresholds of twelve participants, all without abnormal pinna conditions, were assessed across frequencies ranging from 250 Hz to 8 kHz, with vibrators positioned at three distinct locations: the front of the ear canal, the earlobe, and behind the cymba concha. Additionally, to manipulate hardness consistently under controlled conditions, four silicone ear models with Shore hardness values from 0A to 45A were used to examine vibrational energy transmission via an accelerometer fixed behind the ear canal. The results indicated that vibrator placement significantly influenced hearing thresholds, a pattern that was also observed in the silicone models. However, the anticipated correlation between pinna hardness and hearing thresholds was not significant within the human sample. This could be attributed to less variability in natural pinna hardness than expected. Although pinna hardness is known to vary among individuals, our study found this variation to be smaller than anticipated, suggesting that its influence on bone conduction may be less critical than that of other anatomical factors.

    Download PDF (3893K)
  • Yuki Ishizaka, Sho Otsuka, Seiji Nakagawa
    Article ID: e24.28
    Published: 2024
    Advance online publication: April 19, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    The medial olivocochlear reflex (MOCR) is reported to be modulated by the predictability of an upcoming sound occurrence. Here, the relationship between the MOCR and internal confidence in temporal anticipation, as evaluated by reaction time (RT), was examined. The timing predictability of the MOCR elicitor was manipulated by adding jitter to the preceding sounds. MOCR strength and RT were unchanged in the small (10%) jitter condition, whereas in the largest (40%) jitter condition MOCR strength decreased and RT increased significantly compared to the no-jitter condition. This similarity indicates that MOCR strength reflects confidence in anticipation, and that the predictive control of the MOCR and response execution share a common neural mechanism.

    Download PDF (449K)
  • Leo Misono, Kenji Muto
    Article ID: e24.19
    Published: 2024
    Advance online publication: April 13, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    Cicadas call loudly and interfere with traffic noise measurements. The frequency characteristics of some outdoor cicada sounds have been reported, but the background noise and the distance to the cicada have not been considered. The aim of this work was to accurately measure the frequency characteristics of the A-weighted sound pressure level of each robust cicada sound. The frequency characteristics of the /mi/ and /n/ sounds were measured in a free field. The dominant frequencies were 4.7 kHz for the /mi/ sound and 15 kHz for the /n/ sound, and the peak frequencies of both sounds were normally distributed.

    Download PDF (20064K)
  • Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Jun ...
    Article ID: e24.12
    Published: 2024
    Advance online publication: April 04, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    Evaluating synthetic speech generated by machines is a complicated process, as it involves judging along multiple dimensions including naturalness, intelligibility, and whether the intended purpose is fulfilled. While subjective listening tests conducted with human participants have been the gold standard for synthetic speech evaluation, their cost has also motivated the development of automated objective evaluation protocols. In this review, we first provide a historical view of listening test methodologies, from early in-lab comprehension tests to recent large-scale crowdsourced mean opinion score (MOS) tests. We then recap the development of automatic measures, ranging from signal-based metrics to model-based approaches that utilize deep neural networks or even the latest self-supervised learning techniques. We also describe the VoiceMOS Challenge series, a scientific event we founded that aims to promote the development of data-driven synthetic speech evaluation. Finally, we provide insights into unsolved issues in this field as well as future prospects. This review is expected to serve as an entry point for early academic researchers to enrich their knowledge in this field, and for speech synthesis practitioners to catch up on the latest developments.

    Download PDF (455K)
  • Naofumi Aoki
    Article ID: e24.15
    Published: 2024
    Advance online publication: April 02, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    This study investigated a technique that replicates waveform fluctuations of speech signals by combining partials of limited Q, a parameter defined as the ratio of the center frequency of a resonance to its bandwidth. This paper shows that the proposed technique can potentially generate amplitude and period fluctuations with 1/f-like frequency characteristics, which are considered to be one of the indices reflecting the naturalness of human speech.

    Download PDF (512K)
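
The definition of Q given in the abstract above (center frequency divided by bandwidth) can be written out as a small sketch. The function names and the example values of 1000 Hz and 50 Hz are illustrative assumptions, not figures from the paper.

```python
def q_factor(center_hz, bandwidth_hz):
    """Q as the center frequency of a resonance divided by its bandwidth."""
    if bandwidth_hz <= 0:
        raise ValueError("bandwidth must be positive")
    return center_hz / bandwidth_hz


def bandwidth_from_q(center_hz, q):
    """Inverse relation: bandwidth implied by a given Q at a center frequency."""
    if q <= 0:
        raise ValueError("Q must be positive")
    return center_hz / q


# Assumed example: a partial centered at 1000 Hz with a 50 Hz bandwidth.
example_q = q_factor(1000.0, 50.0)
```

A lower Q at the same center frequency means a wider bandwidth, which is the sense in which a partial of "limited Q" is not a perfectly sharp spectral line.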
  • Takayuki Hidaka, Noriko Nishihara, Kazunori Suzuki, Takehiko Nakagawa
    Article ID: e23.76
    Published: 2024
    Advance online publication: March 23, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    This paper examined the favorable reverberation time in concert halls for orchestral music. Anechoic music sources created by session recording were reproduced by a virtual orchestra with 45 loudspeakers set on concert hall stages and were recorded with a 32-channel spherical microphone at audience seats. Four orchestral music excerpts from the classical, romantic, and contemporary periods were chosen. Using fourth-order Ambisonics playback in the laboratory, a series of psychological experiments was conducted. Twenty-one music experts judged the reverberance and clarity of the presented sound. It was found that the mid-frequency reverberation time RTM (octave band average for 500 and 1000 Hz) and the early decay time EDTM are both highly correlated with the reverberance, and that their favorable values are determined by the tempo (the speed or pace of the music), not by the chronological classification of the music. For the tempo from Presto to Allegro, the favorable RTM ranges from 1.7 to 2.2 s, and if extrapolation of this result is assumed, the favorable RTM ranges from 2.0 to 2.2 s from Presto to Andante.

    Download PDF (737K)
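
The abstract above describes RTM as the octave band average for 500 and 1000 Hz. A minimal sketch of that average follows, assuming a plain arithmetic mean of the two band values. The function name and the band reverberation times below are made up for illustration and are not measurements from the paper.

```python
def mid_frequency_rt(rt_by_band_hz):
    """Arithmetic mean of the 500 Hz and 1000 Hz octave-band values.

    rt_by_band_hz maps an octave-band center frequency in Hz to a
    reverberation time in seconds (hypothetical input format).
    """
    return (rt_by_band_hz[500] + rt_by_band_hz[1000]) / 2


# Assumed example values, in seconds, for a single hall.
hall_rt = {250: 2.3, 500: 2.1, 1000: 1.9, 2000: 1.7}
rt_m = mid_frequency_rt(hall_rt)
```

The same function could be applied to early decay times per band to obtain a mid-frequency EDT value, under the same averaging assumption.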
  • Ryo Nishibori, Harutaka Nakagawa, Kazuki Shin'ya, Yuta Tamai, Yuki Ito ...
    Article ID: e24.03
    Published: 2024
    Advance online publication: March 07, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    Child neglect increases the risk of developing social communication disorders in adulthood. This study measured how maternal separation (MS) affects vocal communication in adult Mongolian gerbils, which produce a rich vocal repertoire during social interactions. The vocalizations produced by two adult animals when they first met were recorded and analyzed. The results showed that gerbils that had undergone MS vocalized significantly more, and the effect was more prominent in aggressive vocalizations than in non-aggressive vocalizations. These changes suggest that MS affects social interactions and demonstrate the potential of the gerbil as a model animal for early stress-related social communication disorders.

    Download PDF (489K)
  • Sanae Matsui
    Article ID: e23.83
    Published: 2024
    Advance online publication: March 16, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    Non-native speakers exhibit distinct speech characteristics from native speakers, referred to as foreign accents. Previous studies have shown that foreign-accented speech can be more easily understood than native speech when the native language of the talker matches that of the listener (e.g., Spanish-accented English perceived by Spanish native speakers) due to acoustic-phonetic similarities between the speech input and the listener’s own accent. The present study applied this idea to a case where the native languages of the talker and the listeners differ but where the accents of the talker and listener could share acoustic-phonetic similarities (Spanish-accented English perceived by Japanese native speakers). We examined whether English words with a Spanish accent were recognized more quickly when the stimuli were acoustically closer to the accent of Japanese native listeners than those with Received Pronunciation were. A word identification experiment was conducted, where Japanese native speakers heard stimuli with Received Pronunciation and a Spanish accent. The results confirmed that the acoustic similarity somewhat facilitated word recognition, even for stimuli with a foreign accent. However, this advantage did not exceed the recognition of stimuli with a native accent. These results suggest a persistent bias towards easier recognition of stimuli produced by native speakers.

    Download PDF (1250K)
  • Shinsuke Nakanishi
    Article ID: e24.11
    Published: 2024
    Advance online publication: March 08, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    An acoustic metasurface (AMS) provides broadband sound absorption through a planar periodic array that combines small resonant modules tuned to various frequencies. This study introduces a formulation of the sound absorption coefficient of the AMS and discusses calculated examples by comparing them with the measured sound absorption characteristics of an AMS consisting of a planar array of small Helmholtz resonators, each of which has a multiply folded long neck and an airtight cavity.

    Download PDF (1263K)
  • Sei Ueno, Akinobu Lee
    Article ID: e23.70
    Published: 2024
    Advance online publication: February 29, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    This paper presents simple multi-setting log Mel-scale filter bank (lmfb) training methods to fill the gap between real speech and synthesized speech in automatic speech recognition (ASR) data augmentation. While end-to-end ASR has suffered from a lack of sufficient real speech data, its performance has been significantly improved by a data synthesis technique utilizing a text-to-speech (TTS) system. However, the speech generated by the TTS model is often monotonous and lacks the natural variations of real speech, negatively impacting ASR performance. We propose using multi-setting lmfb features in a data augmentation scheme to mitigate this problem. Multiple lmfb features are extracted with multiple STFT parameter settings taken from well-known parameters for both ASR and TTS tasks. In addition, we propose training a single TTS model using multi-setting lmfb features with the setting ID embedded in the text-to-Mel network. Experimental evaluations showed that both proposed multi-setting training methods achieved better ASR performance than the baseline single-setting augmentation methods.

    Download PDF (1802K)
  • Kanta Nakamura, Naho Konoike, Takeshi Nishimura
    Article ID: e23.85
    Published: 2024
    Advance online publication: February 16, 2024
    JOURNAL OPEN ACCESS ADVANCE PUBLICATION

    The tongue plays a major role in speech production. Comparisons of the tongue muscle fiber architecture between humans and nonhuman primates are required to understand the evolutionary acquisition of tongue deformability in human speech. In this study, we performed diffusion-weighted imaging of flash-frozen tongue specimens from macaques, a representative animal model, to visualize the three-dimensional architecture of the intrinsic muscles. The procedures and scanning methods used in this study can also be applied to non-model animals, and are expected to provide quantified data for their tongue architecture to understand the evolutionarily derived features of human tongue deformability.

    Download PDF (2387K)