Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
Volume 45, Issue 4
INVITED REVIEW
  • Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Jun ...
    2024 Volume 45 Issue 4 Pages 161-183
    Published: July 01, 2024
    Released on J-STAGE: July 01, 2024
    Advance online publication: April 04, 2024
    JOURNAL OPEN ACCESS

    Evaluating synthetic speech generated by machines is a complicated process, as it involves judging along multiple dimensions including naturalness, intelligibility, and whether the intended purpose is fulfilled. While subjective listening tests conducted with human participants have been the gold standard for synthetic speech evaluation, their cost has also motivated the development of automated objective evaluation protocols. In this review, we first provide a historical view of listening test methodologies, from early in-lab comprehension tests to recent large-scale crowdsourced mean opinion score (MOS) tests. We then recap the development of automatic measures, ranging from signal-based metrics to model-based approaches that utilize deep neural networks or even the latest self-supervised learning techniques. We also describe the VoiceMOS Challenge series, a scientific event we founded that aims to promote the development of data-driven synthetic speech evaluation. Finally, we provide insights into unsolved issues in this field as well as future prospects. This review is expected to serve as an entry point for early academic researchers to enrich their knowledge in this field, as well as for speech synthesis practitioners to catch up on the latest developments.
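
As a rough illustration of the MOS tests mentioned in the abstract above, the sketch below averages listener ratings per system and attaches a 95% confidence half-width. The system names, the ratings, and the normal-approximation interval are all assumptions made for this example; they are not taken from the review.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, confidence half-width) for one system.

    The half-width uses a normal approximation (z = 1.96 for roughly 95%),
    which is an assumption of this sketch, not a method from the review.
    """
    mean = statistics.fmean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical 1-5 naturalness ratings from eight listeners per system.
ratings_by_system = {
    "system_A": [4, 5, 4, 3, 4, 5, 4, 4],
    "system_B": [2, 3, 3, 2, 4, 3, 2, 3],
}

results = {name: mos_with_ci(r) for name, r in ratings_by_system.items()}
for name, (mos, half_width) in results.items():
    print(f"{name}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

Real listening tests usually need more care than this, for example screening unreliable listeners and accounting for per-listener bias.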

PAPERS
  • Yosuke Yasuda, Seiya Nishimura, Yu Kamiya, Makoto Morinaga
    2024 Volume 45 Issue 4 Pages 184-194
    Published: July 01, 2024
    Released on J-STAGE: July 01, 2024
    JOURNAL OPEN ACCESS

    Through three-dimensional wave-based numerical analysis, the propagation characteristics of road traffic noise from embankment roads were investigated, focusing on the effects of the side slope angle and height of the embankment. When the prediction plane was set as a cross-section orthogonal to the lane direction, the distribution of A-weighted sound pressure level differences between embankments with a slope and a right-angle wedge showed relatively high values in the region corresponding to the paths where diffracted sound from the edge of the embankment reflects off the ground surface, regardless of the embankment height, slope angle, and position of the prediction plane. When the embankment height and slope angle were the same, the level difference distribution was nearly the same regardless of the position of the sound source and orthogonal prediction plane. Based on these characteristics, an empirical correction formula was proposed, applicable to an arbitrary cross-section orthogonal to the embankment lane direction, representing the effect of the embankment slope angle in terms of the A-weighted sound pressure level difference. By appropriately setting the values of the coefficients required in the formula, both the maximum and average errors in the prediction plane can be reduced without significantly amplifying either of them.

  • Sei Ueno, Akinobu Lee
    2024 Volume 45 Issue 4 Pages 195-203
    Published: July 01, 2024
    Released on J-STAGE: July 01, 2024
    Advance online publication: February 29, 2024
    JOURNAL OPEN ACCESS

    This paper presents simple multi-setting log Mel-scale filter bank (lmfb) training methods to fill the gap between real speech and synthesized speech in automatic speech recognition (ASR) data augmentation. While end-to-end ASR has suffered from a lack of sufficient real speech data, its performance has been significantly improved by a data synthesis technique utilizing a text-to-speech (TTS) system. However, the speech generated by the TTS model is often monotonous and lacks the natural variations in real speech, negatively impacting ASR performance. We propose using multi-setting lmfb features for a data augmentation scheme to mitigate this problem. Multiple lmfb features are extracted with multiple STFT parameter settings that are obtained from well-known parameters for both ASR and TTS tasks. In addition, we propose training a single TTS model using multi-setting lmfb features with its setting ID embedded in the text-to-Mel network. Experimental evaluations showed that both proposed multi-setting training methods achieved better ASR performance than the baseline single-setting training augmentation methods.
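
The idea of extracting lmfb features under several STFT settings can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline: the setting names, window and hop sizes, the 8 kHz sampling rate, the number of Mel bins, and the test tone are all hypothetical values chosen to keep the example small.

```python
import cmath
import math

# Hypothetical STFT settings (window/hop in samples at 8 kHz). These are
# illustrative values only, not the settings used in the paper.
SETTINGS = {
    "asr_like": {"win": 200, "hop": 80},   # 25 ms window, 10 ms shift
    "tts_like": {"win": 400, "hop": 100},  # 50 ms window, 12.5 ms shift
}


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular Mel filters over the n_fft // 2 + 1 non-negative bins."""
    top = hz_to_mel(sr / 2.0)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    freqs = [k * sr / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in freqs])
    return bank


def power_spectrum(frame):
    """Hann-windowed power spectrum via a naive DFT (fine for short frames)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spec.append(abs(acc) ** 2)
    return spec


def log_mel(signal, sr, win, hop, n_mels):
    """Log Mel filter bank features for one STFT setting (no padding)."""
    bank = mel_filterbank(n_mels, win, sr)
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        power = power_spectrum(signal[start:start + win])
        feats.append([math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
                      for row in bank])
    return feats


def extract_multi_setting(signal, sr, settings, n_mels=8):
    """Extract one feature sequence per setting from the same waveform."""
    return {name: log_mel(signal, sr, s["win"], s["hop"], n_mels)
            for name, s in settings.items()}


sr = 8000
signal = [math.sin(2.0 * math.pi * 440.0 * t / sr) for t in range(800)]
features = extract_multi_setting(signal, sr, SETTINGS)
for name, frames in features.items():
    print(name, len(frames), "frames x", len(frames[0]), "Mel bins")
```

The same waveform yields feature sequences with different frame counts and time-frequency resolution per setting, which is the kind of variation the augmentation scheme relies on. A practical system would use an FFT-based library rather than the naive DFT shown here.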

  • Takayuki Hidaka, Noriko Nishihara, Kazunori Suzuki, Takehiko Nakagawa
    2024 Volume 45 Issue 4 Pages 204-215
    Published: July 01, 2024
    Released on J-STAGE: July 01, 2024
    Advance online publication: March 23, 2024
    JOURNAL OPEN ACCESS

    This paper examined the favorable reverberation time in concert halls for orchestral music. Anechoic music sources created by session recording were reproduced by a virtual orchestra with 45 loudspeakers set on concert hall stages and were recorded with a 32-channel spherical microphone at audience seats. Four orchestral music excerpts from the classical, romantic, and contemporary periods were chosen. Using fourth-order Ambisonics playback in the laboratory, a series of psychological experiments was conducted. Twenty-one music experts judged the reverberance and clarity of the presented sound. It was found that the mid-frequency reverberation time RTM (octave band average for 500 and 1,000 Hz) and the early decay time EDTM are both highly correlated with the reverberance, and that their favorable values are determined by the tempo (the speed or pace of the music), not by the chronological classification of the music. For the tempo from Presto to Allegro, the favorable RTM ranges from 1.7 to 2.2 s, and if extrapolation of this result is assumed, the favorable RTM ranges from 2.0 to 2.2 s from Presto to Andante.

  • Sanae Matsui
    2024 Volume 45 Issue 4 Pages 216-223
    Published: July 01, 2024
    Released on J-STAGE: July 01, 2024
    Advance online publication: March 16, 2024
    JOURNAL OPEN ACCESS

    Non-native speakers exhibit distinct speech characteristics from native speakers, referred to as foreign accents. Previous studies have shown that foreign-accented speech can be more easily understood than native speech when the native language of the talker matches that of the listener (e.g., Spanish-accented English perceived by Spanish native speakers) due to acoustic-phonetic similarities between the speech input and the listener's own accent. The present study applied this idea to a case where the native languages of the talker and the listeners differ but where the accents of the talker and listener could share acoustic-phonetic similarities (Spanish-accented English perceived by Japanese native speakers). We examined whether English words with a Spanish accent were recognized more quickly when the stimuli were acoustically closer to the accent of Japanese native listeners than those with Received Pronunciation (RP) were. A word identification experiment was conducted, where Japanese native speakers heard stimuli with RP and a Spanish accent. The results confirmed that the acoustic similarity somewhat facilitated word recognition, even for stimuli with a foreign accent. However, this advantage did not exceed the recognition of stimuli with a native accent. These results suggest a persistent bias towards easier recognition of stimuli produced by native speakers.

TECHNICAL REPORT
  • Kanta Nakamura, Naho Konoike, Takeshi Nishimura
    2024 Volume 45 Issue 4 Pages 224-229
    Published: July 01, 2024
    Released on J-STAGE: July 01, 2024
    Advance online publication: February 16, 2024
    JOURNAL OPEN ACCESS

    The tongue plays a major role in speech production. Comparisons of the tongue muscle fiber architecture between humans and nonhuman primates are required to understand the evolutionary acquisition of tongue deformability in human speech. In this study, we performed diffusion-weighted imaging of flash-frozen tongue specimens from macaques, a representative animal model, to visualize the three-dimensional architecture of the intrinsic muscles. The procedures and scanning methods used in this study can also be applied to non-model animals, and are expected to provide quantified data for their tongue architecture to understand the evolutionarily derived features of human tongue deformability.

ACOUSTICAL LETTERS