Acoustical Science and Technology

INVITED REVIEW

A review on subjective and objective evaluation of synthetic speech

Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Jun ...

2024 Volume 45 Issue 4 Pages 161-183
Published: July 01, 2024
Released on J-STAGE: July 01, 2024
Advance online publication: April 04, 2024

DOI: https://doi.org/10.1250/ast.e24.12

JOURNAL OPEN ACCESS

Evaluating synthetic speech generated by machines is a complicated process, as it involves judging along multiple dimensions including naturalness, intelligibility, and whether the intended purpose is fulfilled. While subjective listening tests conducted with human participants have been the gold standard for synthetic speech evaluation, their costly process design has also motivated the development of automated objective evaluation protocols. In this review, we first provide a historical view of listening test methodologies, from early in-lab comprehension tests to recent large-scale crowdsourced mean opinion score (MOS) tests. We then recap the development of automatic measures, ranging from signal-based metrics to model-based approaches that utilize deep neural networks or even the latest self-supervised learning techniques. We also describe the VoiceMOS Challenge series, a scientific event we founded that aims to promote the development of data-driven synthetic speech evaluation. Finally, we provide insights into unsolved issues in this field as well as future prospects. This review is expected to serve as an entry point for early academic researchers to enrich their knowledge in this field, as well as for speech synthesis practitioners to catch up on the latest developments.
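
For readers unfamiliar with the term, a mean opinion score is the average of the ratings that listeners give to a system's speech samples. The sketch below is illustrative only and is not taken from the review: the 1-to-5 rating scale, the system names, and the ratings are assumptions.

```python
from statistics import mean


def mean_opinion_scores(ratings):
    """Return the average listener rating for each system.

    `ratings` maps a system name to a list of listener scores,
    assumed here to lie on a 1-to-5 scale.
    """
    return {system: mean(scores) for system, scores in ratings.items()}


# Hypothetical ratings: one entry per listener judgment.
ratings = {
    "system_a": [4, 5, 4, 3],
    "system_b": [2, 3, 3, 2],
}

mos = mean_opinion_scores(ratings)
for system, score in mos.items():
    print(system, score)
```

In practice a listening test would also report confidence intervals and control for listener and utterance effects, which this sketch leaves out.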

PAPERS

Three-dimensional numerical investigation on propagation characteristics of road traffic noise from an embankment road: Construction of correction formula for the effect of side slope angle

Yosuke Yasuda, Seiya Nishimura, Yu Kamiya, Makoto Morinaga

2024 Volume 45 Issue 4 Pages 184-194
Published: July 01, 2024
Released on J-STAGE: July 01, 2024

DOI: https://doi.org/10.1250/ast.e23.54

JOURNAL OPEN ACCESS

Through three-dimensional wave-based numerical analysis, the propagation characteristics of road traffic noise from embankment roads were investigated, focusing on the effects of the side slope angle and height of the embankment. When the prediction plane was set as an orthogonal cross-section in the lane direction, the distribution of A-weighted sound pressure level differences between embankments with a slope and a right-angle wedge showed relatively high values in the region corresponding to the paths where diffracted sound from the edge of the embankment reflects off the ground surface, regardless of the embankment height, slope angle, and position of the prediction plane. When the embankment height and slope angle were the same, the level difference distribution was nearly the same regardless of the position of the sound source and orthogonal prediction plane. Based on these characteristics, an empirical correction formula was proposed, applicable to an arbitrary orthogonal cross-section to the embankment lane direction, representing the effect of the embankment slope angle in terms of the A-weighted sound pressure level difference. By appropriately setting the values of the coefficients required in the formula, both the maximum and average errors in the prediction plane can be reduced without significantly amplifying either of them.

Multi-setting acoustic feature training for data augmentation of speech recognition

Sei Ueno, Akinobu Lee

2024 Volume 45 Issue 4 Pages 195-203
Published: July 01, 2024
Released on J-STAGE: July 01, 2024
Advance online publication: February 29, 2024

DOI: https://doi.org/10.1250/ast.e23.70

JOURNAL OPEN ACCESS

This paper presents simple multi-setting log Mel-scale filter bank (lmfb) training methods to fill the gap between real speech and synthesized speech in automatic speech recognition (ASR) data augmentation. While end-to-end ASR has been facing the lack of a sufficient amount of real speech data, its performance has been significantly improved by a data synthesis technique utilizing a TTS system. However, the generated speech from the TTS model is often monotonous and lacks the natural variations in real speech, negatively impacting ASR performance. We propose using multi-setting lmfb features for a data augmentation scheme to mitigate this problem. Multiple lmfb features are extracted with multiple STFT parameter settings that are obtained from well-known parameters for both ASR and TTS tasks. In addition, we also propose training a single TTS model using multi-setting lmfb features with its setting ID embedded in the text-to-Mel network. Experimental evaluations showed that both proposed multi-setting training methods achieved better ASR performance than the baseline single-setting training augmentation methods.

Reexamination of the favorable reverberation time of concert halls measured in a 3D synthesized sound field

Takayuki Hidaka, Noriko Nishihara, Kazunori Suzuki, Takehiko Nakagawa

2024 Volume 45 Issue 4 Pages 204-215
Published: July 01, 2024
Released on J-STAGE: July 01, 2024
Advance online publication: March 23, 2024

DOI: https://doi.org/10.1250/ast.e23.76

JOURNAL OPEN ACCESS

This paper examined the favorable reverberation time in concert halls for orchestral music. Anechoic music sources created by session recording were reproduced by a virtual orchestra with 45 loudspeakers set on concert hall stages and were recorded with a 32-channel spherical microphone at audience seats. Four orchestral music excerpts from the classical, romantic, and contemporary periods were chosen. By fourth-order Ambisonics playback in the laboratory, a series of psychological experiments was conducted. Twenty-one music experts judged the reverberance and clarity of the presented sound. It was found that the mid-frequency reverberation time RT_M (octave band average for 500 and 1,000 Hz) and the early decay time EDT_M are both highly correlated with the reverberance, and their favorable values are determined by the tempo (the speed or pace of the music), not by the chronological classification of the music. For the tempo from Presto to Allegro, the favorable RT_M ranges from 1.7 to 2.2 s, and if extrapolation of this result is assumed, the favorable RT_M ranges from 2.0 to 2.2 s from Presto to Andante.
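
The mid-frequency quantity RT_M is described above as the average over the 500 Hz and 1,000 Hz octave bands. A minimal sketch of that averaging follows; the function name and the band values are hypothetical and are not measurements from the paper.

```python
def mid_frequency_average(value_500_hz, value_1000_hz):
    """Average a room-acoustic quantity over the 500 Hz and 1,000 Hz octave bands."""
    return (value_500_hz + value_1000_hz) / 2.0


# Hypothetical octave-band reverberation times in seconds.
rt_500 = 1.9
rt_1000 = 2.1

rt_m = mid_frequency_average(rt_500, rt_1000)
print(f"RT_M = {rt_m:.2f} s")
```

The same helper would apply to EDT_M if it is averaged over the same two bands, which is assumed here rather than stated in the abstract.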

The role of acoustic similarity in listening to foreign-accented speech: Recognition of Spanish-accented English words by Japanese native listeners

Sanae Matsui

2024 Volume 45 Issue 4 Pages 216-223
Published: July 01, 2024
Released on J-STAGE: July 01, 2024
Advance online publication: March 16, 2024

DOI: https://doi.org/10.1250/ast.e23.83

JOURNAL OPEN ACCESS

Non-native speakers exhibit distinct speech characteristics from native speakers, referred to as foreign accents. Previous studies have shown that foreign-accented speech can be more easily understood than native speech when the native language of the talker matches that of the listener (e.g., Spanish-accented English perceived by Spanish native speakers) due to acoustic-phonetic similarities between the speech input and the listener's own accent. The present study applied this idea to a case where the native languages of the talker and the listeners differ but where the accents of the talker and listener could share acoustic-phonetic similarities (Spanish-accented English perceived by Japanese native speakers). We examined whether English words with a Spanish accent were recognized more quickly when the stimuli were acoustically closer to the accent of Japanese native listeners than those with Received Pronunciation (RP) were. A word identification experiment was conducted, where Japanese native speakers heard stimuli with RP and a Spanish accent. The results confirmed that the acoustic similarity somewhat facilitated word recognition, even for stimuli with a foreign accent. However, this advantage did not exceed the recognition of stimuli with a native accent. These results suggest a persistent bias towards easier recognition of stimuli produced by native speakers.

TECHNICAL REPORT

Three-dimensional reconstruction of intrinsic tongue muscles of macaques using diffusion-weighted imaging of flash-frozen specimens

Kanta Nakamura, Naho Konoike, Takeshi Nishimura

2024 Volume 45 Issue 4 Pages 224-229
Published: July 01, 2024
Released on J-STAGE: July 01, 2024
Advance online publication: February 16, 2024

DOI: https://doi.org/10.1250/ast.e23.85

JOURNAL OPEN ACCESS

The tongue plays a major role in speech production. Comparisons of the tongue muscle fiber architecture between humans and nonhuman primates are required to understand the evolutionary acquisition of tongue deformability in human speech. In this study, we performed diffusion-weighted imaging of flash-frozen tongue specimens from macaques, a representative animal model, to visualize the three-dimensional architecture of the intrinsic muscles. The procedures and scanning methods used in this study can also be applied to non-model animals, and are expected to provide quantified data for their tongue architecture to understand the evolutionarily derived features of human tongue deformability.

ACOUSTICAL LETTERS

Effects of maternal separation on adult vocal communication: A Mongolian gerbil (Meriones unguiculatus) study

Ryo Nishibori, Harutaka Nakagawa, Kazuki Shin'ya, Yuta Tamai, Yuki Ito ...

2024 Volume 45 Issue 4 Pages 230-233
Published: July 01, 2024
Released on J-STAGE: July 01, 2024
Advance online publication: March 07, 2024

DOI: https://doi.org/10.1250/ast.e24.03

JOURNAL OPEN ACCESS

Child neglect increases the risk of developing social communication disorders in adulthood. This study measured how maternal separation (MS) affects vocal communication in adult Mongolian gerbils, which produce a rich vocal repertoire during social interactions. The vocalizations produced by two adult animals when they first met were recorded and analyzed. The results showed that gerbils that had undergone MS vocalized significantly more, and the effect was more prominent in aggressive vocalizations than in non-aggressive vocalizations. These changes suggest that MS affects social interactions and demonstrate the potential of the gerbil as a model animal for early stress-related social communication disorders.

Broadband sound absorption by acoustic metasurface of planar array of small Helmholtz resonators

Shinsuke Nakanishi

2024 Volume 45 Issue 4 Pages 234-237
Published: July 01, 2024
Released on J-STAGE: July 01, 2024
Advance online publication: March 08, 2024

DOI: https://doi.org/10.1250/ast.e24.11

JOURNAL OPEN ACCESS

An acoustic metasurface (AMS) provides broadband sound absorption by means of a planar periodic array combining small resonant modules tuned to various frequencies. This study introduces a formulation of the sound absorption coefficient of the AMS and discusses calculated examples by comparing them with the measured sound absorption characteristics of an AMS consisting of a planar array of small Helmholtz resonators, each of which has a multiply folded long neck and an airtight cavity.

Frequency characteristics of amplitude and period fluctuations in synthesized speech including waveform fluctuations made from tuned band noises

Naofumi Aoki

2024 Volume 45 Issue 4 Pages 238-241
Published: July 01, 2024
Released on J-STAGE: July 01, 2024
Advance online publication: April 02, 2024

DOI: https://doi.org/10.1250/ast.e24.15

JOURNAL OPEN ACCESS

This study investigated a technique that replicates the waveform fluctuations of speech signals by combining partials of limited Q, a parameter defined as the ratio of the center frequency of a resonance to its bandwidth. This paper shows that the proposed technique may generate amplitude and period fluctuations with 1/f-like frequency characteristics, which are considered to be one of the indices reflecting the naturalness of human speech.
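
The definition of Q given above, center frequency divided by bandwidth, can be written out as a short sketch. The function name and the example values are hypothetical and are not taken from the paper.

```python
def q_factor(center_frequency_hz, bandwidth_hz):
    """Return Q, the center frequency of a resonance divided by its bandwidth."""
    if bandwidth_hz <= 0:
        raise ValueError("bandwidth must be positive")
    return center_frequency_hz / bandwidth_hz


# Hypothetical resonance: 1 kHz center frequency with a 50 Hz bandwidth.
q = q_factor(1000.0, 50.0)
print(q)
```

A smaller Q corresponds to a wider band around the same center frequency, which is the sense in which the partials are described as having limited Q.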

Abstracts of Papers in the Journal of the Acoustical Society of Japan (J)

2024 Volume 45 Issue 4 Pages 243
Published: July 01, 2024
Released on J-STAGE: July 01, 2024

DOI: https://doi.org/10.1250/ast.e23.904

JOURNAL OPEN ACCESS
