Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
Volume 46, Issue 1
—Special Issue on Speech Diversity and Its Applications—
Displaying 1-20 of 20 articles from this issue
PAPERS
  • Shota Okubo, Toshiharu Horiuchi
    2025 Volume 46 Issue 1 Pages 1-10
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: September 21, 2024
    JOURNAL OPEN ACCESS

    The finite difference time domain (FDTD) method has been proposed and used for sound field simulation. To reproduce actual sound wave propagation in sound field simulations, it is necessary to apply the radiation characteristics of the source. With the FDTD method, radiation characteristics can be applied by setting sound pressure in a dense grid arrangement. However, conventional techniques for capturing radiation characteristics use a sparse array of microphones and are considered insufficient for FDTD simulation. Furthermore, the technique required to apply captured acoustic signals in a dense grid arrangement with the FDTD method has not been considered. In this paper, we propose a novel hardware and software system that captures the radiation characteristics on a dense grid and applies them to the FDTD method, while controlling the sound wave propagation with a non-propagation region. The proposed system yields average differences from measured values of 1.8 dB in sound pressure, 0.04 ms in propagation time, 700 Hz in center frequency, and 3.5 dB in log-spectral distortion, making it more accurate than conventional techniques. These results show that the system is useful for improving the accuracy of sound wave propagation reproduction in sound field simulation.
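
    The staggered-grid update at the heart of the FDTD method can be sketched in one dimension as follows (an illustrative sketch with assumed grid parameters and a simple impulse source, not the authors' implementation):

```python
import numpy as np

# Minimal 1-D acoustic FDTD sketch on a staggered grid (illustrative only).
c, rho = 343.0, 1.2          # speed of sound [m/s], air density [kg/m^3]
dx = 0.01                    # grid spacing [m]
dt = dx / (2 * c)            # time step, safely inside the CFL limit
n = 200                      # number of pressure cells

p = np.zeros(n)              # sound pressure at cell centers
v = np.zeros(n + 1)          # particle velocity at cell faces
p[n // 2] = 1.0              # initial pressure impulse at the center

for _ in range(50):
    # Velocity update from the pressure gradient.
    v[1:-1] -= dt / (rho * dx) * (p[1:] - p[:-1])
    # Pressure update from the velocity divergence.
    p -= rho * c**2 * dt / dx * (v[1:] - v[:-1])
```

    In a dense-grid setting such as the one the paper targets, the captured radiation characteristics would be imposed as pressure values on cells like `p` above rather than as a single impulse.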

    Download PDF (1379K)
  • Tong Zhou, Kazuya Yasueda, Ghada Bouattour, Anthimos Georgiadis, Akito ...
    2025 Volume 46 Issue 1 Pages 11-21
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: September 27, 2024
    JOURNAL OPEN ACCESS

    This study introduces bidirectional stepwise-based algorithms designed to optimize loudspeaker array configurations for Multizone Sound Field Reproduction systems. An initial arrangement selection method based on loudspeaker magnitude enhances the optimization process. These algorithms were validated using the Acoustic Contrast Control and Pressure Matching methods across free-field conditions and a comprehensive Room Impulse Response database including various room conditions. Comparative experiments against traditional unidirectional iterative strategies demonstrate that the proposed algorithms significantly outperform existing methods in terms of efficiency and effectiveness, especially in configurations with fewer loudspeakers. For example, in a small meeting room with 16 loudspeakers, the stepwise-based approaches achieved higher acoustic contrast and required substantially fewer iterations than conventional methods. Specifically, optimization efficiency improvements were about 55.2% and 77.8% in Acoustic Contrast Control and 36.7% and 68.6% in Pressure Matching, compared to conventional iteratively adding or removing approaches.
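
    Acoustic Contrast Control, one of the two reproduction methods used above, chooses loudspeaker weights that maximize bright-zone energy relative to dark-zone energy; a minimal sketch with random matrices standing in for measured transfer functions (illustrative only, not the authors' code) is:

```python
import numpy as np

# Acoustic Contrast Control sketch: maximize the bright/dark energy ratio
# via a generalized eigenvalue problem (random stand-in transfer functions).
rng = np.random.default_rng(0)
L, Mb, Md = 8, 12, 12                     # loudspeakers, bright/dark mic points
Gb = rng.standard_normal((Mb, L)) + 1j * rng.standard_normal((Mb, L))
Gd = rng.standard_normal((Md, L)) + 1j * rng.standard_normal((Md, L))

Rb = Gb.conj().T @ Gb                     # bright-zone spatial correlation
Rd = Gd.conj().T @ Gd + 1e-6 * np.eye(L)  # dark-zone correlation, regularized

# Solve Rb q = lambda * Rd q; the dominant eigenvector gives the weights.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Rd, Rb))
q = eigvecs[:, np.argmax(eigvals.real)]

contrast = np.real(q.conj() @ Rb @ q) / np.real(q.conj() @ Rd @ q)
```

    A stepwise configuration search such as the one proposed would repeat this evaluation while adding or removing candidate loudspeakers (columns of `Gb` and `Gd`).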

    Download PDF (997K)
  • Hikaru Miura
    2025 Volume 46 Issue 1 Pages 22-29
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: October 05, 2024
    JOURNAL OPEN ACCESS

    This paper describes the development of a compact ultrasonic vibration source with a transversely vibrating plate that can achieve large displacement amplitudes. An ultrasonic vibration source was designed in which the ultrasonic vibrator, excluding the transducer, was approximately the same length as the transducer (half the wavelength of the longitudinal vibration). To this end, the ultrasonic vibrator was integrated with the transversely vibrating plate and the amplitude-expansion horn. The design method for the integrated ultrasonic vibration source was clarified, and its vibration characteristics were investigated. The source was then used to atomize droplets, demonstrating its practical utility.
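
    The half-wavelength length referred to above follows directly from the longitudinal sound speed in the vibrator material and the drive frequency; a quick calculation with assumed values (not taken from the paper) is:

```python
# Half-wavelength resonator length for a longitudinal-mode ultrasonic vibrator.
# Both numbers below are assumptions for illustration, not the paper's values.
c = 5000.0                       # longitudinal sound speed in the horn [m/s]
f = 28000.0                      # drive frequency [Hz]
half_wavelength = c / (2.0 * f)  # vibrator length [m], roughly 89 mm
```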

    Download PDF (1097K)
ACOUSTICAL LETTERS
—Special Issue on Speech Diversity and Its Applications—
FOREWORD
INVITED PAPERS
  • Kikuo Maekawa
    2025 Volume 46 Issue 1 Pages 45-54
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 09, 2024
    JOURNAL OPEN ACCESS

    Real-time MRI video imaging has had a significant impact on articulatory phonetics. Many new findings have been obtained using this technology, which enables objective observation of the whole vocal tract during speech production, something that had long been accessible only through subjective introspection. In this paper, I introduce the specifications of the "Real-time MRI Articulatory Movement Database (rtMRIDB)" that my colleagues and I developed and its relevance to the study of diversity in Japanese phonetics. Some ongoing technological developments are also introduced.

    Download PDF (926K)
  • Yongwei Li, Aijun Li, Jianhua Tao, Feng Li, Donna Erickson, Masato Aka ...
    2025 Volume 46 Issue 1 Pages 55-63
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 24, 2024
    JOURNAL OPEN ACCESS

    Emotions are usually perceived through multimodal cues in human communication; in recent years, emotions have also been studied from the perspective of dimensional approaches. The contribution of audio and video cues to emotion perception in terms of emotion categories has been relatively extensively investigated, but their contribution to emotion perception in dimensional space remains under-investigated, especially in Mandarin Chinese. In the present study, three psychoacoustic experiments were conducted to investigate the contributions of the audio, visual, and audio-visual modalities to emotion perception in the valence-arousal space. Audio-only, video-only, and audio-video stimuli were presented to native Chinese subjects with normal hearing and vision for perceptual ratings of emotion in the valence and arousal dimensions. Results suggested that (1) different modalities contribute differently to the perception of the valence and arousal dimensions; (2) compared to the video-only modality, the audio-only modality generally decreases arousal and valence at lower levels and increases them at higher levels; (3) the video-only modality plays an important role in separating anger and happiness in the valence space.

    Download PDF (530K)
INVITED REVIEW
  • Koichi Mori
    2025 Volume 46 Issue 1 Pages 64-69
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: June 08, 2024
    JOURNAL OPEN ACCESS

    The aim of this review is to introduce the concept of neurodiversity as applied to developmental stuttering. Since the introduction of the ICF by the WHO in 2001, the social model has been introduced into clinical practice. However, it primarily asks the community to be responsible for the accommodation of persons with disabilities (PDs). In addition to the changes in the legal and legislative environments needed to conform to the United Nations Convention on the Rights of Persons with Disabilities (2006), effective education and advocacy are needed for society to acknowledge and reduce the biases of ableism and the stigma of disabilities. Ableism is the claim that society is for able-bodied and able-minded people. Ableist remarks and behaviors may impact PDs adversely and are called microaggressions. The diversity movement tries to embrace PDs by removing the border between the able and the disabled. The etiology and characteristics of developmental stuttering are described, as well as its neurodiverse and complex nature. Recent advances in the treatment of stuttering without ableism are introduced. Education on and advocacy of (neuro)diversity and inclusion in society are still sorely needed for medical and welfare professionals as well as for the general public.

    Download PDF (136K)
PAPER
  • Hiroki Mori, Hironao Nishino
    2025 Volume 46 Issue 1 Pages 70-77
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 07, 2024
    JOURNAL OPEN ACCESS

    We propose an end-to-end conversational speech synthesis system that allows for flexible control of emotional states defined over emotion dimensions. We extend the Tacotron 2 and VITS architectures to accept emotion dimensions as input. Initially, the model is pre-trained using a large-scale spontaneous speech corpus, followed by fine-tuning using a natural dialogue speech corpus with manually annotated perceived emotion in the form of pleasantness and arousal. Since the pre-training lacks emotion information, we explore two pre-training strategies and demonstrate that applying an emotion dimension estimator before the pre-training enhances emotion controllability. Evaluation of the synthesized speech using VITS yields a mean opinion score of 4 or higher for naturalness. Furthermore, there is a correlation of R=0.53 for pleasantness and R=0.89 for arousal between the given and perceived emotional states. These results underscore the effectiveness of our proposed conversational speech synthesis system with emotion control.

    Download PDF (934K)
TECHNICAL REPORTS
  • Yoshiko Arimoto, Yasuo Horiuchi, Sumio Ohno
    2025 Volume 46 Issue 1 Pages 78-86
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: May 11, 2024
    JOURNAL OPEN ACCESS

    A reliable method of determining the base frequency (Fb) for utterances of various speaking styles is critical to enabling stable command labeling in the Fujisaki model. To achieve stable command labeling for diverse expressions of speech, a linear fitted model was developed using the 10th percentile of F0 of each utterance from three corpora of various speaking styles (read, acted, and spontaneous) as the independent variable to estimate a consistent Fb for each utterance. To assess the robustness of the model for unknown utterances, the model was applied to test data, including both open and corpus-open data not used for model development, and the difference between the estimated Fb and the trained labelers' annotated Fb was calculated. As a result, the obtained estimation model was found to fit the manually labeled Fbs well, exhibiting a small root mean squared error (RMSE) of 0.096 and a high coefficient of determination (R2) of 0.89 for the closed dataset. Moreover, the model also exhibited a small RMSE of 0.091 and a high R2 of 0.92 for the corpus-open dataset. The results revealed that the proposed model can reliably estimate the Fb of utterances with various speaking styles.
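
    The estimation scheme described above, a linear fit from the 10th-percentile F0 of an utterance to its base frequency, can be sketched with synthetic data (illustrative only; the numbers are not from the corpora used in the report):

```python
import numpy as np

# Fit a linear model predicting base frequency Fb from the 10th-percentile
# log F0 of each utterance, then report RMSE and R^2 on held-out data.
rng = np.random.default_rng(1)
log_f0_p10 = rng.uniform(4.3, 5.5, size=200)   # per-utterance feature [log Hz]
fb = 0.9 * log_f0_p10 - 0.2 + rng.normal(0, 0.05, 200)  # synthetic target

train, held_out = slice(0, 150), slice(150, 200)
slope, intercept = np.polyfit(log_f0_p10[train], fb[train], 1)

pred = slope * log_f0_p10[held_out] + intercept
resid = fb[held_out] - pred
rmse = float(np.sqrt(np.mean(resid ** 2)))
r2 = float(1 - np.sum(resid ** 2)
           / np.sum((fb[held_out] - fb[held_out].mean()) ** 2))
```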

    Download PDF (539K)
  • Mizuki Nagano, Yusuke Ijima, Sadao Hiroya
    2025 Volume 46 Issue 1 Pages 87-95
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 01, 2024
    JOURNAL OPEN ACCESS

    The retail industry strives to enhance consumers' willingness to buy through various elements, such as store environment, layout, and advertising. Speech is one of the most effective methods used in advertising, particularly in broadcast advertising. Our previous study indicated that the stimulus-organism-response (SOR) theory, using emotional states, can only partially explain the effect of advertising speech on the willingness to buy, suggesting that emotional states alone are not sufficient to explain this effect. In this study, we conducted an experiment to determine whether adding semantic primitives to the emotion-mediated SOR model could completely mediate the impact of advertising speech on the willingness to buy. During the study, participants listened to speech with modified features (mean fundamental frequency (F0), speech rate, or standard deviation of F0) and rated their willingness to buy the advertised products, as well as their own emotions and semantic primitives. We found that adding semantic primitives as a mediator completely mediates the effect of the standard deviation of F0 in advertising speech on the willingness to buy. These results will be useful for developing speech synthesis methods aimed at increasing people's willingness to buy.

    Download PDF (311K)
ACOUSTICAL LETTERS
  • Shoki Kawanishi, Yuya Chiba, Akinori Ito, Takashi Nose
    2025 Volume 46 Issue 1 Pages 96-99
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: September 25, 2024
    JOURNAL OPEN ACCESS

    Lip syncing is an important technology that enhances the impression of embodied conversational agents. However, no study has addressed how to design the mouth movements of an agent while it is silent. Therefore, this paper investigated how human speakers move their mouths when silent in dialogues. As a result, we found three facts. First, a speaker does not completely close the mouth even when listening to a partner's talk. Second, the degree of mouth opening while talking and listening greatly depends on the speaker. Third, the mouth opening is possibly affected by the next state of the speaker.

    Download PDF (576K)
  • Fumiyoshi Matano, Yuya Tagusari, Takanori Horibe, Junya Koguchi, Masan ...
    2025 Volume 46 Issue 1 Pages 100-102
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: July 05, 2024
    JOURNAL OPEN ACCESS

    State-of-the-art text-to-speech systems have improved in sound quality to the point that MOS evaluation, with its five-point scale, requires an increasingly large number of subjects to detect differences. The MUSHRA method can detect differences in sound quality more precisely than the MOS method because sound qualities are rated on a relative 101-point scale from 0 to 100. However, it has the drawback of requiring a hidden reference and anchors; thus, it cannot detect cases exceeding the hidden reference. Our method, named Taut-MUSHRA, requires no hidden reference or anchors and instead imposes two constraints on the subjects. As a result, compared with the MOS method, our Taut-MUSHRA method could more sensitively detect differences in sound quality.

    Download PDF (246K)
  • Hiroki Mori, Kota Furukawa
    2025 Volume 46 Issue 1 Pages 103-105
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 23, 2024
    JOURNAL OPEN ACCESS

    In this letter, we propose a separate modeling of prosodic and segmental features for everyday conversational speech synthesis, addressing challenges posed by low-quality recordings in the Corpus of Everyday Japanese Conversation (CEJC). Initially, the FastSpeech 2 model is trained on the conversation corpus and subsequently fine-tuned on a corpus for speech synthesis. Experimental results show that this fine-tuning approach enhances synthesis quality while preserving the nuances of everyday conversations.

    Download PDF (240K)
  • Takayuki Arai
    2025 Volume 46 Issue 1 Pages 106-110
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: July 05, 2024
    JOURNAL OPEN ACCESS

    We have developed a prosthetic device for speech sound disorders based on our earlier vocal-tract model. The proposed device mainly consists of a mouthpiece, lip plates, and an imitation tongue. We first estimated the vocal-tract area functions, particularly when the tongue is at the resting position and when it is raised. We then tested the output sounds produced by a human speaker using the device with different configurations of the imitation tongue and open/close gestures of the lip plates. The results showed that, while the prosthetic device produced sounds of only moderate quality, the phrases became more intelligible.

    Download PDF (626K)
  • Hideki Kawahara, Masanori Morise
    2025 Volume 46 Issue 1 Pages 111-115
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: June 13, 2024
    JOURNAL OPEN ACCESS

    We generalized a voice morphing algorithm to handle temporally variable morphing of multiple attributes and multiple instances. The generalized morphing provides a new strategy for investigating speech diversity. However, its complexity and the difficulty of preparation have prevented researchers and students from enjoying its benefits. To address this issue, we introduced a set of interactive tools that make preparation and testing less cumbersome. These tools are integrated into our previously reported interactive tools as extensions. The introduction of the extended tools in graduate-level lessons was successful. Finally, we outline further extensions for exploring complex morphing parameter settings.
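
    The core morphing operation, interpolating an attribute between instances with a temporally variable morphing rate, can be sketched as follows (illustrative; the actual algorithm also handles time alignment and several attributes independently):

```python
import numpy as np

# Morph a log spectral envelope between two instances with a rate that
# varies over time (random envelopes stand in for analyzed speech).
n_frames, n_bins = 100, 64
rng = np.random.default_rng(2)
env_a = rng.uniform(-60, 0, (n_frames, n_bins))  # envelope of instance A [dB]
env_b = rng.uniform(-60, 0, (n_frames, n_bins))  # envelope of instance B [dB]

# Temporally variable morphing rate: 0 -> pure A, 1 -> pure B.
w = np.linspace(0.0, 1.0, n_frames)[:, None]
env_morph = (1 - w) * env_a + w * env_b          # interpolate in the log domain
```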

    Download PDF (738K)
  • Shuhei Imai, Aoi Kanagaki, Takashi Nose, Shogo Fukawa, Akinori Ito
    2025 Volume 46 Issue 1 Pages 116-119
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 23, 2024
    JOURNAL OPEN ACCESS

    This paper proposes Tachylone, a fast end-to-end non-parallel voice conversion (VC) method. In Tachylone, speaker conversion and waveform generation are performed by a single vocoder network. In training, a pre-trained universal neural vocoder is used as the initial model, and the model parameters are updated using the source and target speakers' non-parallel data based on cycle-consistent learning in an end-to-end manner. We compare Tachylone to conventional CycleGAN-based VC with objective and subjective measures and discuss the results.
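
    The cycle-consistent objective mentioned above penalizes the round-trip reconstruction error of a source-to-target-to-source conversion; a minimal sketch with placeholder converters (illustrative, not Tachylone's networks) is:

```python
import numpy as np

# Cycle-consistency loss: convert to the target speaker and back,
# then measure how far the round trip drifts from the input.
def cycle_loss(x, src_to_tgt, tgt_to_src):
    """Mean absolute round-trip reconstruction error."""
    return float(np.mean(np.abs(tgt_to_src(src_to_tgt(x)) - x)))

x = np.array([0.1, -0.3, 0.5])       # stand-in acoustic features
f = lambda v: 2.0 * v                # placeholder source-to-target converter
g = lambda v: 0.5 * v                # placeholder target-to-source converter
loss = cycle_loss(x, f, g)           # an exact inverse pair gives zero loss
```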

    Download PDF (418K)
  • Shogo Fukawa, Takashi Nose, Shuhei Imai, Akinori Ito
    2025 Volume 46 Issue 1 Pages 120-123
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: July 26, 2024
    JOURNAL OPEN ACCESS

    This paper proposes a voice conversion method named SpSiVC that appropriately converts both speech and singing voices with a single model. Since the pitch distributions of speakers differ significantly between speech and singing, voice conversion has mainly been evaluated as separate tasks for speech and for singing. SpSiVC introduces an adaptive F0 loss, which enables conversion that implicitly switches the shift width of the logarithmic F0 according to the type of input voice. We examine the effectiveness of the F0 constraints in objective and subjective evaluations.
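
    For reference, the conventional (non-adaptive) log-F0 conversion that such an adaptive loss generalizes is a fixed linear transform of log-F0 statistics; a sketch (variable names ours, illustrative only) is:

```python
import numpy as np

# Conventional log-F0 conversion: shift and scale the source speaker's
# log-F0 statistics to match the target speaker's.
def convert_f0(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Map a source F0 contour [Hz] to the target speaker's log-F0 range."""
    log_f0 = np.log(f0_src)
    return np.exp(tgt_mean + (log_f0 - src_mean) * tgt_std / src_std)

f0 = np.array([100.0, 120.0, 150.0])              # source contour [Hz]
out = convert_f0(f0, np.log(110.0), 0.2, np.log(220.0), 0.2)
# Equal stds here reduce the transform to a one-octave upward shift.
```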

    Download PDF (239K)