Acoustical Science and Technology

PAPERS

Three-dimensional acoustic simulation using actual radiation characteristics with finite-difference time-domain method

Shota Okubo, Toshiharu Horiuchi

2025, Vol. 46, No. 1, pp. 1-10
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/09/21

DOI: https://doi.org/10.1250/ast.e24.49

Journal, open access

The finite-difference time-domain (FDTD) method has been proposed and used for sound field simulation. To reproduce actual sound wave propagation in sound field simulations, it is necessary to apply the radiation characteristics. With the FDTD method, radiation characteristics can be applied by setting sound pressure in a dense grid arrangement. However, conventional techniques for capturing radiation characteristics use a sparse array of microphones and are considered insufficient for FDTD simulation. Furthermore, the technique required to apply captured acoustic signals in a dense grid arrangement with the FDTD method has not been considered. In this paper, we propose a novel hardware and software system that captures the radiation characteristics for a dense grid arrangement and applies them to the FDTD method, while controlling the sound wave propagation with a non-propagation region. The proposed system yields average differences from measured values of 1.8 dB, 0.04 ms, 700 Hz, and 3.5 dB in sound pressure, propagation time, center frequency, and log-spectral distortion, respectively, which is more accurate than conventional techniques. The results show that this system is useful for improving the accuracy of sound wave propagation reproduction in sound field simulation.
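
For readers unfamiliar with the log-spectral distortion quoted in this abstract, the following is a minimal sketch of one common definition: the root-mean-square difference, in dB, between two power spectra. The abstract does not say which variant the authors used, so the formula, the function name `log_spectral_distortion`, the FFT length, and the synthetic test signals are all assumptions made for illustration.

```python
import numpy as np

def log_spectral_distortion(ref, est, n_fft=1024, eps=1e-12):
    """RMS difference, in dB, between the power spectra of two signals.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    ref_power = np.abs(np.fft.rfft(ref, n_fft)) ** 2
    est_power = np.abs(np.fft.rfft(est, n_fft)) ** 2
    # eps guards against log10(0) in empty frequency bins
    diff_db = 10.0 * np.log10((ref_power + eps) / (est_power + eps))
    return float(np.sqrt(np.mean(diff_db ** 2)))

# Synthetic stand-ins for a measured and a simulated signal (made-up data)
rng = np.random.default_rng(0)
measured = rng.standard_normal(1024)
simulated = 0.5 * measured  # same spectral shape, half the amplitude

lsd_same = log_spectral_distortion(measured, measured)
lsd_scaled = log_spectral_distortion(measured, simulated)
```

Halving the amplitude divides the power in every bin by four, so `lsd_scaled` comes out at roughly 6 dB, while identical signals give zero.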

Stepwise-based optimizing approaches for arrangements of loudspeaker in multi-zone sound field reproduction

Tong Zhou, Kazuya Yasueda, Ghada Bouattour, Anthimos Georgiadis, Akito ...

2025, Vol. 46, No. 1, pp. 11-21
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/09/27

DOI: https://doi.org/10.1250/ast.e24.56

Journal, open access

This study introduces bidirectional stepwise-based algorithms designed to optimize loudspeaker array configurations for Multizone Sound Field Reproduction systems. An initial arrangement selection method based on loudspeaker magnitude enhances the optimization process. These algorithms were validated using the Acoustic Contrast Control and Pressure Matching methods across free-field conditions and a comprehensive Room Impulse Response database including various room conditions. Comparative experiments against traditional unidirectional iterative strategies demonstrate that the proposed algorithms significantly outperform existing methods in terms of efficiency and effectiveness, especially in configurations with fewer loudspeakers. For example, in a small meeting room with 16 loudspeakers, the stepwise-based approaches achieved higher acoustic contrast and required substantially fewer iterations than conventional methods. Specifically, optimization efficiency improvements were about 55.2% and 77.8% in Acoustic Contrast Control and 36.7% and 68.6% in Pressure Matching, compared to conventional approaches that iteratively add or remove loudspeakers.

Transverse vibrating plate ultrasonic vibration source integrated with horn

Hikaru Miura

2025, Vol. 46, No. 1, pp. 22-29
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/10/05

DOI: https://doi.org/10.1250/ast.e24.61

Journal, open access

This paper describes the development of a compact ultrasonic vibration source that has a transverse vibrating plate that can achieve large displacement amplitudes. An ultrasonic vibration source was designed, in which the ultrasonic vibrator excluding the transducer was approximately the same length as the transducer (half the wavelength of the longitudinal vibration). Therefore, the ultrasonic vibrator was integrated with the transverse vibrating plate and the amplitude expansion horn. The design method for integrating the ultrasonic vibration source was clarified, and the vibration characteristics of the vibration source were investigated. The ultrasonic source was used to atomize droplets, demonstrating its practical utility.


ACOUSTICAL LETTERS

Acoustical properties of a triple-leaf structure with microperforated panels

Kimihiro Sakagami, Kaito Katayama

2025, Vol. 46, No. 1, pp. 30-33
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/09/20

DOI: https://doi.org/10.1250/ast.e24.60

Journal, open access

In this Letter, as a fundamental study on acoustic partitions, the sound absorption and transmission of a triple-leaf structure with two microperforated panels (MPPs) and a nonperforated panel between them are theoretically studied. In this structure, resonant transmission as well as sound absorption occurs owing to the effect of MPPs. This may allow the acoustic properties of this type of partition to be tuned. In this Letter, we provide basic insights into the properties of this type of triple-leaf structure.

Improving the harmonic structure of speech spectrum for robust pitch estimation

Husne Ara Chowdhury, Mohammad Shahidur Rahman

2025, Vol. 46, No. 1, pp. 34-37
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/09/27

DOI: https://doi.org/10.1250/ast.e24.69

Journal, open access

The harmonic structure of the speech spectrum is crucial for accurate pitch detection. This study presents a method for enhancing the harmonic structure, leading to robust pitch estimation. After analyzing the speech of four males and four females, the results clearly show that the improved harmonic structure ensures pitch estimation independent of traditional issues.

Coordinate conversions in audio metadata for next-generation audio

Taishi Iwasaki, Hiroki Kubo, Satoshi Oode

2025, Vol. 46, No. 1, pp. 38-42
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/09/14

DOI: https://doi.org/10.1250/ast.e24.77

Journal, open access

Positions of audio objects are described using polar or Cartesian coordinates in audio metadata for next-generation audio. The existing coordinate conversion specified in Rec. ITU-R BS.2127 depends on the specific loudspeaker layout. We propose a coordinate conversion method applicable to any loudspeaker layout and conduct a subjective test to verify it.


—Special Issue on Speech Diversity and Its Applications—

FOREWORD

Special Issue on Speech Diversity and Its Applications

Hiroki Mori

2025, Vol. 46, No. 1, pp. 43-44
Issued: 2025/01/01
Released: 2025/01/01

DOI: https://doi.org/10.1250/ast.e25.001

Journal, open access

INVITED PAPERS

Real-time MRI articulatory movement database and its application to articulatory phonetics

Kikuo Maekawa

2025, Vol. 46, No. 1, pp. 45-54
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/08/09

DOI: https://doi.org/10.1250/ast.e24.22

Journal, open access

Real-time MRI video imaging has had a significant impact on articulatory phonetics. Many new findings have been obtained using this technology, which enables the objective observation of the whole vocal tract during speech production, something that has long been imagined only through subjective retrospection. In this paper, I introduce the specifications of the "Real-time MRI Articulatory Movement Database (rtMRIDB)" that my colleagues and I developed and its relevance to the study of diversity in Japanese phonetics. Some ongoing technological developments are also introduced.

Contributions of audio and visual modalities to perception of Mandarin Chinese emotions in valence-arousal space

Yongwei Li, Aijun Li, Jianhua Tao, Feng Li, Donna Erickson, Masato Aka ...

2025, Vol. 46, No. 1, pp. 55-63
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/08/24

DOI: https://doi.org/10.1250/ast.e24.41

Journal, open access

Emotions are usually perceived through multimodal cues in human communication; in recent years, emotions have been studied from the perspective of dimensional approaches. The role of audio and video cues in emotion perception has been investigated relatively extensively in terms of emotion categories, but their contribution to emotion perception in dimensional space is relatively under-investigated, especially in Mandarin Chinese. In the present study, three psychoacoustic experiments were conducted to investigate the contributions of audio, visual, and audio-visual modalities to emotional perception in the valence and arousal space. Audio-only, video-only, and audio-video modalities were presented to native Chinese subjects with normal hearing and vision for perceptual ratings of emotion in the valence and arousal dimensions. Results suggested that (1) different modalities contribute differently to perceiving valence and arousal dimensions; (2) compared to video-only modality, audio-only modality generally decreases arousal and valence at lower levels, and increases arousal and valence at higher levels; (3) the video-only modality plays an important role in separating anger and happiness emotions in the valence space.


INVITED REVIEW

Developmental stuttering as a neurodiverse speech style

Koichi Mori

2025, Vol. 46, No. 1, pp. 64-69
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/06/08

DOI: https://doi.org/10.1250/ast.e24.37

Journal, open access

The aim of this review is to introduce the concept of neurodiversity as used for developmental stuttering. Since the introduction of the ICF by WHO in 2001, the social model has been introduced into clinical practice. However, it primarily asks the community to be responsible for the accommodation of persons with disabilities (PDs). In addition to the necessity of changes in the legal and legislative environments to conform to the Convention on the Rights of Persons with Disabilities of the United Nations (2006), effective education and advocacy are needed for society to acknowledge and reduce biases of ableism and stigma of disabilities. Ableism is the claim that society is for able-bodied and able-minded people. Ableist remarks and behaviors may impact PDs adversely and are called microaggressions. The diversity movement tries to embrace PDs by removing the border between the able and the disabled. The etiology and characteristics of developmental stuttering are depicted, as well as its neurodiverse and complex nature. The recent advances in the treatment of stuttering without ableism are introduced. Education and advocacy of (neuro)diversity and inclusion in society are still sorely needed for medical and welfare professionals as well as for the general public.


PAPER

End-to-end conversational speech synthesis with controllable emotions in the dimensions of pleasantness and arousal

Hiroki Mori, Hironao Nishino

2025, Vol. 46, No. 1, pp. 70-77
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/08/07

DOI: https://doi.org/10.1250/ast.e24.13

Journal, open access

We propose an end-to-end conversational speech synthesis system that allows for flexible control of emotional states defined over emotion dimensions. We extend the Tacotron 2 and VITS architectures to accept emotion dimensions as input. Initially, the model is pre-trained using a large-scale spontaneous speech corpus, followed by fine-tuning using a natural dialogue speech corpus with manually annotated perceived emotion in the form of pleasantness and arousal. Since the pre-training lacks emotion information, we explore two pre-training strategies and demonstrate that applying an emotion dimension estimator before the pre-training enhances emotion controllability. Evaluation of the synthesized speech using VITS yields a mean opinion score of 4 or higher for naturalness. Furthermore, there is a correlation of R=0.53 for pleasantness and R=0.89 for arousal between the given and perceived emotional states. These results underscore the effectiveness of our proposed conversational speech synthesis system with emotion control.


TECHNICAL REPORTS

Determining the base frequency of the F₀ contour generation model for the diverse expression of speech

Yoshiko Arimoto, Yasuo Horiuchi, Sumio Ohno

2025, Vol. 46, No. 1, pp. 78-86
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/05/11

DOI: https://doi.org/10.1250/ast.e24.05

Journal, open access

A reliable method of determining the base frequency (F_b) for utterances of various speaking styles is critical to enabling stable command labeling in the Fujisaki model. To achieve stable command labeling for diverse expressions of speech, a linear fitted model was developed using the 10th-percentile F₀ of each utterance from three corpora of various speaking styles (read, acted, and spontaneous) as the independent variable to estimate a consistent F_b for each utterance. To assess the robustness of the model for unknown utterances, the model was applied to test data, including both open and corpus-open data not used for the model development, and the difference between the estimated F_b and the trained labelers' annotated F_b was calculated. As a result, the obtained estimation model was found to fit the manually labeled F_b values well, exhibiting a small root mean squared error (RMSE) of 0.096 and a high coefficient of determination (R²) of 0.89 for the closed dataset. Moreover, the model also exhibited a small RMSE of 0.091 and a high R² of 0.92 for the corpus-open dataset. The results revealed that the proposed model can reliably estimate the F_b of utterances with various speaking styles.
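
As a rough illustration of the evaluation described in this abstract, the sketch below fits a straight line with a percentile-based F₀ value as the independent variable and scores it with RMSE and R². All numbers are made-up placeholders, not corpus data, and the helper names (`rmse`, `r_squared`) and the choice of `np.polyfit` are assumptions; the abstract does not describe the authors' fitting procedure in this detail.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between labels and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Placeholder values only: per-utterance 10th-percentile F0 (independent
# variable) and manually labeled F_b (target), in arbitrary units.
p10_f0 = np.array([4.6, 4.8, 5.0, 5.2, 5.4])
labeled_fb = np.array([4.25, 4.41, 4.62, 4.79, 4.98])

# Degree-1 least-squares fit: F_b ~ slope * p10_f0 + intercept
slope, intercept = np.polyfit(p10_f0, labeled_fb, 1)
estimated_fb = slope * p10_f0 + intercept

fit_rmse = rmse(labeled_fb, estimated_fb)
fit_r2 = r_squared(labeled_fb, estimated_fb)
```

In an open-data check like the one the abstract reports, the line would be fitted on one set of utterances and `rmse` and `r_squared` would be computed on utterances that were held out from the fit.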

The influence of semantic primitives in an emotion-mediated willingness to buy model from advertising speech

Mizuki Nagano, Yusuke Ijima, Sadao Hiroya

2025, Vol. 46, No. 1, pp. 87-95
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/08/01

DOI: https://doi.org/10.1250/ast.e24.14

Journal, open access

The retail industry strives to enhance the willingness to buy through various elements, such as store environment, layout, and advertising. Speech is one of the most effective methods used in advertising, particularly in broadcast advertising. Our previous study indicated that the stimulus-organism-response (SOR) theory, using emotional states, can only partially explain the effect of advertising speech on the willingness to buy. This suggests that emotional states alone are not sufficient to explain this effect. In this study, we conducted an experiment to determine whether adding semantic primitives to the emotion-mediated SOR model could completely mediate the impact of advertising speech on the willingness to buy. During the study, participants listened to speech with modified features (mean fundamental frequency (F0), speech rate, or standard deviation of F0) and rated their willingness to buy the advertised products, as well as their own emotions and semantic primitives. We found that adding semantic primitives as a mediator can completely mediate the effect of the standard deviation of F0 in the advertising speech on the willingness to buy. These results will be useful for developing speech synthesis methods aimed at increasing people's willingness to buy.


ACOUSTICAL LETTERS

We open our mouths when we are silent

Shoki Kawanishi, Yuya Chiba, Akinori Ito, Takashi Nose

2025, Vol. 46, No. 1, pp. 96-99
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/09/25

DOI: https://doi.org/10.1250/ast.e24.21

Journal, open access

Lip syncing is an important technology that enhances the impression of embodied conversational agents. However, no study has addressed how to design the mouth movement of an agent while the agent is silent. Therefore, this paper investigated how human speakers move their mouths when silent in dialogues. We found three facts. First, a speaker does not completely close their mouth even when listening to a partner's talk. Second, the degree of mouth opening while talking and listening greatly depends on the speaker. Third, the mouth opening is possibly affected by the next state of the speaker.

Taut-MUSHRA: A MUSHRA-based method without hidden reference and anchors for relative sound quality evaluation

Fumiyoshi Matano, Yuya Tagusari, Takanori Horibe, Junya Koguchi, Masan ...

2025, Vol. 46, No. 1, pp. 100-102
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/07/05

DOI: https://doi.org/10.1250/ast.e24.34

Journal, open access

State-of-the-art text-to-speech systems have improved in sound quality, so MOS evaluation, which uses a five-point scale, requires an increasingly large number of subjects to detect differences. The MUSHRA method can detect differences in sound quality more precisely than the MOS method because sound qualities are rated on a relative 101-point scale from 0 to 100. However, it has the drawback of requiring a hidden reference and anchors; thus, it cannot detect cases exceeding the hidden reference. Our method, named Taut-MUSHRA, requires no hidden reference or anchors and instead imposes two constraints on the subjects. As a result, compared with the MOS method, our Taut-MUSHRA method could more sensitively detect differences in sound quality.

Synthesis of everyday conversational speech based on fine-tuning with a corpus for speech synthesis

Hiroki Mori, Kota Furukawa

2025, Vol. 46, No. 1, pp. 103-105
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/08/23

DOI: https://doi.org/10.1250/ast.e24.35

Journal, open access

In this letter, we propose a separate modeling of prosodic and segmental features for everyday conversational speech synthesis, addressing challenges posed by low-quality recordings in the Corpus of Everyday Japanese Conversation (CEJC). Initially, the FastSpeech 2 model is trained on the conversation corpus and subsequently fine-tuned on a corpus for speech synthesis. Experimental results show that this fine-tuning approach enhances synthesis quality while preserving the nuances of everyday conversations.

A prosthesis for speech sound disorders

Takayuki Arai

2025, Vol. 46, No. 1, pp. 106-110
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/07/05

DOI: https://doi.org/10.1250/ast.e24.40

Journal, open access

We have developed a prosthetic device for speech sound disorders based on our earlier vocal-tract model. The proposed device mainly consists of a mouthpiece, lip plates, and an imitation tongue. We first estimated the vocal-tract area functions, particularly when the tongue is at the resting position and when it is raised. We then tested the output sounds produced by a human speaker using the device with different configurations of the imitation tongue and open/close gestures of the lip plates. The results showed that, while the prosthetic device produced sounds of only moderate quality, the phrases became more intelligible.

Interactive tools for making temporally variable, multiple-attributes, and multiple-instances morphing accessible: Flexible manipulation of divergent speech instances for explorational research and education

Hideki Kawahara, Masanori Morise

2025, Vol. 46, No. 1, pp. 111-115
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/06/13

DOI: https://doi.org/10.1250/ast.e24.43

Journal, open access

We generalized a voice morphing algorithm so that it can handle temporally variable, multiple-attribute, and multiple-instance morphing. The generalized morphing provides a new strategy for investigating speech diversity. However, excessive complexity and the difficulty of preparation have prevented researchers and students from enjoying its benefits. To address this issue, we introduced a set of interactive tools to make preparation and tests less cumbersome. These tools are integrated into our previously reported interactive tools as extensions. The introduction of the extended tools in lessons in graduate education was successful. Finally, we outline further extensions to explore excessively complex morphing parameter settings.

Fast end-to-end non-parallel voice conversion based on speaker-adaptive neural vocoder with cycle-consistent learning

Shuhei Imai, Aoi Kanagaki, Takashi Nose, Shogo Fukawa, Akinori Ito

2025, Vol. 46, No. 1, pp. 116-119
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/08/23

DOI: https://doi.org/10.1250/ast.e24.46

Journal, open access

This paper proposes a fast end-to-end non-parallel voice conversion (VC) method named Tachylone. In Tachylone, speaker conversion and waveform generation are performed by a single vocoder network. In the training of Tachylone, a pre-trained universal neural vocoder is used as the initial model, and the model parameters are updated using source and target speakers' non-parallel data based on cycle-consistent learning in an end-to-end manner. We compare Tachylone to conventional CycleGAN-based VC with objective and subjective measures and discuss the results.

Unified model for voice conversion of speech and singing voice using adaptive pitch constraints

Shogo Fukawa, Takashi Nose, Shuhei Imai, Akinori Ito

2025, Vol. 46, No. 1, pp. 120-123
Issued: 2025/01/01
Released: 2025/01/01
[Advance publication] Released: 2024/07/26

DOI: https://doi.org/10.1250/ast.e24.47

Journal, open access

This paper proposes a voice conversion method named SpSiVC that appropriately converts both speech and singing voices with a single model. Since the distribution of pitch between speakers is significantly different for speech and singing voices, voice conversion has been mainly evaluated as a separate task for speech and singing voice conversion. SpSiVC introduces an adaptive F0 loss, which enables conversion that implicitly switches the shift width of the logarithmic F0 according to the type of input voice. We examine the effectiveness of the F0 constraints in objective and subjective evaluations.

