Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
Volume 46, Issue 1
—Special Issue on Speech Diversity and Its Applications—
Displaying 1-20 of 20 articles from this issue
PAPERS
  • Shota Okubo, Toshiharu Horiuchi
    2025 Volume 46 Issue 1 Pages 1-10
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: September 21, 2024
    JOURNAL OPEN ACCESS

    The finite difference time domain (FDTD) method has been proposed and used for sound field simulation. To reproduce actual sound wave propagation in sound field simulations, it is necessary to apply the radiation characteristics of the source. With the FDTD method, radiation characteristics can be applied by setting sound pressure in a dense grid arrangement. However, conventional techniques for capturing radiation characteristics use a sparse array of microphones and are considered insufficient for FDTD simulation. Furthermore, the technique required to apply captured acoustic signals in a dense grid arrangement with the FDTD method has not been considered. In this paper, we propose a novel hardware and software system that captures the radiation characteristics on a dense grid and applies them to the FDTD method, while controlling the sound wave propagation with a non-propagation region. The proposed system yields average differences from measured values of 1.8 dB in sound pressure, 0.04 ms in propagation time, 700 Hz in center frequency, and 3.5 dB in log-spectral distortion, making it more accurate than conventional techniques. These results show that the system is useful for improving the accuracy of sound wave propagation reproduction in sound field simulation.
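
    The staggered-grid update at the heart of the FDTD method can be sketched in one dimension as follows (an illustrative sketch with assumed grid parameters and a simple impulse source, not the authors' implementation):

```python
import numpy as np

# Minimal 1-D acoustic FDTD sketch on a staggered grid (illustrative only).
c, rho = 343.0, 1.2          # speed of sound [m/s], air density [kg/m^3]
dx = 0.01                    # grid spacing [m]
dt = dx / (2 * c)            # time step, safely inside the CFL limit
n = 200                      # number of pressure cells

p = np.zeros(n)              # sound pressure at cell centers
v = np.zeros(n + 1)          # particle velocity at cell faces
p[n // 2] = 1.0              # initial pressure impulse at the center

for _ in range(50):
    # Velocity update from the pressure gradient.
    v[1:-1] -= dt / (rho * dx) * (p[1:] - p[:-1])
    # Pressure update from the velocity divergence.
    p -= rho * c**2 * dt / dx * (v[1:] - v[:-1])
```

    In a dense-grid setting such as the one the paper targets, the captured radiation characteristics would be imposed as pressure values on cells like `p` above rather than as a single impulse.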

    Download PDF (1379K)
  • Tong Zhou, Kazuya Yasueda, Ghada Bouattour, Anthimos Georgiadis, Akito ...
    2025 Volume 46 Issue 1 Pages 11-21
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: September 27, 2024
    JOURNAL OPEN ACCESS

    This study introduces bidirectional stepwise-based algorithms designed to optimize loudspeaker array configurations for Multizone Sound Field Reproduction systems. An initial arrangement selection method based on loudspeaker magnitude enhances the optimization process. These algorithms were validated using the Acoustic Contrast Control and Pressure Matching methods across free-field conditions and a comprehensive Room Impulse Response database including various room conditions. Comparative experiments against traditional unidirectional iterative strategies demonstrate that the proposed algorithms significantly outperform existing methods in terms of efficiency and effectiveness, especially in configurations with fewer loudspeakers. For example, in a small meeting room with 16 loudspeakers, the stepwise-based approaches achieved higher acoustic contrast and required substantially fewer iterations than conventional methods. Specifically, optimization efficiency improvements were about 55.2% and 77.8% in Acoustic Contrast Control and 36.7% and 68.6% in Pressure Matching, compared to conventional iteratively adding or removing approaches.
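
    Acoustic Contrast Control, one of the two reproduction methods used above, chooses loudspeaker weights that maximize bright-zone energy relative to dark-zone energy; a minimal sketch with random matrices standing in for measured transfer functions (illustrative only, not the authors' code) is:

```python
import numpy as np

# Acoustic Contrast Control sketch: maximize the bright/dark energy ratio
# via a generalized eigenvalue problem (random stand-in transfer functions).
rng = np.random.default_rng(0)
L, Mb, Md = 8, 12, 12                     # loudspeakers, bright/dark mic points
Gb = rng.standard_normal((Mb, L)) + 1j * rng.standard_normal((Mb, L))
Gd = rng.standard_normal((Md, L)) + 1j * rng.standard_normal((Md, L))

Rb = Gb.conj().T @ Gb                     # bright-zone spatial correlation
Rd = Gd.conj().T @ Gd + 1e-6 * np.eye(L)  # dark-zone correlation, regularized

# Solve Rb q = lambda * Rd q; the dominant eigenvector gives the weights.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Rd, Rb))
q = eigvecs[:, np.argmax(eigvals.real)]

contrast = np.real(q.conj() @ Rb @ q) / np.real(q.conj() @ Rd @ q)
```

    A stepwise configuration search such as the one proposed would repeat this evaluation while adding or removing candidate loudspeakers (columns of `Gb` and `Gd`).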

    Download PDF (997K)
  • Hikaru Miura
    2025 Volume 46 Issue 1 Pages 22-29
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: October 05, 2024
    JOURNAL OPEN ACCESS

    This paper describes the development of a compact ultrasonic vibration source with a transversely vibrating plate that can achieve large displacement amplitudes. An ultrasonic vibration source was designed in which the ultrasonic vibrator, excluding the transducer, was approximately the same length as the transducer (half the wavelength of the longitudinal vibration). To this end, the ultrasonic vibrator was integrated with the transversely vibrating plate and the amplitude-expansion horn. The design method for the integrated ultrasonic vibration source was clarified, and its vibration characteristics were investigated. The source was then used to atomize droplets, demonstrating its practical utility.
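
    The half-wavelength length referred to above follows directly from the longitudinal sound speed in the vibrator material and the drive frequency; a quick calculation with assumed values (not taken from the paper) is:

```python
# Half-wavelength resonator length for a longitudinal-mode ultrasonic vibrator.
# Both numbers below are assumptions for illustration, not the paper's values.
c = 5000.0                       # longitudinal sound speed in the horn [m/s]
f = 28000.0                      # drive frequency [Hz]
half_wavelength = c / (2.0 * f)  # vibrator length [m], roughly 89 mm
```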

    Download PDF (1097K)
ACOUSTICAL LETTERS
—Special Issue on Speech Diversity and Its Applications—
FOREWORD
INVITED PAPERS
  • Kikuo Maekawa
    2025 Volume 46 Issue 1 Pages 45-54
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 09, 2024
    JOURNAL OPEN ACCESS

    Real-time MRI video imaging has had a significant impact on articulatory phonetics. Many new findings have been obtained using this technology, which enables objective observation of the whole vocal tract during speech production, something that had long been accessible only through subjective introspection. In this paper, I introduce the specifications of the "Real-time MRI Articulatory Movement Database (rtMRIDB)" that my colleagues and I developed and its relevance to the study of diversity in Japanese phonetics. Some ongoing technological developments are also introduced.

    Download PDF (926K)
  • Yongwei Li, Aijun Li, Jianhua Tao, Feng Li, Donna Erickson, Masato Aka ...
    2025 Volume 46 Issue 1 Pages 55-63
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 24, 2024
    JOURNAL OPEN ACCESS

    Emotions are usually perceived through multimodal cues in human communication; in recent years, emotions have also been studied from the perspective of dimensional approaches. The contribution of audio and video cues to emotion perception in terms of emotion categories has been relatively extensively investigated, but their contribution to emotion perception in dimensional space remains under-investigated, especially in Mandarin Chinese. In the present study, three psychoacoustic experiments were conducted to investigate the contributions of the audio, visual, and audio-visual modalities to emotion perception in the valence-arousal space. Audio-only, video-only, and audio-video stimuli were presented to native Chinese subjects with normal hearing and vision for perceptual ratings of emotion in the valence and arousal dimensions. Results suggested that (1) different modalities contribute differently to the perception of the valence and arousal dimensions; (2) compared to the video-only modality, the audio-only modality generally decreases arousal and valence at lower levels and increases them at higher levels; (3) the video-only modality plays an important role in separating anger and happiness in the valence space.

    Download PDF (530K)
INVITED REVIEW
  • Koichi Mori
    2025 Volume 46 Issue 1 Pages 64-69
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: June 08, 2024
    JOURNAL OPEN ACCESS

    The aim of this review is to introduce the concept of neurodiversity as applied to developmental stuttering. Since the introduction of the ICF by the WHO in 2001, the social model has been introduced into clinical practice. However, it primarily asks the community to be responsible for the accommodation of persons with disabilities (PDs). In addition to the changes in the legal and legislative environments needed to conform to the United Nations Convention on the Rights of Persons with Disabilities (2006), effective education and advocacy are needed for society to acknowledge and reduce the biases of ableism and the stigma of disabilities. Ableism is the claim that society is for able-bodied and able-minded people. Ableist remarks and behaviors may impact PDs adversely and are called microaggressions. The diversity movement tries to embrace PDs by removing the border between the able and the disabled. The etiology and characteristics of developmental stuttering are described, as well as its neurodiverse and complex nature. Recent advances in the treatment of stuttering without ableism are introduced. Education on and advocacy of (neuro)diversity and inclusion in society are still sorely needed for medical and welfare professionals as well as for the general public.

    Download PDF (136K)
PAPER
  • Hiroki Mori, Hironao Nishino
    2025 Volume 46 Issue 1 Pages 70-77
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 07, 2024
    JOURNAL OPEN ACCESS

    We propose an end-to-end conversational speech synthesis system that allows for flexible control of emotional states defined over emotion dimensions. We extend the Tacotron 2 and VITS architectures to accept emotion dimensions as input. Initially, the model is pre-trained using a large-scale spontaneous speech corpus, followed by fine-tuning using a natural dialogue speech corpus with manually annotated perceived emotion in the form of pleasantness and arousal. Since the pre-training lacks emotion information, we explore two pre-training strategies and demonstrate that applying an emotion dimension estimator before the pre-training enhances emotion controllability. Evaluation of the synthesized speech using VITS yields a mean opinion score of 4 or higher for naturalness. Furthermore, there is a correlation of R=0.53 for pleasantness and R=0.89 for arousal between the given and perceived emotional states. These results underscore the effectiveness of our proposed conversational speech synthesis system with emotion control.

    Download PDF (934K)
TECHNICAL REPORTS
  • Yoshiko Arimoto, Yasuo Horiuchi, Sumio Ohno
    2025 Volume 46 Issue 1 Pages 78-86
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: May 11, 2024
    JOURNAL OPEN ACCESS

    A reliable method of determining the base frequency (Fb) for utterances of various speaking styles is critical to enabling stable command labeling in the Fujisaki model. To achieve stable command labeling for diverse expressions of speech, a linear fitted model was developed using the 10th percentile of F0 of each utterance from three corpora of various speaking styles (read, acted, and spontaneous) as the independent variable to estimate a consistent Fb for each utterance. To assess the robustness of the model for unknown utterances, the model was applied to test data, including both open and corpus-open data not used for model development, and the difference between the estimated Fb and the trained labelers' annotated Fb was calculated. As a result, the obtained estimation model was found to fit the manually labeled Fbs well, exhibiting a small root mean squared error (RMSE) of 0.096 and a high coefficient of determination (R2) of 0.89 for the closed dataset. Moreover, the model also exhibited a small RMSE of 0.091 and a high R2 of 0.92 for the corpus-open dataset. The results revealed that the proposed model can reliably estimate the Fb of utterances with various speaking styles.
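
    The estimation scheme described above, a linear fit from the 10th-percentile F0 of an utterance to its base frequency, can be sketched with synthetic data (illustrative only; the numbers are not from the corpora used in the report):

```python
import numpy as np

# Fit a linear model predicting base frequency Fb from the 10th-percentile
# log F0 of each utterance, then report RMSE and R^2 on held-out data.
rng = np.random.default_rng(1)
log_f0_p10 = rng.uniform(4.3, 5.5, size=200)   # per-utterance feature [log Hz]
fb = 0.9 * log_f0_p10 - 0.2 + rng.normal(0, 0.05, 200)  # synthetic target

train, held_out = slice(0, 150), slice(150, 200)
slope, intercept = np.polyfit(log_f0_p10[train], fb[train], 1)

pred = slope * log_f0_p10[held_out] + intercept
resid = fb[held_out] - pred
rmse = float(np.sqrt(np.mean(resid ** 2)))
r2 = float(1 - np.sum(resid ** 2)
           / np.sum((fb[held_out] - fb[held_out].mean()) ** 2))
```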

    Download PDF (539K)
  • Mizuki Nagano, Yusuke Ijima, Sadao Hiroya
    2025 Volume 46 Issue 1 Pages 87-95
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 01, 2024
    JOURNAL OPEN ACCESS

    The retail industry strives to enhance consumers' willingness to buy through various elements, such as store environment, layout, and advertising. Speech is one of the most effective methods used in advertising, particularly in broadcast advertising. Our previous study indicated that the stimulus-organism-response (SOR) theory, using emotional states, can only partially explain the effect of advertising speech on the willingness to buy, suggesting that emotional states alone are not sufficient to explain this effect. In this study, we conducted an experiment to determine whether adding semantic primitives to the emotion-mediated SOR model could completely mediate the impact of advertising speech on the willingness to buy. During the study, participants listened to speech with modified features (mean fundamental frequency (F0), speech rate, or standard deviation of F0) and rated their willingness to buy the advertised products, as well as their own emotions and semantic primitives. We found that adding semantic primitives as a mediator completely mediates the effect of the standard deviation of F0 in advertising speech on the willingness to buy. These results will be useful for developing speech synthesis methods aimed at increasing people's willingness to buy.

    Download PDF (311K)
ACOUSTICAL LETTERS
  • Shoki Kawanishi, Yuya Chiba, Akinori Ito, Takashi Nose
    2025 Volume 46 Issue 1 Pages 96-99
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: September 25, 2024
    JOURNAL OPEN ACCESS

    Lip syncing is an important technology that enhances the impression of embodied conversational agents. However, no study has addressed how to design the mouth movements of an agent while it is silent. Therefore, this paper investigated how human speakers move their mouths when silent in dialogues. As a result, we found three facts. First, a speaker does not completely close the mouth even when listening to a partner's talk. Second, the degree of mouth opening while talking and listening greatly depends on the speaker. Third, the mouth opening is possibly affected by the next state of the speaker.

    Download PDF (576K)
  • Fumiyoshi Matano, Yuya Tagusari, Takanori Horibe, Junya Koguchi, Masan ...
    2025 Volume 46 Issue 1 Pages 100-102
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: July 05, 2024
    JOURNAL OPEN ACCESS

    State-of-the-art text-to-speech systems have improved in sound quality to the point that MOS evaluation, with its five-point scale, requires an increasingly large number of subjects to detect differences. The MUSHRA method can detect differences in sound quality more precisely than the MOS method because sound qualities are rated on a relative 101-point scale from 0 to 100. However, it has the drawback of requiring a hidden reference and anchors; thus, it cannot detect cases exceeding the hidden reference. Our method, named Taut-MUSHRA, requires no hidden reference or anchors and instead imposes two constraints on the subjects. As a result, compared with the MOS method, our Taut-MUSHRA method could more sensitively detect differences in sound quality.

    Download PDF (246K)
  • Hiroki Mori, Kota Furukawa
    2025 Volume 46 Issue 1 Pages 103-105
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 23, 2024
    JOURNAL OPEN ACCESS

    In this letter, we propose a separate modeling of prosodic and segmental features for everyday conversational speech synthesis, addressing challenges posed by low-quality recordings in the Corpus of Everyday Japanese Conversation (CEJC). Initially, the FastSpeech 2 model is trained on the conversation corpus and subsequently fine-tuned on a corpus for speech synthesis. Experimental results show that this fine-tuning approach enhances synthesis quality while preserving the nuances of everyday conversations.

    Download PDF (240K)
  • Takayuki Arai
    2025 Volume 46 Issue 1 Pages 106-110
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: July 05, 2024
    JOURNAL OPEN ACCESS

    We have developed a prosthetic device for speech sound disorders based on our earlier vocal-tract model. The proposed device mainly consists of a mouthpiece, lip plates, and an imitation tongue. We first estimated the vocal-tract area functions, particularly when the tongue is at the resting position and when it is raised. We then tested the output sounds produced by a human speaker using the device with different configurations of the imitation tongue and open/close gestures of the lip plates. The results showed that, while the prosthetic device produced sounds of only moderate quality, the phrases became more intelligible.

    Download PDF (626K)
  • Hideki Kawahara, Masanori Morise
    2025 Volume 46 Issue 1 Pages 111-115
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: June 13, 2024
    JOURNAL OPEN ACCESS

    We generalized a voice morphing algorithm to handle temporally variable morphing of multiple attributes and multiple instances. The generalized morphing provides a new strategy for investigating speech diversity. However, its complexity and the difficulty of preparation have prevented researchers and students from enjoying its benefits. To address this issue, we introduced a set of interactive tools that make preparation and testing less cumbersome. These tools are integrated into our previously reported interactive tools as extensions. The introduction of the extended tools in graduate-level lessons was successful. Finally, we outline further extensions for exploring complex morphing parameter settings.
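
    The core morphing operation, interpolating an attribute between instances with a temporally variable morphing rate, can be sketched as follows (illustrative; the actual algorithm also handles time alignment and several attributes independently):

```python
import numpy as np

# Morph a log spectral envelope between two instances with a rate that
# varies over time (random envelopes stand in for analyzed speech).
n_frames, n_bins = 100, 64
rng = np.random.default_rng(2)
env_a = rng.uniform(-60, 0, (n_frames, n_bins))  # envelope of instance A [dB]
env_b = rng.uniform(-60, 0, (n_frames, n_bins))  # envelope of instance B [dB]

# Temporally variable morphing rate: 0 -> pure A, 1 -> pure B.
w = np.linspace(0.0, 1.0, n_frames)[:, None]
env_morph = (1 - w) * env_a + w * env_b          # interpolate in the log domain
```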

    Download PDF (738K)
  • Shuhei Imai, Aoi Kanagaki, Takashi Nose, Shogo Fukawa, Akinori Ito
    2025 Volume 46 Issue 1 Pages 116-119
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: August 23, 2024
    JOURNAL OPEN ACCESS

    This paper proposes Tachylone, a fast end-to-end non-parallel voice conversion (VC) method. In Tachylone, speaker conversion and waveform generation are performed by a single vocoder network. In training, a pre-trained universal neural vocoder is used as the initial model, and the model parameters are updated using the source and target speakers' non-parallel data based on cycle-consistent learning in an end-to-end manner. We compare Tachylone to conventional CycleGAN-based VC with objective and subjective measures and discuss the results.
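
    The cycle-consistent objective mentioned above penalizes the round-trip reconstruction error of a source-to-target-to-source conversion; a minimal sketch with placeholder converters (illustrative, not Tachylone's networks) is:

```python
import numpy as np

# Cycle-consistency loss: convert to the target speaker and back,
# then measure how far the round trip drifts from the input.
def cycle_loss(x, src_to_tgt, tgt_to_src):
    """Mean absolute round-trip reconstruction error."""
    return float(np.mean(np.abs(tgt_to_src(src_to_tgt(x)) - x)))

x = np.array([0.1, -0.3, 0.5])       # stand-in acoustic features
f = lambda v: 2.0 * v                # placeholder source-to-target converter
g = lambda v: 0.5 * v                # placeholder target-to-source converter
loss = cycle_loss(x, f, g)           # an exact inverse pair gives zero loss
```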

    Download PDF (418K)
  • Shogo Fukawa, Takashi Nose, Shuhei Imai, Akinori Ito
    2025 Volume 46 Issue 1 Pages 120-123
    Published: January 01, 2025
    Released on J-STAGE: January 01, 2025
    Advance online publication: July 26, 2024
    JOURNAL OPEN ACCESS

    This paper proposes a voice conversion method named SpSiVC that appropriately converts both speech and singing voices with a single model. Since the pitch distributions of speakers differ significantly between speech and singing, voice conversion has mainly been evaluated as separate tasks for speech and for singing. SpSiVC introduces an adaptive F0 loss, which enables conversion that implicitly switches the shift width of the logarithmic F0 according to the type of input voice. We examine the effectiveness of the F0 constraints in objective and subjective evaluations.
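
    For reference, the conventional (non-adaptive) log-F0 conversion that such an adaptive loss generalizes is a fixed linear transform of log-F0 statistics; a sketch (variable names ours, illustrative only) is:

```python
import numpy as np

# Conventional log-F0 conversion: shift and scale the source speaker's
# log-F0 statistics to match the target speaker's.
def convert_f0(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Map a source F0 contour [Hz] to the target speaker's log-F0 range."""
    log_f0 = np.log(f0_src)
    return np.exp(tgt_mean + (log_f0 - src_mean) * tgt_std / src_std)

f0 = np.array([100.0, 120.0, 150.0])              # source contour [Hz]
out = convert_f0(f0, np.log(110.0), 0.2, np.log(220.0), 0.2)
# Equal stds here reduce the transform to a one-octave upward shift.
```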

    Download PDF (239K)