Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
Volume 42, Issue 6
Displaying 1-7 of 7 articles from this issue
PAPERS
  • Kento Yoshimoto, Hiroki Kuroda, Daichi Kitahara, Akira Hirabayashi
    2021 Volume 42 Issue 6 Pages 305-313
    Published: November 01, 2021
    Released on J-STAGE: November 01, 2021
    JOURNAL FREE ACCESS

    The present paper proposes a distortion pedal modeling method using WaveNet. A state-of-the-art method constructs a feedforward network by modifying the original autoregressive WaveNet and trains it to minimize a loss function defined by the normalized mean squared error between high-pass filtered outputs. This method works well for pedals with low distortion, but not for those with high distortion. To solve this problem, the proposed method uses the same WaveNet but a novel loss function, defined as a weighted sum of errors in the time and time-frequency (T-F) domains. The error in the time domain is the mean squared error without high-pass filtering, while that in the T-F domain is a divergence between spectral features computed from the short-time Fourier transform. Numerical experiments using a pedal with high distortion, the Ibanez SD9, show that the proposed method precisely reproduces high-frequency components without attenuating low-frequency components, in contrast to the state-of-the-art method.

    Download PDF (816K)
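    The weighted time/T-F loss described in the abstract can be sketched as below. The weight `alpha`, the STFT settings, and the use of an L1 distance between log-magnitude spectrograms as the spectral divergence are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def combined_loss(y_pred, y_true, alpha=0.5):
    """Weighted sum of a time-domain MSE and a T-F-domain divergence.

    The time-domain term is a plain mean squared error (no high-pass
    filtering); the T-F term here is a stand-in log-magnitude L1
    divergence, since the paper's exact divergence is not given in
    the abstract.
    """
    time_term = np.mean((y_pred - y_true) ** 2)
    s_pred, s_true = stft_mag(y_pred), stft_mag(y_true)
    tf_term = np.mean(np.abs(np.log(s_pred + 1e-8) - np.log(s_true + 1e-8)))
    return alpha * time_term + (1.0 - alpha) * tf_term
```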
  • Toshiya Samejima
    2021 Volume 42 Issue 6 Pages 314-325
    Published: November 01, 2021
    Released on J-STAGE: November 01, 2021
    JOURNAL FREE ACCESS

    This paper extends existing nonlinear physical modeling sound synthesis of cymbals to include the dynamics of the washers supporting the center of the cymbal and of the sticks/mallets striking it. The body of a cymbal is physically modeled as a shallow spherical shell, and its governing equation is discretized in space using the finite difference method, as in existing research. In addition, a washer, which determines the support conditions of the cymbal, is modeled as a single-degree-of-freedom vibration system and incorporated into the physical model of the cymbal. Furthermore, a stick/mallet striking the cymbal is modeled as a multi-degree-of-freedom system using the finite element method and coupled with the cymbal vibration. The time derivatives of the total system are discretized using implicit finite difference schemes. Trial numerical calculations demonstrate that the developed method is effective for sound synthesis of the cymbal, capturing the change of timbre due to the dynamics of washers and sticks/mallets.

    Download PDF (1414K)
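    The washer is modeled as a single-degree-of-freedom system integrated with an implicit scheme. A minimal sketch of such a model, stepped with the implicit Newmark average-acceleration method, is shown below; the parameter values and the forcing are illustrative, and the paper's actual coupling to the shell equations is not reproduced:

```python
import numpy as np

def newmark_sdof(m, c, k, force, dt, steps, beta=0.25, gamma=0.5):
    """Implicit Newmark time stepping of m*u'' + c*u' + k*u = force(t).

    beta=1/4, gamma=1/2 is the unconditionally stable average-acceleration
    scheme. Returns the displacement history of the oscillator.
    """
    u = v = 0.0
    a = (force(0.0) - c * v - k * u) / m   # consistent initial acceleration
    hist = np.empty(steps)
    for n in range(1, steps + 1):
        t = n * dt
        u_p = u + dt * v + dt**2 * (0.5 - beta) * a   # displacement predictor
        v_p = v + dt * (1.0 - gamma) * a              # velocity predictor
        # Solve the equation of motion at the new step for the acceleration.
        a = (force(t) - c * v_p - k * u_p) / (m + gamma * dt * c + beta * dt**2 * k)
        u = u_p + beta * dt**2 * a
        v = v_p + gamma * dt * a
        hist[n - 1] = u
    return hist
```

    Under a constant load the scheme should settle to the static deflection force/k, which is a quick sanity check on the implicit update.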
  • Arif Ahmad, Md. Reza Selim, Md. Zafar Iqbal, M. Shahidur Rahman
    2021 Volume 42 Issue 6 Pages 326-332
    Published: November 01, 2021
    Released on J-STAGE: November 01, 2021
    JOURNAL FREE ACCESS

    This paper presents the Shahjalal University of Science and Technology Text-To-Speech Corpus (SUST TTS Corpus), a phonetically balanced speech corpus for Bangla speech synthesis. Owing to the advancement of deep learning techniques, modern speech processing research, such as speech recognition and speech synthesis, is conducted using various deep learning methods. Any state-of-the-art neural TTS system needs a large dataset to be trained efficiently. The lack of such datasets for under-resourced languages like Bangla is a major obstacle to developing TTS systems in those languages. To mitigate this problem and accelerate speech synthesis research in Bangla, we have developed a large-scale, phonetically balanced speech corpus containing more than 30 hours of speech. Our corpus includes 17,357 utterances spoken by a professional voice talent in a soundproof audio laboratory. We ensured that the corpus contains all possible Bangla phonetic units in sufficient quantity, making it a phonetically balanced speech corpus. We describe the process of creating the corpus in this paper. We also train a neural Bangla TTS system with our corpus and obtain a synthetic voice comparable to state-of-the-art TTS systems.

    Download PDF (797K)
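    Verifying that every phonetic unit appears in sufficient quantity is the core of phonetic balancing. A minimal coverage check might look like the following; the function names and data layout are hypothetical, not the corpus pipeline described in the paper:

```python
from collections import Counter

def phone_coverage(utterances, inventory):
    """Check corpus transcripts against a target phone inventory.

    `utterances` is a list of phone-sequence lists (assumed already
    converted by a grapheme-to-phoneme step); `inventory` is the full
    set of phonetic units the corpus should cover. Returns per-phone
    counts and the set of phones still missing from the corpus.
    """
    counts = Counter(p for utt in utterances for p in utt)
    missing = set(inventory) - set(counts)
    return counts, missing
```

    In practice one would also set a minimum count per phone and keep selecting candidate sentences until no phone falls below it.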
  • Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara
    2021 Volume 42 Issue 6 Pages 333-343
    Published: November 01, 2021
    Released on J-STAGE: November 01, 2021
    JOURNAL FREE ACCESS

    Sequence-to-sequence (seq2seq) automatic speech recognition (ASR) has recently achieved state-of-the-art performance with fast decoding and a simple architecture. On the other hand, it requires a large amount of training data and cannot use text-only data for training. In our previous work, we proposed a method for applying text data to seq2seq ASR training by leveraging text-to-speech (TTS). However, we observe that the log Mel-scale filterbank (lmfb) features produced by the Tacotron 2-based model are blurry, particularly along the time dimension. This problem is mitigated by introducing the WaveNet vocoder to generate speech of better quality or spectrograms of better time resolution, which makes it possible to train a waveform-input end-to-end ASR model. Here we use CNN filters and apply a masking method similar to SpecAugment. We compare the waveform-input model with two kinds of lmfb-input models: (1) lmfb features directly generated by TTS, and (2) lmfb features converted from the waveform generated by TTS. Experimental evaluations show that the combination of waveform-output TTS and the waveform-input end-to-end ASR model outperforms the lmfb-input models in two domain adaptation settings.

    Download PDF (872K)
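    The masking method similar to SpecAugment mentioned in the abstract can be sketched as a time-mask on a (time, channel) feature matrix. The mask widths and counts below are illustrative defaults, not the paper's settings:

```python
import numpy as np

def time_mask(features, max_width=20, n_masks=2, rng=None):
    """SpecAugment-style time masking on a (time, channel) feature matrix.

    Zeros out `n_masks` random contiguous spans of frames, each up to
    `max_width` frames wide, and returns a masked copy. The same idea
    applies whether the features come from an lmfb front end or from
    CNN filters over the raw waveform.
    """
    rng = rng or np.random.default_rng()
    masked = features.copy()
    n_frames = masked.shape[0]
    for _ in range(n_masks):
        w = int(rng.integers(0, max_width + 1))          # mask width in frames
        t0 = int(rng.integers(0, max(n_frames - w, 0) + 1))  # mask start frame
        masked[t0:t0 + w, :] = 0.0
    return masked
```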
TECHNICAL REPORT
  • Mao Terashima, Daisuke Morikawa, Parham Mokhtari, Tatsuya Hirahara
    2021 Volume 42 Issue 6 Pages 344-349
    Published: November 01, 2021
    Released on J-STAGE: November 01, 2021
    JOURNAL FREE ACCESS

    This article describes a linear microphone array used to measure head-related impulse responses simultaneously at various radial distances using the reciprocal method. The array consists of miniature 5.8-mm-diameter electret condenser microphones (ECMs) arranged on a boom using 3D-printed microphone holders with pillars. The frequency response of an ECM with the 1-mm-thick band holder increased monotonically above 5 kHz and was 2 dB higher at 20 kHz than that of the bare ECM. When multiple ECMs were arranged on the boom at 200 mm intervals using holders with a 3-mm-diameter pillar, reflections from the pillar and boom were negligible, regardless of whether the ECM height above the boom was 30 mm or 60 mm.

    Download PDF (734K)
ACOUSTICAL LETTERS