Acoustical Science and Technology
Online ISSN : 1347-5177
Print ISSN : 1346-3969
ISSN-L : 0369-4232
Latest Issue
Showing 1-7 of 7 articles from the selected issue
PAPERS
  • Kento Yoshimoto, Hiroki Kuroda, Daichi Kitahara, Akira Hirabayashi
    2021 Volume 42 Issue 6 Pages 305-313
    Published: 2021/11/01
    Released: 2021/11/01
    JOURNAL FREE ACCESS

    The present paper proposes a distortion pedal modeling method using the so-called WaveNet. A state-of-the-art method constructs a feedforward network by modifying the original autoregressive WaveNet and trains it to minimize a loss function defined by the normalized mean squared error between high-pass-filtered outputs. This method works well for pedals with low distortion, but not for those with high distortion. To solve this problem, the proposed method exploits the same WaveNet but with a novel loss function, defined as a weighted sum of errors in the time and time-frequency (T-F) domains. The error in the time domain is defined by the mean squared error without high-pass filtering, while that in the T-F domain is defined by a divergence between spectral features computed from the short-time Fourier transform. Numerical experiments using a pedal with high distortion, the Ibanez SD9, show that, compared with the state-of-the-art method, the proposed method precisely reproduces high-frequency components without attenuating low-frequency components.
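    The idea of combining a time-domain error with a T-F-domain error can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the choice of an L1 log-magnitude spectral distance, the Hann window, and the weighting are assumptions made for illustration.

    ```python
    import numpy as np

    def combined_loss(y_pred, y_true, weight=0.5, frame_len=256, hop=64):
        """Weighted sum of a time-domain MSE and a T-F-domain spectral
        error (illustrative stand-in for the paper's divergence)."""
        # Time-domain term: plain MSE, no high-pass filtering.
        time_err = np.mean((y_pred - y_true) ** 2)

        def stft_mag(x):
            # Magnitude STFT via framed, windowed real FFT.
            n_frames = 1 + (len(x) - frame_len) // hop
            window = np.hanning(frame_len)
            frames = np.stack([x[i * hop : i * hop + frame_len] * window
                               for i in range(n_frames)])
            return np.abs(np.fft.rfft(frames, axis=1))

        # T-F term: mean absolute distance between log-magnitude spectra.
        eps = 1e-8
        tf_err = np.mean(np.abs(np.log(stft_mag(y_pred) + eps)
                                - np.log(stft_mag(y_true) + eps)))
        return time_err + weight * tf_err
    ```

    In training, such a loss would be implemented with differentiable operations in a deep learning framework so that gradients can flow back into the WaveNet.
    
    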

  • Toshiya Samejima
    2021 Volume 42 Issue 6 Pages 314-325
    Published: 2021/11/01
    Released: 2021/11/01
    JOURNAL FREE ACCESS

    This paper is concerned with extending the existing nonlinear physical modeling sound synthesis of cymbals to incorporate the dynamics of the washers supporting the centers of cymbals and of the sticks/mallets striking them. The body of a cymbal is physically modeled as a shallow spherical shell, and its governing equation is discretized in space using the finite difference method, as in existing research. In addition, a washer, which determines the support conditions of the cymbal, is modeled as a single-degree-of-freedom vibration system and incorporated into the physical model of the cymbal. Furthermore, a stick/mallet striking the cymbal is modeled as a multi-degree-of-freedom system using the finite element method and coupled with the cymbal vibration. The time evolution of the total system is discretized using implicit finite difference schemes. Trial numerical calculations demonstrate that the developed method is effective for sound synthesis of the cymbal, capturing the change of timbre due to the dynamics of washers and sticks/mallets.
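    A single-degree-of-freedom vibration system stepped with an implicit scheme, as the washer model is described here, can be sketched as below. This is an illustrative backward Euler example, not the paper's scheme; the mass, damping, and stiffness values and the absence of coupling to the shell are assumptions.

    ```python
    import numpy as np

    def simulate_sdof(m, c, k, force, dt, x0=0.0, v0=0.0):
        """Implicit (backward Euler) time stepping of a mass-damper-spring
        system m*x'' + c*x' + k*x = f(t).

        Backward Euler solves, at each step,
            v[n] = v[n-1] + dt*(f[n] - c*v[n] - k*x[n]) / m
            x[n] = x[n-1] + dt*v[n]
        which reduces to the closed form below; the scheme is
        unconditionally stable, a key property for stiff coupled models.
        """
        n = len(force)
        x = np.zeros(n)
        v = np.zeros(n)
        x[0], v[0] = x0, v0
        denom = m + c * dt + k * dt * dt
        for i in range(1, n):
            v[i] = (m * v[i - 1] + dt * (force[i] - k * x[i - 1])) / denom
            x[i] = x[i - 1] + dt * v[i]
        return x, v
    ```

    In a full cymbal model, the force term would carry the coupling to the shell's finite difference grid at the support point.
    
    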

  • Arif Ahmad, Md. Reza Selim, Md. Zafar Iqbal, M. Shahidur Rahman
    2021 Volume 42 Issue 6 Pages 326-332
    Published: 2021/11/01
    Released: 2021/11/01
    JOURNAL FREE ACCESS

    This paper presents the Shahjalal University of Science and Technology Text-To-Speech Corpus (SUST TTS Corpus), a phonetically balanced speech corpus for Bangla speech synthesis. Owing to advances in deep learning techniques, modern speech processing research, such as speech recognition and speech synthesis, is conducted with various deep learning methods. Any state-of-the-art neural TTS system needs a large dataset to be trained efficiently. The lack of such datasets for under-resourced languages like Bangla is a major obstacle to developing TTS systems in those languages. To mitigate this problem and accelerate speech synthesis research in Bangla, we have developed a large-scale, phonetically balanced speech corpus containing more than 30 hours of speech. Our corpus includes 17,357 utterances spoken by a professional voice talent in a sound-proof audio laboratory. We ensure that the corpus contains all possible Bangla phonetic units in sufficient amounts, making it a phonetically balanced speech corpus. We describe the process of creating the corpus in this paper. We also train a neural Bangla TTS system with our corpus and obtain a synthetic voice comparable to that of state-of-the-art TTS systems.
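    Verifying phonetic balance amounts to counting phonetic units across the corpus and checking that every unit of the inventory occurs often enough. The following is a hedged sketch of such a check; the tokenized-utterance representation, the `inventory` set, and the `min_count` threshold are hypothetical, not taken from the paper.

    ```python
    from collections import Counter

    def phoneme_coverage(utterances, inventory, min_count=50):
        """Count phonetic units over tokenized utterances and report
        which inventory units fall below the target count.

        utterances: iterable of lists of phoneme symbols.
        inventory:  set of phoneme symbols that must be covered.
        Returns (counts, under_represented).
        """
        counts = Counter(p for utt in utterances for p in utt)
        under_represented = {p for p in inventory if counts[p] < min_count}
        return counts, under_represented
    ```

    Sentences for recording could then be selected greedily to drive the under-represented set toward empty.
    
    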

  • Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara
    2021 Volume 42 Issue 6 Pages 333-343
    Published: 2021/11/01
    Released: 2021/11/01
    JOURNAL FREE ACCESS

    Sequence-to-sequence (seq2seq) automatic speech recognition (ASR) has recently achieved state-of-the-art performance with fast decoding and a simple architecture. On the other hand, it requires a large amount of training data and cannot use text-only data for training. In our previous work, we proposed a method for applying text data to seq2seq ASR training by leveraging text-to-speech (TTS). However, we observed that the log Mel-scale filterbank (lmfb) features produced by a Tacotron 2-based model are blurry, particularly along the time dimension. This problem is mitigated by introducing the WaveNet vocoder to generate speech of better quality or spectrograms of better time resolution. This makes it possible to train waveform-input end-to-end ASR. Here we use CNN filters and apply a masking method similar to SpecAugment. We compare the waveform-input model with two kinds of lmfb-input models: (1) lmfb features directly generated by TTS, and (2) lmfb features converted from the waveform generated by TTS. Experimental evaluations show that the combination of waveform-output TTS and the waveform-input end-to-end ASR model outperforms the lmfb-input models in two domain adaptation settings.
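    SpecAugment-style masking zeroes out contiguous stretches of the feature sequence during training. The sketch below illustrates time masking only; the mask widths, the number of masks, and the zero fill value are assumptions for illustration, and the paper's masking applied to waveform-derived features may differ in detail.

    ```python
    import numpy as np

    def time_mask(features, max_width=20, n_masks=2, rng=None):
        """Apply SpecAugment-style time masking to a (time, channels)
        feature array: zero out `n_masks` random frame spans, each
        1..max_width frames wide."""
        rng = rng if rng is not None else np.random.default_rng(0)
        out = features.copy()
        t = out.shape[0]
        for _ in range(n_masks):
            w = int(rng.integers(1, max_width + 1))   # span width
            start = int(rng.integers(0, t - w + 1))   # span start frame
            out[start:start + w, :] = 0.0
        return out
    ```

    The same routine works whether the rows are lmfb frames or frame-level outputs of waveform-input CNN filters, since it only assumes a (time, channels) layout.
    
    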

TECHNICAL REPORT
ACOUSTICAL LETTERS