Mechanical Engineering Journal
Online ISSN : 2187-9745
ISSN-L : 2187-9745
Dynamics & Control, Robotics & Mechatronics
Deep encoder and decoder for time-domain speech separation
Kohei TAKAHASHI, Toshihiko SHIRAISHI

2023 Volume 10 Issue 5 Pages 23-00124

Abstract

Previous research on speech separation has significantly improved separation performance with the time-domain approach, which consists of an encoder, a separator, and a decoder. Most of this research has focused on revising the architecture of the separator, while a single 1-D convolution layer and a single 1-D transposed convolution layer have been used as the encoder and decoder, respectively. This study proposes deep encoder and decoder architectures, consisting of stacked 1-D convolution layers, 1-D transposed convolution layers, or residual blocks, for time-domain speech separation. The intention of revising them is to improve separation performance and, by enhancing their mapping ability, to overcome the tradeoff between separation performance and computational cost caused by their stride. We applied them to Conv-TasNet, a typical model in time-domain speech separation. Our results indicate that better separation performance is achieved as the number of encoder and decoder layers increases, and that increasing the number of layers from 1 to 12 yields more than a 1 dB improvement in SI-SDR on WSJ0-2mix. Additionally, the results suggest that the encoder and decoder should be made deeper in accordance with their stride, since their task may become more difficult as the stride increases. This study demonstrates the importance of improving these architectures as well as separators.
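As a rough illustration of the idea described above, the sketch below replaces the single-layer encoder and decoder of a Conv-TasNet-style pipeline with stacks of 1-D convolution and transposed convolution layers. It is a minimal PyTorch-style sketch; the layer count, channel width, kernel size, and stride are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of a deep encoder/decoder for time-domain separation.
# All hyperparameters below (n_filters, kernel_size, stride, n_layers)
# are assumptions for illustration, not the authors' exact settings.
import torch
import torch.nn as nn


class DeepEncoder(nn.Module):
    """Stacked 1-D convolutions in place of a single-layer encoder."""

    def __init__(self, n_filters=512, kernel_size=16, stride=8, n_layers=12):
        super().__init__()
        layers = [nn.Conv1d(1, n_filters, kernel_size, stride=stride), nn.ReLU()]
        # Additional stride-1 layers deepen the waveform-to-feature mapping.
        for _ in range(n_layers - 1):
            layers += [nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, mixture):           # mixture: (batch, 1, time)
        return self.net(mixture)          # -> (batch, n_filters, frames)


class DeepDecoder(nn.Module):
    """Stacked layers ending in a 1-D transposed convolution back to waveform."""

    def __init__(self, n_filters=512, kernel_size=16, stride=8, n_layers=12):
        super().__init__()
        layers = []
        for _ in range(n_layers - 1):
            layers += [nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.ReLU()]
        layers.append(nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride))
        self.net = nn.Sequential(*layers)

    def forward(self, masked_features):   # (batch, n_filters, frames)
        return self.net(masked_features)  # -> (batch, 1, time)


if __name__ == "__main__":
    x = torch.randn(2, 1, 32000)          # two 4-second mixtures at 8 kHz
    enc, dec = DeepEncoder(), DeepDecoder()
    features = enc(x)                     # a separator (e.g. a TCN) would mask these
    estimate = dec(features)
    print(features.shape, estimate.shape)
```

In this sketch, only the first encoder layer and the last decoder layer use the stride, so the frame rate (and hence the separator's computational cost) is unchanged while the encoder and decoder mappings become deeper.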

© 2023 The Japan Society of Mechanical Engineers

This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
https://creativecommons.org/licenses/by-nc-nd/4.0/