Paper ID: 2024EDL8099
Multi-resolution spectrum feature analysis has demonstrated superior performance over traditional single-resolution methods in speech enhancement. However, previous multi-resolution methods typically make only limited use of multi-resolution features, and some suffer from high model complexity. In this paper, we propose a lighter-weight method that fully leverages multi-resolution spectrum features. Our approach is based on a convolutional recurrent network (CRN) and employs a low-complexity multi-resolution spectrum fusion (MRSF) block to process and fuse multi-resolution noisy spectrum information. We also improve the existing encoder-decoder structure, enabling the model to extract and analyze multi-resolution features more effectively. Furthermore, we adopt the short-time discrete cosine transform (STDCT) for time-frequency transformation; because its coefficients are real-valued, this avoids the phase estimation problem. To optimize our model, we design a multi-resolution STDCT loss function. Experiments demonstrate that the proposed multi-resolution STDCT-based CRN (MRCRN) achieves excellent performance and outperforms current advanced systems.
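To make the idea of a multi-resolution STDCT loss concrete, the following is a minimal sketch in plain Python and not the loss defined in the paper. The frame lengths, hop sizes, sine window, L1 distance, and equal weighting across resolutions are all illustrative assumptions, and the function names are hypothetical. The sketch shows that each STDCT frame is a vector of real DCT-II coefficients, so the spectra can be compared directly without a separate phase term.

```python
import math


def dct_ii(frame):
    """Orthonormal DCT-II of one real-valued frame (direct O(N^2) form)."""
    n = len(frame)
    coeffs = []
    for k in range(n):
        acc = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                  for i, x in enumerate(frame))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * acc)
    return coeffs


def stdct(signal, frame_len, hop):
    """Short-time DCT: window each frame, then apply the DCT-II.

    Returns a list of frames, each a list of real coefficients.
    The sine window is an assumption made for this sketch.
    """
    window = [math.sin(math.pi * (i + 0.5) / frame_len)
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append(dct_ii(segment))
    return frames


def multi_resolution_stdct_loss(estimate, target,
                                resolutions=((64, 32), (128, 64), (256, 128))):
    """Mean absolute STDCT error, averaged over several resolutions.

    `resolutions` holds (frame_len, hop) pairs; the values are
    illustrative assumptions, not settings taken from the paper.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    total = 0.0
    for frame_len, hop in resolutions:
        est = stdct(estimate, frame_len, hop)
        ref = stdct(target, frame_len, hop)
        count = sum(len(f) for f in ref)
        if count == 0:
            raise ValueError("signal is shorter than one frame")
        err = sum(abs(a - b)
                  for fe, fr in zip(est, ref)
                  for a, b in zip(fe, fr))
        total += err / count
    return total / len(resolutions)
```

In a training setup the same computation would typically be expressed with a differentiable tensor library so that gradients can flow back to the network; the pure-Python form here is only meant to expose the structure of the loss.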