Article ID: 2025EAP1011
In recent years, the performance of phase-aware neural networks for speech enhancement has improved steadily. However, processing complex-valued Short-Time Fourier Transform (STFT) spectrograms requires complex-valued operations and phase estimation, which increases model complexity and the number of parameters. To address this, we build upon DCTCRN and adopt real-valued Short-Time Discrete Cosine Transform (STDCT) spectrograms as input features, which avoids explicit phase estimation and the need to model amplitude-phase relationships. To strengthen the skip connections without adding parameters, we incorporate the SimAM attention mechanism. We further insert dual-path RNN modules between the encoder and decoder to capture long-range dependencies along both the time and frequency dimensions, and we introduce Hardtanh as a new scaling function. Comparative experiments and ablation studies confirm the effectiveness of the STDCT spectrograms, the attention mechanism, and the Hardtanh scaling function. Our approach achieves more competitive objective performance metrics than recent speech enhancement models while maintaining a relatively low parameter count, thereby raising the performance ceiling of the DCTCRN family of models.
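For reference, the sketch below illustrates how a parameter-free SimAM-style attention could reweight an encoder feature map before it is passed through a skip connection. It is a minimal example assuming PyTorch and a (batch, channels, time, frequency) tensor layout; the function name, default e_lambda, and example shapes are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def simam(x: torch.Tensor, e_lambda: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM attention over a (batch, channels, T, F) feature map.

    Each time-frequency bin is reweighted by a sigmoid of its inverse energy,
    so the skip connection gains attention without any learnable parameters.
    """
    b, c, t, f = x.shape
    n = t * f - 1
    # Squared deviation from the per-channel mean over the time-frequency plane
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    # Per-channel variance estimate
    v = d.sum(dim=(2, 3), keepdim=True) / n
    # Inverse energy: larger values mark more distinctive bins
    e_inv = d / (4 * (v + e_lambda)) + 0.5
    return x * torch.sigmoid(e_inv)

# Example (hypothetical shapes): reweight an encoder output before it is
# concatenated with the corresponding decoder input via the skip connection.
enc_feat = torch.randn(1, 64, 100, 32)  # (batch, channels, time, frequency)
skip = simam(enc_feat)
```

Because the attention weights are derived purely from the statistics of the feature map itself, this kind of module is consistent with the abstract's claim of enhancing skip connections without increasing the parameter count.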