Paper ID: 2024EDP7158
The rapid advancement of autonomous driving has heightened safety concerns, making a comprehensive approach to safe navigation essential. Multi-modal methods for 3D object detection play a critical role in enhancing driving safety by integrating data from different sensor types. However, existing methods face challenges such as feature misalignment and feature loss, which can lead to overfitting and undermine perception performance and reliability. Building on findings that excluding direct camera-branch features from the regression task can improve detection performance, this paper examines the detection pipeline in greater depth and introduces a novel multi-modal 3D object detection approach. The proposed approach first introduces an attention-based module that aligns features across modalities, enhancing feature fusion through channel and spatial attention mechanisms. An image-guided feature candidate generation strategy is then employed to identify candidate regions within the fused features. These features are subsequently split into two distinct branches for the regression and classification tasks, which are processed by the detection heads. This design reduces the model's dependence on precise depth estimation from the image branch and mitigates the impact of sensor calibration errors. Experimental results validate that the proposed method delivers strong detection performance. Notably, our best model achieves a competitive 71.0% mAP and 74.0% NDS while remaining robust in scenarios with missing camera data, underscoring its ability to handle complex real-world situations.
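To make the channel- and spatial-attention fusion step concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes the LiDAR and camera branches already produce feature maps at a common resolution; the module and tensor names (`AttentionFusion`, `lidar_feat`, `cam_feat`) and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Sketch: fuse LiDAR and camera features via channel and spatial attention."""

    def __init__(self, lidar_channels: int, cam_channels: int, reduction: int = 16):
        super().__init__()
        fused = lidar_channels + cam_channels
        # Channel attention: squeeze spatial dimensions, then re-weight channels.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(fused // reduction, fused, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: compress channels, then re-weight spatial locations.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Project the attended features back to the LiDAR channel width.
        self.out_proj = nn.Conv2d(fused, lidar_channels, kernel_size=1)

    def forward(self, lidar_feat: torch.Tensor, cam_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([lidar_feat, cam_feat], dim=1)      # (B, C_l + C_c, H, W)
        x = x * self.channel_att(x)                        # channel re-weighting
        avg_map = x.mean(dim=1, keepdim=True)              # (B, 1, H, W)
        max_map = x.max(dim=1, keepdim=True).values        # (B, 1, H, W)
        x = x * self.spatial_att(torch.cat([avg_map, max_map], dim=1))
        return self.out_proj(x)                            # fused features


# Example usage with hypothetical feature shapes.
fusion = AttentionFusion(lidar_channels=256, cam_channels=80)
fused = fusion(torch.randn(2, 256, 180, 180), torch.randn(2, 80, 180, 180))
print(fused.shape)  # torch.Size([2, 256, 180, 180])
```

In this sketch the fused output keeps the LiDAR channel width, which is consistent with the idea that downstream regression should not depend directly on camera-branch features; the actual branch separation and detection heads described in the abstract are not shown here.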