2025 Volume 12 Issue 2 Pages 76-86
Semantic segmentation is an important technique in applications such as autonomous driving, medical imaging, and industrial inspection. Depth estimation, a key component of scene understanding, recovers depth information from RGB images alone; in recent years, such depth information has been used as an auxiliary feature to support the semantic segmentation task. This study proposes a Simultaneous Fusion Network (SF-Net) that jointly learns semantic segmentation and depth estimation from a monocular camera image. Features are first extracted and strengthened by injecting contextual information from semantic labels through a feature reinforcement module; the two tasks are then learned simultaneously by analyzing the imaging process to establish the relationship between the size and depth of objects in the image, and a new loss function is formulated from this geometric relationship. Furthermore, a feature fusion module fuses image features in the parts shared by the depth estimation and semantic segmentation tasks. Through simultaneous learning, semantic segmentation accuracy is improved by exploiting the depth information inferred by the depth estimation branch. Experiments on the Cityscapes and NYUDv2 datasets verify the effectiveness of the proposed method.
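The joint-learning setup described above (a shared backbone feeding a segmentation head and a depth head, trained with a combined loss) can be illustrated with a minimal sketch. This is a toy NumPy stand-in under assumed shapes and a hypothetical weighting factor `lam`; it does not reproduce SF-Net's feature reinforcement module, feature fusion module, or geometry-based loss term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch of 4 feature vectors (stand-ins for image features).
# All layer sizes here are illustrative, not taken from the paper.
x = rng.standard_normal((4, 16))

# Shared encoder: one linear layer + ReLU as a stand-in for a CNN backbone.
W_enc = rng.standard_normal((16, 8)) * 0.1
shared = np.maximum(x @ W_enc, 0.0)

# Task-specific heads: segmentation logits (3 classes) and scalar depth.
W_seg = rng.standard_normal((8, 3)) * 0.1
W_dep = rng.standard_normal((8, 1)) * 0.1
seg_logits = shared @ W_seg
depth_pred = shared @ W_dep

# Dummy targets for the two tasks.
seg_target = np.array([0, 1, 2, 1])
depth_target = rng.standard_normal((4, 1))

# Cross-entropy loss for segmentation (numerically stable softmax).
z = seg_logits - seg_logits.max(axis=1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
loss_seg = -np.log(probs[np.arange(4), seg_target]).mean()

# L1 loss for depth estimation.
loss_depth = np.abs(depth_pred - depth_target).mean()

# Joint loss: both tasks are optimized simultaneously; lam is a
# hypothetical balancing weight (the paper's geometric loss term
# would enter here as an additional component).
lam = 0.5
loss = loss_seg + lam * loss_depth
print(float(loss) > 0.0)
```

The point of the sketch is the structure, not the numbers: both heads read the same shared representation, so gradients from the depth loss also shape the features used by the segmentation head, which is the mechanism by which joint training can improve segmentation accuracy.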