Paper ID: 2024EDP7125
Weakly Supervised Semantic Segmentation (WSSS) aims to train models that identify and delineate objects within an image using only weak annotations, such as image-level labels. While recent works mainly focus on exploiting class-specific knowledge to improve the quality of class activation maps, we contend that relying solely on this approach within a non-hierarchical architecture fails to adequately capture the structural relationships within images. Drawing inspiration from fully supervised semantic segmentation designs, which use hierarchical multi-scale feature maps to predict dense masks, we propose the Structural Relation Multi-class Token Transformer (SR-MCT), a novel architecture for WSSS. The model employs multi-scale structural tokens, generated by a Spatial Prior Module (SPM), which interact not only with patch tokens to encode structural relations, but also with multi-class tokens to integrate class-specific knowledge into complex structural embeddings. The proposed Structural Relation Multi-class Token Attention builds long-range dependencies among structural tokens, patch tokens, and multi-class tokens simultaneously. Experimental results and ablation studies on PASCAL VOC 2012 and MS COCO 2014 demonstrate that SR-MCT improves on the baseline and outperforms other state-of-the-art methods.
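To make the idea of attending over the three token types simultaneously more concrete, the following is a minimal sketch and not the SR-MCT implementation. It assumes that the joint attention can be illustrated as single-head scaled dot-product self-attention over the concatenation of multi-class, structural, and patch tokens. The function names, token counts, embedding size, and random projection weights are all hypothetical, and the SPM and the multi-scale hierarchy are not modeled.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def joint_token_attention(class_tokens, struct_tokens, patch_tokens, w_q, w_k, w_v):
    """Single-head self-attention over all three token groups at once.

    Assumption: the groups are simply concatenated, so every token can
    attend to every other token regardless of its group.
    """
    tokens = class_tokens + struct_tokens + patch_tokens
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(w_q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    out = matmul(weights, v)
    n_c, n_s = len(class_tokens), len(struct_tokens)
    # Split the output back into the three groups.
    return out[:n_c], out[n_c:n_c + n_s], out[n_c + n_s:], weights


def rand_matrix(rows, cols):
    """Hypothetical stand-in for learned embeddings or projection weights."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
DIM = 4  # illustrative embedding size
cls_out, struct_out, patch_out, weights = joint_token_attention(
    rand_matrix(2, DIM),  # 2 multi-class tokens
    rand_matrix(3, DIM),  # 3 structural tokens
    rand_matrix(4, DIM),  # 4 patch tokens
    rand_matrix(DIM, DIM), rand_matrix(DIM, DIM), rand_matrix(DIM, DIM),
)
print(len(cls_out), len(struct_out), len(patch_out))
```

Each row of `weights` spans all nine tokens, so in this sketch a class token's update mixes information from structural and patch tokens in a single step. That is the sense in which dependencies among the three groups are built simultaneously rather than in separate attention stages.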