Structural Relation Multi-class Token Transformer for Weakly Supervised Semantic Segmentation

Dingjie PENG; Wataru KAMEYAMA

doi:10.1587/transinf.2024EDP7125

Abstract

Weakly Supervised Semantic Segmentation (WSSS) aims to train models that identify and delineate objects within an image using only limited supervision, such as image-level labels. While recent works mainly focus on exploring class-specific knowledge to improve the quality of class activation maps, we contend that relying solely on this approach within a non-hierarchical architecture fails to adequately capture the structural relationships within images. Drawing inspiration from fully supervised semantic segmentation designs, which use hierarchical multi-scale feature maps to predict dense masks, we propose a novel architecture that integrates a Structural Relation Multi-class Token Transformer (SR-MCT) into WSSS. The model employs multi-scale structural tokens generated by a Spatial Prior Module (SPM). These tokens interact not only with patch tokens, to encode structural relations, but also with multi-class tokens, to integrate class-specific knowledge into complex structural embeddings. The proposed Structural Relation Multi-class Token Attention builds long-range dependencies among structural tokens, patch tokens, and multi-class tokens simultaneously. Experimental results and ablation studies on PASCAL VOC 2012 and MS COCO 2014 demonstrate that the proposed SR-MCT improves on the baseline and outperforms other state-of-the-art methods.
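
To make the idea of attending over structural, patch, and multi-class tokens "simultaneously" concrete, the following is a minimal sketch and not the authors' implementation. It assumes the three token groups are simply concatenated and passed through one single-head scaled dot-product self-attention with identity query/key/value projections. The abstract does not specify the actual attention design or the SPM, so every function name, size, and the random toy tokens here are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(class_tokens, struct_tokens, patch_tokens):
    """Single-head scaled dot-product self-attention over all token groups.

    Assumption: identity query/key/value projections keep the sketch short;
    a real transformer layer would use learned projections and several heads.
    """
    tokens = class_tokens + struct_tokens + patch_tokens  # concatenate groups
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    # attn[i][j]: how strongly token i attends to token j.
    attn = [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in tokens])
        for q in tokens
    ]
    # Each output token is the attention-weighted sum of all tokens.
    out = [
        [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)]
        for weights in attn
    ]
    return out, attn


def class_to_patch_attention(attn, n_class, n_struct):
    """Slice the attention matrix: rows are class tokens, columns are patches."""
    first_patch = n_class + n_struct
    return [row[first_patch:] for row in attn[:n_class]]


# Toy sizes (hypothetical): 3 classes, 2 structural tokens, 4 patches, width 8.
rng = random.Random(0)


def random_tokens(n, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


n_class, n_struct, n_patch, dim = 3, 2, 4, 8
out, attn = joint_self_attention(
    random_tokens(n_class, dim),
    random_tokens(n_struct, dim),
    random_tokens(n_patch, dim),
)
cls2patch = class_to_patch_attention(attn, n_class, n_struct)
```

In this sketch every token can attend to every other token in one step, which is one simple way to obtain long-range dependencies across the three groups. The `cls2patch` slice holds one row of patch weights per class; reshaped to the patch grid, such rows could be inspected as coarse class-specific localization cues, under the assumptions stated above.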


Notice from the publisher

PPV is available from https://globals.ieice.org/en_transactions/information
