Paper ID: 2025EDL8031
Semi-supervised video object segmentation (SVOS) is a challenging task in which the mask of an initial frame is used to predict the segmentation of target objects in subsequent frames. Recently, various VOS methods have combined matching-based transductive inference with online inductive learning to capture more precise spatiotemporal information and thereby improve segmentation accuracy. However, although these methods strengthen feature extraction, they do not fully fuse the different features they produce, leaving feature utilization inefficient. To address this inefficiency in feature fusion for SVOS, we propose an adaptive multi-feature fusion method in this letter. The method introduces a Foreground-Background Multi-feature Encoder to enrich feature diversity, and a Multi-feature Fusion Module to dynamically integrate spatiotemporal cues from both the foreground and the background. For each segmentation target, a Feature Fusion Reader autonomously selects and adaptively fuses multiple foreground-background features, optimizing across features and significantly improving target-specific fusion efficiency. Extensive experiments on the DAVIS 2017 and large-scale YouTube-VOS 2018/2019 datasets demonstrate that the proposed method achieves state-of-the-art performance.
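To make the idea of adaptive foreground-background fusion concrete, the following is a minimal PyTorch sketch of one plausible realization: a lightweight gating branch predicts per-pixel weights over the input feature maps, and the fused output is their weighted sum. The class and tensor shapes here are illustrative assumptions, not the authors' actual Feature Fusion Reader or Multi-feature Fusion Module.

```python
import torch
import torch.nn as nn


class AdaptiveMultiFeatureFusion(nn.Module):
    """Illustrative sketch (assumed design, not the paper's implementation):
    adaptively fuse N feature maps, e.g. foreground and background cues,
    via per-pixel softmax gating."""

    def __init__(self, channels: int, num_features: int = 2):
        super().__init__()
        self.num_features = num_features
        # Gating branch: concatenated features -> one weight map per feature.
        self.gate = nn.Sequential(
            nn.Conv2d(channels * num_features, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_features, kernel_size=1),
        )

    def forward(self, features):
        # features: list of num_features tensors, each of shape (B, C, H, W)
        stacked = torch.stack(features, dim=1)                 # (B, N, C, H, W)
        weights = self.gate(torch.cat(features, dim=1))        # (B, N, H, W)
        weights = torch.softmax(weights, dim=1).unsqueeze(2)   # (B, N, 1, H, W)
        return (weights * stacked).sum(dim=1)                  # (B, C, H, W)


if __name__ == "__main__":
    fusion = AdaptiveMultiFeatureFusion(channels=64, num_features=2)
    fg = torch.randn(1, 64, 30, 30)  # hypothetical foreground cue
    bg = torch.randn(1, 64, 30, 30)  # hypothetical background cue
    out = fusion([fg, bg])
    print(out.shape)  # torch.Size([1, 64, 30, 30])
```

Because the gate is conditioned on all input features jointly and normalized per pixel, the fusion weights can vary with the segmentation target, which is the general behavior the abstract attributes to its adaptive fusion components.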