Paper ID: 2025EDL8031
Semi-supervised video object segmentation (SVOS) is a challenging task in which the mask of an initial frame is used to predict the segmentation of target objects in subsequent frames. Recently, various VOS methods have combined matching-based transductive inference with online inductive learning to capture more precise spatiotemporal information and thereby improve segmentation accuracy. However, although these methods strengthen feature extraction, they do not fully fuse the different features they produce, leaving feature utilization inefficient. To address this inefficiency in feature fusion for SVOS, we propose an adaptive multi-feature fusion method in this letter. The method introduces a Foreground-Background Multi-feature Encoder to enrich feature diversity, and a Multi-feature Fusion Module to dynamically integrate spatiotemporal cues from both the foreground and the background. For each segmentation target, a Feature Fusion Reader autonomously selects and adaptively fuses multiple foreground-background features, optimizing across features and significantly improving target-specific fusion efficiency. Extensive experiments on the DAVIS 2017 and large-scale YouTube-VOS 2018/2019 datasets demonstrate that the proposed method achieves state-of-the-art performance.
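To make the idea of adaptive foreground-background fusion concrete, the following is a minimal PyTorch sketch of one plausible realization: a lightweight gating branch predicts per-pixel weights over the input feature maps, and the fused output is their weighted sum. The class and tensor shapes here are illustrative assumptions, not the authors' actual Feature Fusion Reader or Multi-feature Fusion Module.

```python
import torch
import torch.nn as nn


class AdaptiveMultiFeatureFusion(nn.Module):
    """Illustrative sketch (assumed design, not the paper's implementation):
    adaptively fuse N feature maps, e.g. foreground and background cues,
    via per-pixel softmax gating."""

    def __init__(self, channels: int, num_features: int = 2):
        super().__init__()
        self.num_features = num_features
        # Gating branch: concatenated features -> one weight map per feature.
        self.gate = nn.Sequential(
            nn.Conv2d(channels * num_features, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_features, kernel_size=1),
        )

    def forward(self, features):
        # features: list of num_features tensors, each of shape (B, C, H, W)
        stacked = torch.stack(features, dim=1)                 # (B, N, C, H, W)
        weights = self.gate(torch.cat(features, dim=1))        # (B, N, H, W)
        weights = torch.softmax(weights, dim=1).unsqueeze(2)   # (B, N, 1, H, W)
        return (weights * stacked).sum(dim=1)                  # (B, C, H, W)


if __name__ == "__main__":
    fusion = AdaptiveMultiFeatureFusion(channels=64, num_features=2)
    fg = torch.randn(1, 64, 30, 30)  # hypothetical foreground cue
    bg = torch.randn(1, 64, 30, 30)  # hypothetical background cue
    out = fusion([fg, bg])
    print(out.shape)  # torch.Size([1, 64, 30, 30])
```

Because the gate is conditioned on all input features jointly and normalized per pixel, the fusion weights can vary with the segmentation target, which is the general behavior the abstract attributes to its adaptive fusion components.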