Paper ID: 2024EAP1162
The skeleton modality provides an efficient representation of human pose. However, because it discards appearance information, it can perform poorly on tasks that require such cues. To address this, we propose Pose Feature Map Enhanced Skeleton Representation (PFMESR), a multimodal skeleton representation that integrates intermediate feature maps from a pose estimation network. Specifically, we estimate the joint positions of the human body in a video and extract the local features associated with each joint from the feature maps of the pose estimation network. These local features are then aligned and fused with the skeleton features in the action recognition network. We believe that the feature maps of the pose estimation network contain rich appearance information that complements the skeleton information. Experiments on multiple datasets demonstrate that this approach significantly improves action recognition performance and yields favorable results on the Action-Identity Recognition task, confirming the effectiveness of incorporating appearance information from pose estimation feature maps. We also investigate how PFMESR's performance varies with sampling depth and sampling range to characterize its behavior under different parameter settings. Additionally, we validate the generality of PFMESR by applying it to a variety of skeleton-based methods. Our method surpasses the state of the art on multiple skeleton-based action recognition benchmarks, achieving accuracies of 94.6% on the NTU RGB+D 60 cross-subject split, 97.7% on the NTU RGB+D 60 cross-view split, and 93.1% on the NTU RGB+D 120 cross-subject split.
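To make the joint-aligned sampling and fusion concrete, below is a minimal PyTorch sketch. The module name `JointFeatureFusion`, all tensor shapes, the bilinear `grid_sample` sampling, and the fusion by concatenation followed by a linear projection are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of sampling per-joint local features from a pose-estimation
# feature map and fusing them with skeleton features. Shapes, layer sizes, and
# the concatenation-based fusion are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointFeatureFusion(nn.Module):
    def __init__(self, map_channels: int, skel_channels: int, out_channels: int):
        super().__init__()
        # Project the concatenated (appearance + skeleton) feature per joint.
        self.proj = nn.Linear(map_channels + skel_channels, out_channels)

    def forward(self, feat_map, joints, skel_feat):
        """
        feat_map:  (B, C, H, W)  intermediate feature map from a pose estimator
        joints:    (B, J, 2)     joint (x, y) coordinates normalized to [-1, 1]
        skel_feat: (B, J, D)     per-joint skeleton features from the recognizer
        returns:   (B, J, out_channels) fused per-joint representation
        """
        # grid_sample expects a (B, H_out, W_out, 2) grid; treating the J
        # joints as a 1 x J grid samples one local feature vector per joint.
        grid = joints.unsqueeze(1)                       # (B, 1, J, 2)
        local = F.grid_sample(feat_map, grid,
                              mode='bilinear', align_corners=False)
        local = local.squeeze(2).transpose(1, 2)         # (B, J, C)
        # Fuse appearance and skeleton cues per joint.
        return self.proj(torch.cat([local, skel_feat], dim=-1))

# Usage with toy shapes: 2 clips, 17 joints, 64-channel pose feature maps.
fusion = JointFeatureFusion(map_channels=64, skel_channels=128, out_channels=128)
feat_map = torch.randn(2, 64, 56, 56)
joints = torch.rand(2, 17, 2) * 2 - 1                    # normalized coords
skel_feat = torch.randn(2, 17, 128)
fused = fusion(feat_map, joints, skel_feat)              # (2, 17, 128)
```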