2023 Volume E106.D Issue 6 Pages 1165-1174
We propose a framework for the integration of heterogeneous networks in human pose estimation (HPE) with the aim of balancing accuracy and computational complexity. Although many existing methods can improve the accuracy of HPE using multiple frames in videos, they also increase the computational complexity. The key difference here is that the proposed heterogeneous framework has various networks for different types of frames, while existing methods use the same networks for all frames. In particular, we propose to divide the video frames into two types, including key frames and non-key frames, and adopt three networks including slow networks, fast networks, and transfer networks in our heterogeneous framework. For key frames, a slow network is used that has high accuracy but high computational complexity. For non-key frames that follow a key frame, we propose to warp the heatmap of a slow network from a key frame via a transfer network and fuse it with a fast network that has low accuracy but low computational complexity. Furthermore, when extending to the usage of long-term frames where a large number of non-key frames follow a key frame, the temporal correlation decreases. Therefore, when necessary, we use an additional transfer network that warps the heatmap from a neighboring non-key frame. The experimental results on PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed FSPose achieves a better balance between accuracy and computational complexity than the competitor method. Our source code is available at https://github.com/Fenax79/fspose.