ITE Transactions on Media Technology and Applications
Online ISSN : 2186-7364
ISSN-L : 2186-7364
Regular Section
[Paper] PSp-Transformer: A Transformer with Data-level Probabilistic Sparsity for Action Representation Learning
Jiaxin Zhou, Takashi Komuro

2024 Volume 12 Issue 1 Pages 123-132

Abstract

In this paper, we propose a method for action representation learning from spatiotemporal signals of salient pixel-value changes and salient skeleton motion cues, using both videos and skeleton sequences. The method simultaneously performs prediction of the positional relationships of movements with salient pixel-value changes using a vision transformer, and multimodal contrastive learning between the representations learned from videos and from skeleton sequences. Our method is unsupervised and does not rely on semantic annotations to associate input data with actions. Instead of entire videos, sparse parts of videos are used as training data, selected according to probabilistic values derived from the magnitude of the pixel-value changes of movements. In experiments under supervised settings, the proposed network showed remarkable generalization ability and higher accuracy; in experiments under unsupervised settings, our method achieved state-of-the-art performance. The experimental results demonstrate the superiority of the proposed method, which efficiently learns discriminative features.
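As a rough illustration of the data-level probabilistic sparsity described above, the following sketch samples sparse spatiotemporal patches from a video with probability proportional to the magnitude of their pixel-value changes. The function name, the patching scheme, and the scoring function are our assumptions for illustration only; the abstract does not specify the paper's actual implementation.

import numpy as np

def sample_salient_patches(video, num_patches, patch_size=16, rng=None):
    """Select sparse spatiotemporal patches with probability proportional
    to the magnitude of their pixel-value changes.

    video: array of shape (T, H, W), grayscale frames in [0, 1].
    Returns a list of (t, y, x) top-left patch coordinates.

    NOTE: a hypothetical sketch; the paper's actual patching and
    sampling scheme may differ.
    """
    rng = rng or np.random.default_rng()
    T, H, W = video.shape
    # Frame-to-frame absolute differences capture movement saliency.
    diff = np.abs(np.diff(video, axis=0))  # shape (T-1, H, W)

    # Aggregate the change magnitude over each candidate patch.
    coords, scores = [], []
    for t in range(T - 1):
        for y in range(0, H - patch_size + 1, patch_size):
            for x in range(0, W - patch_size + 1, patch_size):
                coords.append((t, y, x))
                scores.append(diff[t, y:y + patch_size, x:x + patch_size].sum())

    # Turn magnitudes into a probability distribution and sample without
    # replacement, so training sees only sparse, salient parts of the video.
    scores = np.asarray(scores, dtype=np.float64) + 1e-8  # avoid an all-zero distribution
    probs = scores / scores.sum()
    idx = rng.choice(len(coords), size=num_patches, replace=False, p=probs)
    return [coords[i] for i in idx]

Under this scheme, patches in regions with large movements are more likely to enter the training data, while mostly static regions are rarely sampled.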

© 2024 The Institute of Image Information and Television Engineers