ITE Transactions on Media Technology and Applications
Online ISSN: 2186-7364
ISSN-L: 2186-7364
Regular Section
[Paper] PSp-Transformer: A Transformer with Data-level Probabilistic Sparsity for Action Representation Learning
Jiaxin Zhou, Takashi Komuro

2024, Volume 12, Issue 1, pp. 123-132

Abstract

In this paper, we propose a method for action representation learning from spatiotemporal signals of salient pixel-value changes and salient skeleton motion cues, using both videos and skeleton sequences. The method simultaneously performs two tasks: predicting the positional relationships of movements with salient pixel-value changes using a vision transformer, and multimodal contrastive learning between the representations learned from videos and from skeleton sequences. Our method is unsupervised and does not rely on semantic annotations to associate input data with actions. Instead of entire videos, sparse parts of videos are used as training data; these parts are selected with probabilities proportional to the magnitude of the pixel-value changes caused by movements. In experiments under supervised settings, the proposed network showed strong generalization ability and higher accuracies. In experiments under unsupervised settings, our method achieved state-of-the-art performance. These results demonstrate that the proposed method efficiently learns discriminative features.
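To illustrate the data-level probabilistic sparsity described in the abstract, the following is a minimal sketch, not the authors' implementation: frame-difference magnitudes are aggregated per spatiotemporal patch, normalized into a probability distribution, and a sparse subset of patches is sampled in proportion to those probabilities. All function names, shapes, and parameters here are hypothetical.

```python
import numpy as np

def sample_sparse_patches(video, num_samples=16, patch=16, rng=None):
    """Hypothetical sketch of data-level probabilistic sparsity:
    sample spatiotemporal patches with probability proportional to
    the magnitude of pixel-value change between consecutive frames.

    video: array of shape (T, H, W), grayscale frames in [0, 1].
    Returns a list of (t, y, x) top-left patch coordinates.
    """
    rng = rng or np.random.default_rng()
    T, H, W = video.shape

    # Frame differencing: salient motion yields large pixel-value changes.
    diff = np.abs(np.diff(video, axis=0))  # shape (T-1, H, W)

    # Aggregate the change magnitude over non-overlapping patches.
    th, ph, pw = diff.shape[0], H // patch, W // patch
    scores = diff[:, :ph * patch, :pw * patch] \
        .reshape(th, ph, patch, pw, patch).sum(axis=(2, 4))  # (T-1, ph, pw)

    # Normalize to a probability distribution over all patches.
    probs = scores.flatten()
    probs = probs / probs.sum()

    # Draw a sparse subset of patch indices in proportion to change size;
    # sampling without replacement keeps the subset non-redundant.
    idx = rng.choice(probs.size, size=num_samples, replace=False, p=probs)
    t, y, x = np.unravel_index(idx, scores.shape)
    return list(zip(t.tolist(), (y * patch).tolist(), (x * patch).tolist()))

# Example: 32 frames of 128x128 video; patches with larger pixel-value
# changes are more likely to be selected as training data.
video = np.random.rand(32, 128, 128).astype(np.float32)
print(sample_sparse_patches(video, num_samples=8))
```

Under these assumptions, only the sampled patches (rather than entire videos) would be fed to the transformer, which is what makes the training data sparse.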

© 2024 The Institute of Image Information and Television Engineers