ITE Transactions on Media Technology and Applications
Online ISSN : 2186-7364
ISSN-L : 2186-7364
Regular Section
[Paper] PSp-Transformer: A Transformer with Data-level Probabilistic Sparsity for Action Representation Learning
Jiaxin Zhou, Takashi Komuro

2024 Volume 12 Issue 1 Pages 123-132

Abstract

In this paper, we propose a method for action representation learning from spatiotemporal signals of salient pixel-value changes and salient skeleton motion cues, using both videos and skeleton sequences. The method simultaneously performs prediction of the positional relationships of movements with salient pixel-value changes using a vision transformer, and multimodal contrastive learning between the representations learned from videos and from skeleton sequences. Our method is unsupervised and does not rely on semantic annotations to associate input data with actions. Instead of entire videos, sparse parts of videos are used as training data, selected according to probabilistic values derived from the magnitude of the pixel-value changes of movements. In experiments under supervised settings, the proposed network showed remarkable generalization ability and higher accuracy; in experiments under unsupervised settings, our method achieved state-of-the-art performance. The experimental results demonstrate the superiority of the proposed method, which efficiently learns discriminative features.
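As a rough illustration of the data-level probabilistic sparsity described above, the following sketch samples sparse spatiotemporal patches from a video with probability proportional to the magnitude of their pixel-value changes. The function name, the patching scheme, and the scoring function are our assumptions for illustration only; the abstract does not specify the paper's actual implementation.

import numpy as np

def sample_salient_patches(video, num_patches, patch_size=16, rng=None):
    """Select sparse spatiotemporal patches with probability proportional
    to the magnitude of their pixel-value changes.

    video: array of shape (T, H, W), grayscale frames in [0, 1].
    Returns a list of (t, y, x) top-left patch coordinates.

    NOTE: a hypothetical sketch; the paper's actual patching and
    sampling scheme may differ.
    """
    rng = rng or np.random.default_rng()
    T, H, W = video.shape
    # Frame-to-frame absolute differences capture movement saliency.
    diff = np.abs(np.diff(video, axis=0))  # shape (T-1, H, W)

    # Aggregate the change magnitude over each candidate patch.
    coords, scores = [], []
    for t in range(T - 1):
        for y in range(0, H - patch_size + 1, patch_size):
            for x in range(0, W - patch_size + 1, patch_size):
                coords.append((t, y, x))
                scores.append(diff[t, y:y + patch_size, x:x + patch_size].sum())

    # Turn magnitudes into a probability distribution and sample without
    # replacement, so training sees only sparse, salient parts of the video.
    scores = np.asarray(scores, dtype=np.float64) + 1e-8  # avoid an all-zero distribution
    probs = scores / scores.sum()
    idx = rng.choice(len(coords), size=num_patches, replace=False, p=probs)
    return [coords[i] for i in idx]

Under this scheme, patches in regions with large movements are more likely to enter the training data, while mostly static regions are rarely sampled.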

© 2024 The Institute of Image Information and Television Engineers