ITE Transactions on Media Technology and Applications
Online ISSN: 2186-7364
ISSN-L: 2186-7364
Regular Section
[Paper] PSp-Transformer: A Transformer with Data-level Probabilistic Sparsity for Action Representation Learning
Jiaxin Zhou, Takashi Komuro

2024, Volume 12, Issue 1, pp. 123-132

Abstract

In this paper, we propose a method for action representation learning from spatiotemporal signals of salient pixel-value changes and salient skeleton motion cues, using both videos and skeleton sequences. The method simultaneously performs two tasks: predicting the positional relationships of movements with salient pixel-value changes using a vision transformer, and multimodal contrastive learning between the representations learned from videos and from skeleton sequences. Our method is unsupervised and does not rely on semantic annotations to associate input data with actions. Instead of entire videos, sparse parts of videos are used as training data; these parts are selected with probabilities proportional to the magnitude of the pixel-value changes caused by movements. In experiments under supervised settings, the proposed network showed strong generalization ability and higher accuracies. In experiments under unsupervised settings, our method achieved state-of-the-art performance. These results demonstrate that the proposed method efficiently learns discriminative features.
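To illustrate the data-level probabilistic sparsity described in the abstract, the following is a minimal sketch, not the authors' implementation: frame-difference magnitudes are aggregated per spatiotemporal patch, normalized into a probability distribution, and a sparse subset of patches is sampled in proportion to those probabilities. All function names, shapes, and parameters here are hypothetical.

```python
import numpy as np

def sample_sparse_patches(video, num_samples=16, patch=16, rng=None):
    """Hypothetical sketch of data-level probabilistic sparsity:
    sample spatiotemporal patches with probability proportional to
    the magnitude of pixel-value change between consecutive frames.

    video: array of shape (T, H, W), grayscale frames in [0, 1].
    Returns a list of (t, y, x) top-left patch coordinates.
    """
    rng = rng or np.random.default_rng()
    T, H, W = video.shape

    # Frame differencing: salient motion yields large pixel-value changes.
    diff = np.abs(np.diff(video, axis=0))  # shape (T-1, H, W)

    # Aggregate the change magnitude over non-overlapping patches.
    th, ph, pw = diff.shape[0], H // patch, W // patch
    scores = diff[:, :ph * patch, :pw * patch] \
        .reshape(th, ph, patch, pw, patch).sum(axis=(2, 4))  # (T-1, ph, pw)

    # Normalize to a probability distribution over all patches.
    probs = scores.flatten()
    probs = probs / probs.sum()

    # Draw a sparse subset of patch indices in proportion to change size;
    # sampling without replacement keeps the subset non-redundant.
    idx = rng.choice(probs.size, size=num_samples, replace=False, p=probs)
    t, y, x = np.unravel_index(idx, scores.shape)
    return list(zip(t.tolist(), (y * patch).tolist(), (x * patch).tolist()))

# Example: 32 frames of 128x128 video; patches with larger pixel-value
# changes are more likely to be selected as training data.
video = np.random.rand(32, 128, 128).astype(np.float32)
print(sample_sparse_patches(video, num_samples=8))
```

Under these assumptions, only the sampled patches (rather than entire videos) would be fed to the transformer, which is what makes the training data sparse.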

© 2024 The Institute of Image Information and Television Engineers