2025 Volume E108.A Issue 3 Pages 332-341
Action quality assessment (AQA) has gained prominence owing to its wide range of applications. Most existing methods regress scores directly from single or pairwise videos, which yields redundant temporal features and a limited set of views, both of which degrade scoring. Moreover, direct regression supervises only the last layer, which makes the intermediate layers hard to optimize, for example because of vanishing gradients. To this end, we propose a Hierarchical Joint Training based Replay-Guided Contrastive Transformer with a temporal concentration module. In the network architecture, we add a contrastive module that takes the input video and its replay; the consistency of the two scores guides the model to learn features of the same action under different views. The temporal concentration module extracts concentrated features such as errors or highlights, which are crucial factors in scoring. The proposed hierarchical joint training supervises both shallow and deep layers, improving scoring performance and speeding up training convergence. Extensive experiments demonstrate that our method achieves a Spearman’s rank correlation of 0.9642 on the RFSJ dataset, a new state-of-the-art result.
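The metric reported above, Spearman's rank correlation, measures how well the ordering of predicted scores agrees with the ordering of ground-truth scores. The following is a minimal sketch of that standard metric, not the authors' evaluation code; the function names `average_ranks` and `spearman_rank_correlation` and the sample scores are hypothetical and chosen only for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rank_correlation(predicted, target):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(target)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five clips (illustrative values, not from the paper).
predicted_scores = [61.2, 70.5, 55.0, 88.1, 79.3]
target_scores = [60.0, 72.0, 58.5, 90.0, 75.0]
rho = spearman_rank_correlation(predicted_scores, target_scores)
```

Because the coefficient depends only on rank order, a model can score close to 1 while its absolute score values are still offset from the ground truth.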