International Journal of Affective Engineering
Online ISSN : 2187-5413
ISSN-L : 2187-5413


Transformer-based Siamese and Triplet Networks for Facial Expression Intensity Estimation
Advance online publication

Article ID: IJAE-D-22-00011


Recognizing facial expressions and estimating the intensities of their corresponding action units have achieved many milestones. However, such estimation remains challenging because of subtle variations in action units during emotional arousal. Recent approaches are limited by the characteristics of the probabilistic models they use to capture relationships among action units. Exploiting the ordinal relationships across an emotional transition sequence, we propose two metric learning approaches, based on self-attention triplet and Siamese networks, to estimate emotional intensities. Our emotion expert branches use the shifted-window (Swin) Transformer, which restricts self-attention computation to non-overlapping local windows while still allowing cross-window connections. This offers flexible, high-performance modeling of action units at various scales. We evaluated our networks' spatial and temporal feature localization on the CK+, KDEF-dyn, AFEW, SAMM, and CASME-II datasets. They outperform state-of-the-art deep learning methods in micro-expression detection on the latter two datasets by 2.4% and 2.6% UAR, respectively. Ablation studies with a thorough analysis highlight the strength of our design.
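The metric learning objective described above can be illustrated with a standard triplet margin loss: an anchor frame is pulled toward a positive frame of similar expression intensity and pushed away from a negative frame of different intensity. The sketch below is a minimal NumPy illustration of this generic loss, not the authors' implementation; the toy embeddings stand in for features that would come from the Swin-Transformer branches.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on embedding vectors.

    Encourages the anchor to lie closer to the positive (a frame of
    similar expression intensity) than to the negative (a frame of a
    different intensity) by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive)  # distance to same-intensity frame
    d_neg = np.linalg.norm(anchor - negative)  # distance to different-intensity frame
    return max(0.0, d_pos - d_neg + margin)

# Toy 4-D embeddings (hypothetical; real features would be transformer outputs)
a = np.array([0.1, 0.2, 0.3, 0.4])
p = np.array([0.1, 0.25, 0.3, 0.35])  # near the anchor -> satisfied triplet
n = np.array([0.9, 0.8, 0.1, 0.0])    # far from the anchor
loss = triplet_loss(a, p, n)          # zero, since d_pos + margin < d_neg
```

A Siamese variant uses the same distance idea with pairs rather than triplets, contrasting same-intensity and different-intensity pairs instead of ranking three samples at once.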

© 2022 Japan Society of Kansei Engineering