2023, Vol. 62, No. 6, pp. 622-632
In this review paper, we report our model for recovering a 3D human mesh from a single monocular 2D image, called Deformable mesh transFormer (DeFormer) 1), which was published at the CVPR 2023 conference. While current state-of-the-art models achieve strong performance by exploiting the transformer architecture to model long-range dependencies among input tokens, they suffer from high computational cost because the standard transformer attention mechanism has complexity quadratic in the input sequence length. We therefore developed DeFormer, a human mesh recovery method equipped with two computationally efficient attention modules: 1) body-sparse self-attention and 2) Deformable Mesh cross-Attention (DMA). Experimental results show that DeFormer efficiently leverages multi-scale feature maps and a dense mesh, which was not possible with previous transformer approaches. As a result, DeFormer achieves state-of-the-art performance on the Human3.6M and 3DPW benchmarks. Code is available at https://github.com/yusukey03012/deformer.
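As a rough illustration of the deformable cross-attention idea, the sketch below shows a generic deformable-attention module in which each mesh/joint query samples only a small, fixed number of points from multi-scale feature maps instead of attending to every spatial location, so the cost grows with the number of sampled points rather than quadratically with the feature map size. All class names, shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation; please refer to the linked repository for the actual DeFormer code.

```python
# Hypothetical sketch of deformable cross-attention over multi-scale feature
# maps, in the spirit of DMA. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableCrossAttentionSketch(nn.Module):
    def __init__(self, dim=256, num_levels=3, num_points=4):
        super().__init__()
        self.num_levels = num_levels
        self.num_points = num_points
        # Each query (mesh-vertex / joint token) predicts 2D sampling offsets
        # and attention weights for a few points per feature-map level.
        self.offset_proj = nn.Linear(dim, num_levels * num_points * 2)
        self.weight_proj = nn.Linear(dim, num_levels * num_points)
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feature_maps):
        # queries:      (B, Q, C)  mesh/joint tokens
        # ref_points:   (B, Q, 2)  reference locations, normalized to [0, 1]
        # feature_maps: list of num_levels tensors, each (B, C, H_l, W_l)
        B, Q, C = queries.shape
        offsets = self.offset_proj(queries).view(
            B, Q, self.num_levels, self.num_points, 2)
        weights = self.weight_proj(queries).view(
            B, Q, self.num_levels * self.num_points)
        weights = weights.softmax(dim=-1).view(
            B, Q, self.num_levels, self.num_points)

        sampled = []
        for lvl, fmap in enumerate(feature_maps):
            # Project channels, keeping the spatial layout for sampling.
            value = self.value_proj(fmap.flatten(2).transpose(1, 2))  # (B, H*W, C)
            value = value.transpose(1, 2).view(B, C, *fmap.shape[-2:])
            # Sampling locations around each reference point, mapped to
            # the [-1, 1] grid_sample coordinate convention.
            loc = ref_points.unsqueeze(2) + offsets[:, :, lvl]        # (B, Q, P, 2)
            grid = 2.0 * loc - 1.0
            feat = F.grid_sample(value, grid, align_corners=False)    # (B, C, Q, P)
            # Weighted sum over the sampled points of this level.
            sampled.append((feat * weights[:, :, lvl].unsqueeze(1)).sum(-1))

        out = torch.stack(sampled, dim=0).sum(0).transpose(1, 2)      # (B, Q, C)
        return self.out_proj(out)
```

Because each query attends to only `num_levels * num_points` sampled locations, the attention cost is linear in the number of queries, which is what makes it feasible to use dense mesh tokens together with high-resolution, multi-scale feature maps.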