IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
A Fine-aware Vision Transformer for Precision Grasp Pose Detection
Trung MINH BUI, Jung-Hoon HWANG, Sewoong JUN, Wonha KIM, DongIn SHIN
Journal: Free Access / Advance Publication

Article ID: 2024EDP7261

Abstract

This paper develops a grasp pose detection method that achieves high success rates in real-world industrial environments where elongated objects are densely cluttered. Conventional Vision Transformer (ViT)-based methods capture fused feature maps that successfully encode comprehensive global object layouts, but they often suffer from a loss of spatial detail. Consequently, they predict grasp poses that avoid collisions effectively yet are imprecisely located. Motivated by these observations, we propose the Oriented Region-based Vision Transformer (OR-ViT), a network that preserves critical spatial details by extracting a fine-grained feature map directly from the shallowest layer of a ViT backbone, while also capturing global object layouts through the fused feature map. OR-ViT decodes precise grasp pose locations from the fine-grained feature map and integrates this information with its understanding of global object layouts from the fused map. In this way, OR-ViT predicts accurate grasp pose locations with reduced collision probability.
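Since the abstract describes the architecture only at a conceptual level, the following is a minimal, hypothetical PyTorch sketch of the two-branch idea: a fine-grained map taken from the shallowest ViT block and a fused map aggregated across all blocks, with grasp location decoded from the fine branch. All names (ORViTSketch, loc_head, angle_head, width_head), the fusion operator, and the hyperparameters are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ORViTSketch(nn.Module):
    """Illustrative two-branch ViT sketch; not the authors' OR-ViT code."""

    def __init__(self, dim=256, depth=6, num_heads=8):
        super().__init__()
        # Patch embedding: 16x16 non-overlapping patches (assumed patch size).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                        batch_first=True)
             for _ in range(depth)]
        )
        # Stand-in fusion: linear projection over concatenated block outputs.
        self.fuse = nn.Linear(dim * depth, dim)
        # Hypothetical heads: location from the fine map,
        # orientation/width from the fused map.
        self.loc_head = nn.Conv2d(dim, 1, kernel_size=1)
        self.angle_head = nn.Conv2d(dim, 2, kernel_size=1)  # sin/cos of angle
        self.width_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, x):
        tokens = self.patch_embed(x)                # (B, C, H/16, W/16)
        b, c, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, C)
        outs = []
        for blk in self.blocks:
            tokens = blk(tokens)
            outs.append(tokens)
        fine = outs[0]                              # shallowest layer: detail
        fused = self.fuse(torch.cat(outs, dim=-1))  # all layers: global layout

        def to_map(t):
            return t.transpose(1, 2).reshape(b, c, h, w)

        fine_map, fused_map = to_map(fine), to_map(fused)
        # Precise locations come from the fine branch; collision-relevant
        # orientation and width come from the globally fused branch.
        return (self.loc_head(fine_map),
                self.angle_head(fused_map),
                self.width_head(fused_map))

if __name__ == "__main__":
    net = ORViTSketch()
    q, ang, wid = net(torch.randn(1, 3, 224, 224))
    print(q.shape, ang.shape, wid.shape)  # (1,1,14,14) (1,2,14,14) (1,1,14,14)
```

The key design point the sketch mirrors is that the shallowest block's tokens bypass further attention layers, so the spatial detail they carry is not diluted before the location head reads it; the abstract does not specify the fusion operator, so a simple linear projection stands in here.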

Extensive experiments on the public Cornell and Jacquard datasets, as well as on our customized elongated-object dataset, verify that OR-ViT achieves performance competitive with state-of-the-art methods.

© 2025 The Institute of Electronics, Information and Communication Engineers