IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
A Fine-aware Vision Transformer for Precision Grasp Pose Detection
Trung MINH BUI, Jung-Hoon HWANG, Sewoong JUN, Wonha KIM, DongIn SHIN
Journal: Free Access / Advance Publication

Article ID: 2024EDP7261

Abstract

This paper develops a grasp pose detection method that achieves high success rates in real-world industrial environments where elongated objects are densely cluttered. Conventional Vision Transformer (ViT)-based methods capture fused feature maps that successfully encode comprehensive global object layouts, but they often suffer from a loss of spatial detail. Consequently, they predict grasp poses that avoid collisions effectively yet are imprecisely located. Motivated by these observations, we propose the Oriented Region-based Vision Transformer (OR-ViT), a network that preserves critical spatial details by extracting a fine-grained feature map directly from the shallowest layer of a ViT backbone, while also capturing global object layouts through the fused feature map. OR-ViT decodes precise grasp pose locations from the fine-grained feature map and integrates this information with its understanding of global object layouts from the fused map. In this way, OR-ViT predicts accurate grasp pose locations with reduced collision probability.
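Since the abstract describes the architecture only at a conceptual level, the following is a minimal, hypothetical PyTorch sketch of the two-branch idea: a fine-grained map taken from the shallowest ViT block and a fused map aggregated across all blocks, with grasp location decoded from the fine branch. All names (ORViTSketch, loc_head, angle_head, width_head), the fusion operator, and the hyperparameters are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ORViTSketch(nn.Module):
    """Illustrative two-branch ViT sketch; not the authors' OR-ViT code."""

    def __init__(self, dim=256, depth=6, num_heads=8):
        super().__init__()
        # Patch embedding: 16x16 non-overlapping patches (assumed patch size).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                        batch_first=True)
             for _ in range(depth)]
        )
        # Stand-in fusion: linear projection over concatenated block outputs.
        self.fuse = nn.Linear(dim * depth, dim)
        # Hypothetical heads: location from the fine map,
        # orientation/width from the fused map.
        self.loc_head = nn.Conv2d(dim, 1, kernel_size=1)
        self.angle_head = nn.Conv2d(dim, 2, kernel_size=1)  # sin/cos of angle
        self.width_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, x):
        tokens = self.patch_embed(x)                # (B, C, H/16, W/16)
        b, c, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, C)
        outs = []
        for blk in self.blocks:
            tokens = blk(tokens)
            outs.append(tokens)
        fine = outs[0]                              # shallowest layer: detail
        fused = self.fuse(torch.cat(outs, dim=-1))  # all layers: global layout

        def to_map(t):
            return t.transpose(1, 2).reshape(b, c, h, w)

        fine_map, fused_map = to_map(fine), to_map(fused)
        # Precise locations come from the fine branch; collision-relevant
        # orientation and width come from the globally fused branch.
        return (self.loc_head(fine_map),
                self.angle_head(fused_map),
                self.width_head(fused_map))

if __name__ == "__main__":
    net = ORViTSketch()
    q, ang, wid = net(torch.randn(1, 3, 224, 224))
    print(q.shape, ang.shape, wid.shape)  # (1,1,14,14) (1,2,14,14) (1,1,14,14)
```

The key design point the sketch mirrors is that the shallowest block's tokens bypass further attention layers, so the spatial detail they carry is not diluted before the location head reads it; the abstract does not specify the fusion operator, so a simple linear projection stands in here.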

Extensive experiments on the public Cornell and Jacquard datasets, as well as on our customized elongated-object dataset, verify that OR-ViT achieves performance competitive with state-of-the-art methods.

© 2025 The Institute of Electronics, Information and Communication Engineers