IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
A Fine-aware Vision Transformer for Precision Grasp Pose Detection
Trung MINH BUI, Jung-Hoon HWANG, Sewoong JUN, Wonha KIM, DongIn SHIN
Advance online publication

Article ID: 2024EDP7261

Abstract

This paper develops a grasp pose detection method that achieves high success rates in real-world industrial environments where elongated objects are densely cluttered. Conventional Vision Transformer (ViT)-based methods capture fused feature maps that successfully encode comprehensive global object layouts, but they often suffer from a loss of spatial detail. As a result, they predict grasp poses that effectively avoid collisions but are imprecisely located. Motivated by these observations, we propose the Oriented Region-based Vision Transformer (OR-ViT), a network that preserves critical spatial details by extracting a fine-grained feature map directly from the shallowest layer of a ViT backbone, while also understanding global object layouts through the fused feature map. OR-ViT decodes precise grasp pose locations from the fine-grained feature map and integrates this information into its understanding of global object layouts from the fused map. In this way, OR-ViT predicts accurate grasp pose locations with reduced collision probability.
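The two-stream idea described above can be illustrated with a brief sketch. The following minimal PyTorch module is a hypothetical illustration, not the authors' implementation: the module name, dimensions, learned layer-fusion weighting, and the per-pixel output format (quality, sin 2θ, cos 2θ, width; a common convention in grasp detection) are all assumptions made for clarity.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the two-stream design sketched in the abstract:
# one stream keeps the shallowest ViT layer (fine spatial detail), the
# other fuses all layers (global object layout). Names and dimensions
# are illustrative assumptions, not the paper's actual architecture.
class TwoStreamGraspHead(nn.Module):
    def __init__(self, embed_dim=768, num_layers=12, grid=14):
        super().__init__()
        self.grid = grid
        # Learned weights for fusing per-layer token maps into one map.
        self.fuse_weights = nn.Parameter(torch.ones(num_layers) / num_layers)
        self.fine_proj = nn.Conv2d(embed_dim, 128, kernel_size=1)
        self.global_proj = nn.Conv2d(embed_dim, 128, kernel_size=1)
        # Per-pixel grasp outputs: quality, sin(2θ), cos(2θ), width.
        self.head = nn.Conv2d(256, 4, kernel_size=1)

    def forward(self, layer_tokens):
        # layer_tokens: list of (B, N, C) token maps, one per ViT block,
        # with the CLS token already removed; N = grid * grid.
        def to_map(t):
            b, n, c = t.shape
            return t.transpose(1, 2).reshape(b, c, self.grid, self.grid)

        fine = to_map(layer_tokens[0])  # shallowest layer: fine detail
        w = torch.softmax(self.fuse_weights, dim=0)
        fused = sum(wi * to_map(t) for wi, t in zip(w, layer_tokens))
        feats = torch.cat([self.fine_proj(fine),
                           self.global_proj(fused)], dim=1)
        return self.head(feats)  # (B, 4, grid, grid) grasp map


# Example with dummy tokens from a 12-block ViT-Base on a 14x14 grid.
tokens = [torch.randn(2, 196, 768) for _ in range(12)]
out = TwoStreamGraspHead()(tokens)  # shape: (2, 4, 14, 14)
```

The key design point this sketch captures is that the shallow-layer map bypasses the fusion step, so location cues are not averaged away before the grasp head reads them.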

Extensive experiments on the public Cornell and Jacquard datasets, as well as on our customized elongated-object dataset, verify that OR-ViT achieves performance competitive with state-of-the-art methods.

© 2025 The Institute of Electronics, Information and Communication Engineers