Paper ID: 2024EDP7220
Deep neural network (DNN) pruning is a popular method for accelerating computations in DNNs by removing unimportant parameters. Among pruning methods, tile-wise pruning (TWP) achieves significant acceleration with minimal pruning loss. However, TWP suffers from load imbalance when important weight elements in the matrices of the DNN are unevenly distributed. To address this issue, we propose adaptive tile pruning (ATP), an integrative solver for building sparse DNNs with controllably balanced workloads. ATP comprises three components: hierarchical tile pruning (HTP), split-tiled sparse matrix multiplication (STSpMM), and adaptive pattern selection (APS). HTP constructs sparse matrices with evenly distributable workloads while preserving DNN model accuracy. STSpMM efficiently handles HTP-generated sparse matrices on GPUs by splitting and redistributing large workloads. APS dynamically selects pruning patterns for HTP and grid sizes for STSpMM based on the problem sizes in the target DNN. We evaluated our approach on pruned ResNet-18 and ResNet-34 models using ImageNet, and on BERT-Small using the question-answering natural language inference (QNLI) task. Results demonstrate that ATP achieves greater acceleration than previous methods while maintaining inference accuracy.
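To make the tile-wise pruning idea underlying this work concrete, the following is a minimal illustrative sketch, not the paper's HTP/ATP algorithm: it scores each fixed-size tile of a weight matrix by its L1 norm and zeroes the lowest-scoring tiles. The function name `tile_wise_prune`, the tile size, and the keep ratio are all assumptions chosen for illustration.

```python
# Illustrative sketch only: generic tile-wise magnitude pruning.
# Not the paper's HTP/ATP method; tile size and keep ratio are assumed values.
import numpy as np

def tile_wise_prune(weight: np.ndarray, tile: int = 32, keep_ratio: float = 0.25) -> np.ndarray:
    """Zero out the least-important tiles of a 2-D weight matrix.

    Each (tile x tile) block is scored by its L1 norm; only the top
    `keep_ratio` fraction of tiles is kept.
    """
    rows, cols = weight.shape
    assert rows % tile == 0 and cols % tile == 0, "pad the matrix first"

    # Score every tile by the sum of absolute weights inside it.
    blocks = weight.reshape(rows // tile, tile, cols // tile, tile)
    scores = np.abs(blocks).sum(axis=(1, 3))           # shape: (rows/tile, cols/tile)

    # Keep the highest-scoring tiles, prune the rest.
    n_keep = max(1, int(keep_ratio * scores.size))
    threshold = np.partition(scores.ravel(), -n_keep)[-n_keep]
    tile_mask = (scores >= threshold).astype(weight.dtype)  # 1 = keep, 0 = prune

    # Broadcast the tile-level mask back to element granularity.
    full_mask = np.kron(tile_mask, np.ones((tile, tile), dtype=weight.dtype))
    return weight * full_mask

# Example: prune a random 128x128 weight matrix, keeping ~25% of its tiles.
w = np.random.randn(128, 128).astype(np.float32)
w_sparse = tile_wise_prune(w)
print(f"density after pruning: {np.count_nonzero(w_sparse) / w_sparse.size:.2f}")
```

In such a scheme, if the high-scoring tiles cluster in a few rows of the matrix, the per-row nonzero counts become uneven, which is the load-imbalance problem the abstract attributes to TWP and that ATP is designed to address.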