Paper ID: 2024EDP7220
Deep neural network (DNN) pruning is a popular method for accelerating computations in DNNs by removing unimportant parameters. Among pruning methods, tile-wise pruning (TWP) achieves significant acceleration with minimal pruning loss. However, TWP suffers from load imbalance when important weight elements in the matrices of the DNN are unevenly distributed. To address this issue, we propose adaptive tile pruning (ATP), an integrative solver for building sparse DNNs with controllably balanced workloads. ATP comprises three components: hierarchical tile pruning (HTP), split-tiled sparse matrix multiplication (STSpMM), and adaptive pattern selection (APS). HTP constructs sparse matrices with evenly distributable workloads while preserving DNN model accuracy. STSpMM efficiently handles HTP-generated sparse matrices on GPUs by splitting and redistributing large workloads. APS dynamically selects pruning patterns for HTP and grid sizes for STSpMM based on the problem sizes in the target DNN. We evaluated our approach on pruned ResNet-18 and ResNet-34 models using ImageNet, and on BERT-Small using the question-answering natural language inference (QNLI) task. Results demonstrate that ATP achieves greater acceleration than previous methods while maintaining inference accuracy.
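To make the tile-wise pruning idea underlying this work concrete, the following is a minimal illustrative sketch, not the paper's HTP/ATP algorithm: it scores each fixed-size tile of a weight matrix by its L1 norm and zeroes the lowest-scoring tiles. The function name `tile_wise_prune`, the tile size, and the keep ratio are all assumptions chosen for illustration.

```python
# Illustrative sketch only: generic tile-wise magnitude pruning.
# Not the paper's HTP/ATP method; tile size and keep ratio are assumed values.
import numpy as np

def tile_wise_prune(weight: np.ndarray, tile: int = 32, keep_ratio: float = 0.25) -> np.ndarray:
    """Zero out the least-important tiles of a 2-D weight matrix.

    Each (tile x tile) block is scored by its L1 norm; only the top
    `keep_ratio` fraction of tiles is kept.
    """
    rows, cols = weight.shape
    assert rows % tile == 0 and cols % tile == 0, "pad the matrix first"

    # Score every tile by the sum of absolute weights inside it.
    blocks = weight.reshape(rows // tile, tile, cols // tile, tile)
    scores = np.abs(blocks).sum(axis=(1, 3))           # shape: (rows/tile, cols/tile)

    # Keep the highest-scoring tiles, prune the rest.
    n_keep = max(1, int(keep_ratio * scores.size))
    threshold = np.partition(scores.ravel(), -n_keep)[-n_keep]
    tile_mask = (scores >= threshold).astype(weight.dtype)  # 1 = keep, 0 = prune

    # Broadcast the tile-level mask back to element granularity.
    full_mask = np.kron(tile_mask, np.ones((tile, tile), dtype=weight.dtype))
    return weight * full_mask

# Example: prune a random 128x128 weight matrix, keeping ~25% of its tiles.
w = np.random.randn(128, 128).astype(np.float32)
w_sparse = tile_wise_prune(w)
print(f"density after pruning: {np.count_nonzero(w_sparse) / w_sparse.size:.2f}")
```

In such a scheme, if the high-scoring tiles cluster in a few rows of the matrix, the per-row nonzero counts become uneven, which is the load-imbalance problem the abstract attributes to TWP and that ATP is designed to address.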