2024, Vol. 64, No. 6, pp. 1019–1028
Real-time object detectors deployed on general-purpose graphics processing units (GPUs) or embedded devices allow their mass usage in industrial applications at an affordable cost. However, it is difficult for existing state-of-the-art object detectors to meet the requirements of high accuracy and low inference latency simultaneously on general-purpose devices in industrial applications. In this work, we propose RDDPA, a fast and accurate defect detection framework. RDDPA adopts a novel end-to-end pruning scheme, which can prune the detection network from scratch and achieve real-time detection on general-purpose devices. Additionally, we have developed a new training scheme to minimize the accuracy loss associated with the pruning process. Experimental results on a standard steel surface defect dataset indicate that our model achieves 79.2% mAP (mean Average Precision) at 103.7 FPS (Frames Per Second) on a single mid-end Titan X GPU as well as 40.1 FPS on a single low-end GTX 960M GPU, and outperforms the state-of-the-art defect detectors by about a 20× speedup with comparable or higher accuracy.
Currently, there is a plethora of outstanding research in the field of hot-rolled strip surface defect recognition, with steadily increasing accuracy.1,2,3) However, a simple classification algorithm can neither locate a defect precisely nor judge its size, which is not conducive to the statistical analysis of defects in the factory. Object detection, one of the major tasks in computer vision and pattern recognition, has been widely applied in industry with the great breakthrough of convolutional neural networks (CNNs), especially for automated defect inspection (ADI).4,5,6) Current state-of-the-art object detectors have achieved high accuracy on benchmark datasets when large-scale backbone networks are applied.7,8,9) However, detectors built on large networks usually incur high computational cost. For instance, EfficientDet-D7x has more than 300 billion floating-point operations (FLOPs) and 560 MB of storage space, so it is unaffordable to deploy such models on platforms with limited resources such as personal computers, embedded devices, and mobile devices.10) To address this issue, lightweight object detection frameworks such as YOLO-Lite and SSD-Lite have been developed to improve detection efficiency.11,12) However, reducing the detection cost in this way leads to a significant decrease in accuracy.
In recent years, model compression techniques, especially the weight pruning method, have been widely considered one of the most effective ways to reduce computational cost, memory footprint, and storage intensity without sacrificing too much accuracy.13,14,15) By removing a huge number of redundant weights, a model with a smaller scale and lower energy consumption can be efficiently generated. However, traditional model pruning algorithms mostly serve classification tasks and generally consist of a three-stage pipeline, i.e., training with sparsity, pruning, and fine-tuning. This strategy typically involves a cumbersome and time-consuming weight optimization procedure, especially for object detection.16)
In this work, a new one-stage detector is proposed, which provides an optimal trade-off between detection accuracy and speed on the detection task. Specifically, we explore the potential relationships between the pruning structure and the weights, the pruning rate, and the training scheme, respectively, and show that the pruning structure of the object detection network can be directly obtained from the weights pre-trained on ImageNet.17) Furthermore, a straightforward and effective pruning scheme is designed to overcome the key challenge of traditional channel pruning, namely its cumbersome and time-consuming three-stage weight optimization process. To the best of our knowledge, this is the first time that real end-to-end pruning has been achieved for an object detection network, and the pruned network is easier to deploy on resource-constrained platforms in terms of model size, runtime memory, and computational operations. In summary, the main contributions of this work are as follows:
1) A new one-stage defect detector has been designed, which achieves high accuracy on the defect detection task.
2) A concise and efficient end-to-end model pruning pipeline has been proposed, which can greatly compress network parameters for real-time detection.
3) Our proposed model can achieve high detection accuracy and speed simultaneously with only 0.6 M weights and 7.1 MB storage footprint.
Generally, CNN-based object detectors can be roughly divided into two categories: two-stage and one-stage detectors.7,18,19)
2.1.1. Two-stage Object Detectors
The two-stage object detector utilizes a region proposal network (RPN) to extract regions of interest (ROIs), and then uses an R-CNN head to perform fine-grained classification and bounding-box regression on each object. Faster R-CNN is the first anchor-based object detector of this kind and achieves high performance on challenging public benchmark datasets.7) Hao et al. introduced deformable convolution and a balanced feature pyramid structure into Faster R-CNN and reported 80.5% mAP on the NEU-DET dataset.20) He et al. added a multilevel-feature fusion network (MFN) module and realized a high detection accuracy of 82.3% mAP on the same dataset.21) Although high accuracy was achieved, the two-stage detection procedure also brought high computational cost and inference latency.
2.1.2. One-stage Object Detectors
A one-stage object detector eliminates the RPN and directly regresses all candidate anchor boxes, e.g., SSD, YOLO, RetinaNet, and ATSS.18,19,22,23) Therefore, these methods can achieve a better trade-off between detection accuracy and speed. Cheng et al. employed RetinaNet with different channel attention and adaptive spatial feature fusion modules on the NEU-DET dataset, and reported 79.1% mAP at 12.3 FPS.8) Recently, the anchor-free strategy has received extensive attention because it eliminates the anchor mechanism in one-stage detectors and further accelerates inference, e.g., FCOS, CenterNet, and TTFNet.9,24,25) However, most one-stage object detectors are largely designed for the recommended high-end hardware systems and suffer from high inference latency on general-purpose GPUs. To address this issue, lightweight one-stage object detection networks such as YOLO-Lite and SSD-Lite have been proposed because they can be easily deployed on low-end GPUs and even mobile devices.11,12) Nevertheless, these works typically come at the expense of detection accuracy.
2.2. CNN-based Model Pruning
Most CNN-based model pruning algorithms can be divided into three categories, i.e., unstructured, pattern-based, and structured pruning.
2.2.1. Unstructured Pruning
The unstructured pruning method prunes weights at any position in the 2-D weight matrix, as shown in Fig. 1(a). These algorithms can reduce the amount of actual computation, and a high pruning rate can be achieved due to their high flexibility.26) However, extra indexes must be introduced in the convolution kernel to record the positions of non-zero weights, which reduces the inference speed in GPU/CPU implementations and leads to irregularities in the 2-D weight matrix, resulting in poor hardware parallelism.27,28)
2.2.2. Pattern-based Pruning
The pattern-based pruning approach analyzes the locations of important weights in each original 3×3 convolution kernel and assigns specific patterns from a predefined fixed-size library, as illustrated in Fig. 1(b). Additionally, the connection pruning strategy removes entire kernels and the corresponding connections to obtain a higher global compression ratio. Nevertheless, pattern-based pruning algorithms also suffer from some challenges, which greatly limit their application in certain scenarios. Compiler support is necessary to achieve high inference acceleration because the irregularity in the 2-D weight matrix is not completely eliminated.29) Furthermore, the design is specifically tailored to 3×3 convolution kernels, which limits its applicability to other network layers.28,29)
2.2.3. Structured Pruning
The structured pruning method has been widely adopted as an effective model compression technique in recent years because of its ease of deployment on general-purpose GPUs/CPUs. Figure 1(c) illustrates the process of removing certain channels at layer k and filtering out the corresponding entire columns or rows of the weight matrix in layer k+1. This pruning strategy directly reduces the network width and preserves the regular shape of the weight matrix. As a result, it is hardware-friendly and can leverage high hardware parallelism to accelerate inference on GPU/CPU implementations. However, the pruned structure usually suffers a noticeable accuracy loss due to the coarse-grained removal of entire filters/channels.30) Therefore, compared to the other pruning algorithms mentioned above, fine-tuning is more important for structured pruning to compensate for accuracy.31)
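For concreteness, the sketch below illustrates how structured channel pruning propagates through a pair of stacked convolution layers in PyTorch: dropping output channels of layer k also drops the matching input channels of layer k+1. It is a minimal illustration under our own assumptions (the function name is ours, and the convolutions are assumed to carry no bias, as is usual when followed by BN), not any particular published implementation.

```python
import torch
import torch.nn as nn

def prune_conv_pair(conv_k, bn_k, conv_next, keep_idx):
    """Keep only the output channels of conv_k listed in keep_idx, shrink the
    following BN layer accordingly, and drop the matching input channels of
    conv_next (the column/row removal in layer k+1 described above)."""
    new_k = nn.Conv2d(conv_k.in_channels, len(keep_idx), conv_k.kernel_size,
                      stride=conv_k.stride, padding=conv_k.padding, bias=False)
    new_k.weight.data = conv_k.weight.data[keep_idx].clone()

    new_bn = nn.BatchNorm2d(len(keep_idx))
    new_bn.weight.data = bn_k.weight.data[keep_idx].clone()
    new_bn.bias.data = bn_k.bias.data[keep_idx].clone()
    new_bn.running_mean.data = bn_k.running_mean.data[keep_idx].clone()
    new_bn.running_var.data = bn_k.running_var.data[keep_idx].clone()

    new_next = nn.Conv2d(len(keep_idx), conv_next.out_channels, conv_next.kernel_size,
                         stride=conv_next.stride, padding=conv_next.padding, bias=False)
    new_next.weight.data = conv_next.weight.data[:, keep_idx].clone()
    return new_k, new_bn, new_next
```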
2.3. Motivation
As mentioned above, most state-of-the-art object detectors either use large-scale CNNs to gain accuracy at the cost of speed, or use lightweight CNNs to gain speed at the cost of accuracy. Therefore, it is difficult to meet the stringent requirements of accuracy and latency simultaneously for general industrial applications on general-purpose devices. Structured pruning seems to be a desirable choice because it can better utilize hardware parallelism to achieve higher inference speed. Nevertheless, the typical three-stage pipeline is a tedious and time-consuming weight optimization procedure for object detection because it has to traverse normal training, training with sparsity, and fine-tuning. To address the key issues mentioned above, three main objectives are proposed:
Objective 1: A defect detector that can get a high detection accuracy is required.
Objective 2: A concise yet efficient end-to-end model compression method is needed.
Objective 3: A new training scheme is needed to mitigate the accuracy loss.
To achieve Objective 1, an anchor-free one-stage detector, RDDPA, which differs from traditional methods based on manually designed anchors, is proposed to meet the high requirements of speed and accuracy simultaneously.
The overall architecture of RDDPA is shown in Fig. 2, which mainly includes three parts: Backbone, Neck, and Head. First, the input defect image is convolved multiple times by the Backbone to extract features. Second, in order to obtain a high-resolution heatmap with stronger features, the final feature map extracted by the Backbone undergoes deformable convolution and up-sampling in the Neck. In addition, the three feature maps of different scales obtained after up-sampling are fused with the corresponding-scale feature maps extracted by the Backbone, so that the network can acquire more accurate location information and semantic understanding, as illustrated by the blue arrows in Fig. 2. Finally, the fused feature map is forwarded to the detection head for center-point positioning and size regression, and the category and location of the defect are output.
Center-point localization is primarily utilized to predict the location and classification of object defects. To locate the center-point, a Gaussian kernel is adopted to generate the ground-truth heatmap so that the network produces higher activation near the object center, while pixel values far from the center decay toward 0. Subsequently, the offset of the object center is predicted to recover the discretization error caused by the output stride, thereby completing the positioning of the object center-point. Specifically, given an input image I ∈ R^{W×H×3}, the network predicts a heatmap Ŷ ∈ [0, 1]^{(W/R)×(H/R)×C}, where R is the output stride and C is the number of defect categories, and the heatmap is trained with a penalty-reduced pixel-wise focal loss:

Lhm = −(1/N) Σ_{x,y,c} { (1 − Ŷ_xyc)^α log(Ŷ_xyc),               if Y_xyc = 1
                         (1 − Y_xyc)^β (Ŷ_xyc)^α log(1 − Ŷ_xyc),  otherwise        (1)

where N is the number of defect instances in the image, Y is the ground-truth heatmap generated by the Gaussian kernel, Ŷ is the predicted heatmap, and α and β are the hyper-parameters of the focal loss.
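As a concrete illustration of this heatmap formulation, the PyTorch sketch below generates a Gaussian ground-truth peak and evaluates a penalty-reduced focal loss of the form of Eq. (1). It is a minimal sketch under our own assumptions (a per-object sigma supplied by the caller, and α = 2, β = 4 as default hyper-parameters), not the authors' released code.

```python
import torch

def draw_gaussian(heatmap, cx, cy, sigma):
    """Splat a 2-D Gaussian peak centred at heatmap location (cx, cy);
    values far from the centre decay toward 0, as described above."""
    h, w = heatmap.shape
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    heatmap.copy_(torch.max(heatmap, g))   # keep the maximum when nearby peaks overlap
    return heatmap

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss over the centre heatmap (Eq. (1) form).
    pred, gt: tensors of shape (C, H, W) with values in [0, 1]."""
    pos = gt.eq(1.0).float()
    neg = 1.0 - pos
    pos_loss = ((1.0 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = ((1.0 - gt) ** beta) * (pred ** alpha) * torch.log(1.0 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)     # N: number of annotated defect centres
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```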
To recover the discretization error caused by the pixel position due to the output stride, an offset will be added for each center point:24)
Loff = (1/N) Σ_p SmoothL1( Ô_p̃ − (p/R − p̃) )        (2)

where the smooth L1 loss function is adopted for training, Ô denotes the predicted offset map, p is the ground-truth center point of a defect instance, p̃ = ⌊p/R⌋ is its position on the low-resolution heatmap, and R is the output stride.
Size regression is mainly used to calculate the position of the detection box. RDDPA directly predicts the height and width of the object, and then computes the bounding-box position according to the located center-point of the defect. For the annotation box of object k with coordinates (x1(k), y1(k), x2(k), y2(k)), RDDPA also outputs a size feature map of size (W/R)×(H/R)×2 and regresses the object size s_k = (x2(k) − x1(k), y2(k) − y1(k)) at the center point:

Lsize = (1/N) Σ_k | Ŝ_pk − s_k |        (3)

where Ŝ_pk is the size predicted at the center point of object k.
The detection of instances is completed by minimizing the multi-task loss function which is defined as:
Ldet = Lhm + Loff + λLsize        (4)
where λ is used to balance the Lhm, Loff, and Lsize terms.
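The snippet below sketches how the two regression terms can be computed and combined with the heatmap loss into Eq. (4). The tensor layout (a binary mask marking ground-truth centre locations), the use of smooth L1 for the offset head and plain L1 for the size head, and λ = 0.1 (the value used later in Section 4) are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def regression_losses(pred_off, pred_size, gt_off, gt_size, centre_mask):
    """pred_off, pred_size, gt_off, gt_size: (B, 2, H, W); centre_mask: (B, 1, H, W)
    with 1 at ground-truth centre locations. Losses are averaged over the N centres."""
    num = centre_mask.sum().clamp(min=1.0)
    l_off = F.smooth_l1_loss(pred_off * centre_mask, gt_off * centre_mask,
                             reduction='sum') / num            # Eq. (2) form
    l_size = F.l1_loss(pred_size * centre_mask, gt_size * centre_mask,
                       reduction='sum') / num                  # Eq. (3) form
    return l_off, l_size

def detection_loss(l_hm, l_off, l_size, lam=0.1):
    """Multi-task objective of Eq. (4); lam plays the role of λ."""
    return l_hm + l_off + lam * l_size
```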
In the inference phase, a 3×3 max-pooling operation is utilized to select the locations with the highest scores within their 8-neighborhoods as the center points of instances in each image. Suppose the predicted center point, offset, and size of instance k are (x̂_k, ŷ_k), (δx̂_k, δŷ_k), and (ŵ_k, ĥ_k), respectively; the corresponding bounding box is then obtained as

( x̂_k + δx̂_k − ŵ_k/2,  ŷ_k + δŷ_k − ĥ_k/2,  x̂_k + δx̂_k + ŵ_k/2,  ŷ_k + δŷ_k + ĥ_k/2 )        (5)

where k represents a predicted defect instance. Additionally, non-maximum suppression (NMS), which filters out overlapping boxes, is adopted in this work.
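A minimal decoding sketch of this inference step is given below. The top-k selection, the stride value, and the assumption that offsets and sizes are predicted in heatmap units are ours and may differ from the actual implementation.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, offset, size, top_k=100, stride=4):
    """heatmap: (C, H, W) after sigmoid; offset, size: (2, H, W).
    Returns per-detection scores, classes, and boxes in input-image pixels (Eq. (5) form)."""
    # 3x3 max-pooling keeps only locations that are the maximum of their 8-neighbourhood.
    peaks = F.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    heatmap = heatmap * (peaks == heatmap).float()

    C, H, W = heatmap.shape
    scores, inds = heatmap.reshape(-1).topk(top_k)
    classes = torch.div(inds, H * W, rounding_mode='floor')
    ys = torch.div(inds % (H * W), W, rounding_mode='floor').float()
    xs = (inds % W).float()

    cx = xs + offset[0, ys.long(), xs.long()]      # centre x plus predicted offset
    cy = ys + offset[1, ys.long(), xs.long()]
    w = size[0, ys.long(), xs.long()]
    h = size[1, ys.long(), xs.long()]
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1) * stride
    return scores, classes, boxes
```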
3.2. Pruning the RDDPA from Scratch

3.2.1. End-to-end Pruning
To achieve Objective 2, a typical structured channel pruning algorithm is utilized, in which the importance of each channel is identified according to the scaling factor of the network's pre-trained weights in the BN layer. The channels with smaller scaling factors, which cannot generate effective output activations, are pruned, and the remaining connections constitute the detection backbone network, as shown in Fig. 3. Specifically, our end-to-end pruning pipeline can be summarized as a 4-step algorithm, as shown in Table 1.
Process | 4-Step End-to-end Pruning Algorithm |
---|---|
Input | Structure of Tiny-Darknet Ma, and its weights Wa pre-trained on ImageNet. |
Step 1 | Scaling factors in the BN layers of Wa are sorted in descending order of their L1 magnitude. |
Step 2 | Set a global pruning threshold which is defined as the value in the p-percent index of the total scaling factors. |
Step 3 | Prune channels below the threshold value, obtaining the network structure Mb and weights Wb. |
Step 4 | Perform channel regularization on Mb and Wb, obtaining structure Mc and weights Wc. |
Output | Combine Mc and Wc as the final model for training. |
In Step 4, channel regularization means that the number of channels after pruning is adjusted to a multiple of 16 to facilitate binary operations. Although this operation sacrifices a small amount of pruning rate, it is hardware-friendly and can achieve high hardware parallelism to facilitate acceleration. For example, a convolution layer with a channel number of ‘13’ or ‘15’ wastes part of the hardware memory in the inference phase. Besides, only the Tiny-Darknet backbone is pruned, and the channels in the detector neck are then adaptively adjusted according to the number of channels in the pruned backbone layers, which addresses the issue of channel mismatch across shortcuts.
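The following sketch mirrors the four steps of Table 1 in PyTorch, including the channel regularization that rounds each layer's surviving channel count to a multiple of 16. The helper name and the rounding-up choice are illustrative assumptions, and rebuilding the actual conv/BN layers from the resulting plan is omitted for brevity.

```python
import torch
import torch.nn as nn

def channel_keep_plan(model, prune_rate, round_to=16):
    """Steps 1-3 of Table 1 plus the channel regularization of Step 4:
    rank all BN scaling factors globally, threshold at the p-percent index,
    then round each layer's surviving channel count to a multiple of `round_to`."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = gammas.sort().values[int(prune_rate * (gammas.numel() - 1))]

    plan = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            keep = int((m.weight.data.abs() > threshold).sum().item())
            keep = max(round_to, -(-keep // round_to) * round_to)   # round up to a multiple of 16
            keep = min(keep, m.weight.numel())
            # keep the channels with the largest scaling factors
            plan[name] = torch.argsort(m.weight.data.abs(), descending=True)[:keep]
    return plan
```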
3.2.2. Our Solution to Mitigate Accuracy Loss
Although end-to-end training of the pruned object detection network from scratch has been achieved, the resulting sparse network architecture is difficult to train. Previous work has shown that the sparser the model, the slower the learning and the lower the test accuracy.26) Traditional pruning methods usually adopt a tedious and time-consuming three-stage weight optimization pipeline to overcome these problems, but seldom consider whether the pruned model is suited to the original key hyperparameters, such as the learning rate, the learning rate decay schedule, and the number of training epochs. Previous works have only found that the pruned network can obtain performance compensation as long as the number of training epochs is increased within a reasonable range, so they simply double the number of base training epochs.13,16) To address the issues mentioned above and achieve Objective 3, this work proposes a linear strategy (LS) for accuracy compensation, which indicates that sparser models require a larger learning rate, a longer learning rate decay schedule, and more training epochs to converge. The formula is expressed as follows:
zout = zin / (1 − p)        (6)
where zin represents the input, such as the initial learning rate, p is the global pruning rate, and zout indicates the output.
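A short sketch of how the LS of Eq. (6) can be applied to the base training schedule is shown below. The pruning rate of 0.42 is a hypothetical example value, and the base settings are the ones reported for the unpruned model in Section 4.

```python
def linear_strategy(z_in, p):
    """LS of Eq. (6): scale a base hyper-parameter by 1 / (1 - p) for pruning rate p."""
    return z_in / (1.0 - p)

# Hypothetical example: scaling the unpruned model's schedule for a pruned model.
base_lr, base_epochs, base_milestones = 2.5e-4, 230, (160, 210)
p = 0.42                                    # assumed global channel pruning rate
lr = linear_strategy(base_lr, p)            # larger initial learning rate
epochs = round(linear_strategy(base_epochs, p))
milestones = [round(linear_strategy(m, p)) for m in base_milestones]
```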
To evaluate the effectiveness of our proposed pruning pipeline and RDDPA framework, extensive experiments were conducted on the NEU-DET dataset.33) Experimental results show that our pruning pipeline is more concise and efficient than other representative pruning algorithms, and our proposed RDDPA framework can achieve real-time detection on general-purpose GPUs with comparable or even higher accuracy compared with the state-of-the-art methods.
4.1. NEU-DET Database
The NEU-DET database is a benchmark dataset which contains six kinds of typical surface defects of hot-rolled steel strip as shown in Fig. 4, i.e., crazing (Cr), inclusion (In), patches (Pa), pitted surface (PS), rolled-in scale (RS), and scratches (Sc).33) Each category contains 300 grayscale images; 70% of them are randomly chosen for training, and the rest are used for evaluation in this work.
Our models are trained on a single high-end Tesla V100 GPU and tested on a mid-end Titan X GPU and a low-end GTX 960M GPU, using PyTorch 1.3, CUDA 10.1, and cuDNN v9. Herein, the input size of the network is set to 384×384, the batch size is 32, and the loss scalar λ for Lsize is 0.1. In the training phase, random flipping, scaling, and shifting are applied for data augmentation. For the unpruned model, the Adam optimizer with an initial learning rate of 2.5×10−4 is adopted; the learning rate is decayed by a factor of 10 at the 160th and 210th epochs, and the total number of epochs is 230. In the inference phase, a 1.65× input scale with the original resolution kept, and flip testing are applied throughout the experiments.
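A minimal training-loop sketch matching these settings is given below; build_rddpa and train_loader are hypothetical placeholders, and the convention that the model returns its combined loss is an assumption rather than the authors' actual interface.

```python
import torch

model = build_rddpa()                      # hypothetical constructor for the detector
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[160, 210], gamma=0.1)

for epoch in range(230):
    for images, targets in train_loader:   # 384x384 inputs, batch size 32, flip/scale/shift aug
        loss = model(images, targets)      # assumed to return Lhm + Loff + 0.1 * Lsize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```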
4.3. Evaluation of Our End-to-end Pruning Algorithm
The performance of our proposed end-to-end pruning algorithm is evaluated by the area under the precision-recall curve (mAP), inference speed, and model complexity metrics, i.e., floating-point operations (FLOPs), multiply-add operations (Madd), memory usage (Memory), storage space (Storage), and frames per second (FPS). For the pruned models, both the performance tested on a low-end GTX 960M GPU and the performance gain (+) or loss (−) relative to the full model are shown at different compression ratios.
Obviously, the performance of the model varies with the compression ratio. Figure 5 shows that a higher compression ratio can better leverage hardware parallelism to accelerate inference regardless of the GPU used. However, it can be seen from Figs. 5(a) and 5(b) that when the weight compression ratio exceeds 10.8×, the accuracy loss increases sharply for only a slight increase in speed. Even on the low-end GTX 960M GPU, a higher ratio still yields better acceleration but loses too much detection accuracy. Therefore, the model with the 10.8× compression ratio is considered the best, because it achieves accuracy similar to that of the unpruned model while maintaining a high inference speed.
Table 2 shows that, compared with the full model, i.e., RDDPA 1.0×, our compression ratio can reach 3× without any accuracy loss; the weights, FLOPs, Madd, memory usage, and storage space of the model are reduced to 2.2 M, 4.4 G, 8.6 G, 152.5 MB, and 25.5 MB respectively, and the inference speed is increased by 5.4 FPS on a single low-end GTX 960M GPU. The inference speed of our 4.3× compression ratio model is 30.2 FPS with only a 0.2% mAP loss, which satisfies real-time detection on a single low-end GPU. Moreover, our 10.8× compression ratio model achieves 79.2% mAP at 40.1 FPS with only a 0.7% mAP loss and an 18.5 FPS increase in speed, and the model contains only 0.6 M weights and requires 96.8 MB memory usage and 7.1 MB storage space. Therefore, the resource occupation is greatly reduced, which makes it usable for applications in which real-time detection is needed.
Model | Params | FLOPs | Madd | Memory | Storage | FPS | mAP (%) |
---|---|---|---|---|---|---|---|
RDDPA 1.0× | 6.7 M | 6.2 G | 12.3 G | 160.5 MB | 76.2 MB | 21.6 | 79.9 |
RDDPA 3.0× | 2.2 M | 4.4 G | 8.6 G | 152.5 MB | 25.5 MB | 27.0/+5.4 | 79.9/+0.0 |
RDDPA 4.3× | 1.6 M | 3.5 G | 6.9 G | 119.4 MB | 17.9 MB | 30.2/+8.6 | 79.7/−0.2 |
RDDPA 10.8× | 0.6 M | 1.8 G | 3.6 G | 96.8 MB | 7.1 MB | 40.1/+18.5 | 79.2/−0.7 |
In what follows, our end-to-end pruning approach is compared with other representative structured pruning methods at the same compression ratio, including network slimming (NS)30) and rethinking the value of network pruning (Rethink).13) NS imposes L1 sparsity on the channel scaling factors during sparse training, then prunes the channels with smaller scaling factors, and finally mitigates the accuracy loss through fine-tuning. In this work, we follow the NS method to sparsely train for 230 epochs on the NEU-DET dataset and fine-tune the pruned model for 150 epochs with a learning rate of 2.5×10−4. Rethink indicates that fine-tuning the pruned model with inherited weights is not better than training it from scratch.13) Therefore, the pre-trained weights that achieve 79.9% mAP on the NEU-DET dataset are pruned directly, and two training strategies from the Rethink method, namely Scratch-E and Scratch-B, are adopted in this work. Scratch-E denotes training the pruned models with the same parameter settings, while Scratch-B denotes doubling the number of training epochs. The same pruning strategy is adopted for both of these methods, i.e., the layers associated with shortcuts are not pruned. Meanwhile, in order to maintain the accuracy of the model while reducing the number of parameters, NS and Rethink can only compress the number of parameters by a factor of three. Thus, for the sake of fairness, a compression ratio of 3.0× is also used to assess the effectiveness of our proposed model.
Table 3 shows that, compared to the NS and Rethink methods, our method achieves 79.9% mAP at the 3× compression ratio with a relatively small training budget (395 epochs), obtaining the same accuracy as the unpruned full model. Although NS requires only 380 epochs, its detection accuracy is 3.8% lower than that of our method, and additional sparse training and fine-tuning stages are required. Moreover, our method implements end-to-end pruning in a real sense for object detection, without requiring pre-trained weights, sparse training, or fine-tuning on the NEU-DET dataset.
Approach | Compression Ratio | Pretrained (ImageNet) | Pretrained (NEU‑DET) | Sparsity | Fine‑tuned | Epochs | mAP (%) | ΔAcc (%) |
---|---|---|---|---|---|---|---|---|
NS | 3.0× | ✓ | ✗ | ✓ | ✓ | 380 | 76.1 | −3.8 |
Rethink | 3.0× | ✓ | ✓ | ✗ | ✗ | 460 | 77.1 | −2.8 |
Rethink* | 3.0× | ✓ | ✓ | ✗ | ✗ | 690 | 77.7 | −2.2 |
Ours | 3.0× | ✓ | ✗ | ✗ | ✗ | 395 | 79.9 | +0.0 |
Note that Rethink and Rethink* correspond to the Scratch-E and Scratch-B training strategies, respectively.
To verify the effectiveness of our method, several representative two-stage object detectors, e.g., Faster R-CNN,7) Cascade R-CNN,34) and DDN,21) and one-stage object detectors, e.g., ATSS,23) FCOS,9) SSD,18) CenterNet,24) DEA_RetinaNet,8) and RDN36) have been compared with our proposed RDDPA framework on the same GPUs.
As shown in Table 4, RDDPA 10.8× achieves 79.2% mAP at 103.7 FPS on the Titan X GPU and 40.1 FPS on the GTX 960M GPU with only 0.6 M weights and 1.8 GFLOPs, which is a higher mAP than the representative two-stage detectors Cascade R-CNN and Faster R-CNN but lower than DDN-ResNet50. However, the inference speed of our RDDPA on a single low-end or mid-end GPU is much faster than Cascade R-CNN, Faster R-CNN, and DDN. On the mid-end Titan X GPU, RDDPA runs nearly 10× faster than Cascade R-CNN and DDN-ResNet50, and about 5× faster than Faster R-CNN and DDN-ResNet34. On the low-end GTX 960M GPU, RDDPA runs 22× faster than DDN-ResNet50 with comparable performance. Compared with the well-known one-stage detectors, RDDPA outperforms all of them in accuracy, and its inference speed is much faster than ATSS, FCOS, SSD, CenterNet, and RDN (about 6.7×, 5×, 3.2×, 2.5×, and 1.6× speedup on the Titan X GPU and 18.2×, 12.5×, 9.1×, 5.2×, and 2.5× speedup on the GTX 960M GPU, respectively).
Approach | Backbone | Params | FLOPs | FPS* | FPS | mAP (%) |
---|---|---|---|---|---|---|
two-stage: | ||||||
Cascade R-CNN | ResNet50 | 68.9 M | 350.1 G | 11.5 | 2.2 | 74.3 |
Faster R-CNN | ResNet50 | 41.2 M | 322.3 G | 20.7 | 4.2 | 77.5 |
DDN | ResNet34 | 28.2 M | – | 17.1 | 3.3 | 74.8 |
DDN | ResNet50 | 97.0 M | – | 11.0 | 1.8 | 82.3 |
one-stage: | ||||||
ATSS | ResNet50 | 32.1 M | – | 20.0 | 3.0 | 63.4 |
ATSS | ResNet101 | 50.0 M | – | 15.4 | 2.2 | 67.8 |
FCOS | ResNet50 | 31.9 M | 315.0 G | 20.0 | 3.2 | 71.3 |
SSD300 | VGG16 | 24.4 M | 30.7 G | 32.0 | 4.4 | 74.8 |
CenterNet | DLA34 | 19.7 M | 17.8 G | 41.5 | 7.7 | 77.1 |
DEA_RetinaNet | ResNet50 | 42.2 M | 105.3 G | – | – | 79.1 |
RDN | ResNet18-dsf | 24.0 M | 6.89 G | 64.0 | 15.9 | 80.0 |
Ours: | ||||||
RDDPA 1.0× | Tiny-Darknet | 6.7 M | 6.2 G | 85.6 | 21.6 | 79.9 |
RDDPA 3.0× | Tiny-Darknet | 2.2 M | 4.4 G | 92.9 | 27.0 | 79.9 |
RDDPA 4.3× | Tiny-Darknet | 1.6 M | 3.5 G | 98.0 | 30.2 | 79.7 |
RDDPA 10.8× | Tiny-Darknet | 0.6 M | 1.8 G | 103.7 | 40.1 | 79.2 |
Note that FPS* indicates frames per second tested on a mid-end Titan X GPU while FPS is tested on a low-end GTX 960M GPU.
Moreover, it can be seen from Table 4 and Fig. 6 that only a small number of one-stage detectors, i.e., FCOS, SSD, CenterNet, and RDN, can perform real-time detection on a mid-end GPU. However, the detection accuracy of our RDDPA is much higher than that of CenterNet, SSD, and FCOS. Compared with RDN, our proposed RDDPA 1.0× reduces the Params and FLOPs by 17.3 M and 0.69 G, respectively, while incurring only a minimal 0.1% loss in mAP. In contrast to RDDPA 10.8×, although RDN demonstrates a 0.8% improvement in mAP, its Params are 40 times larger and its FLOPs are 5.09 G higher. On a low-end GPU, only our RDDPA 4.3× and RDDPA 10.8× can realize real-time defect detection with high detection performance on the NEU-DET dataset. In other words, the huge inspection cost has been effectively reduced in this work.
The traditional CNN-based pruning pipeline contains three training stages. It is generally deemed that training a sparse model on the benchmark dataset is particularly important for pruning because it provides highly representative weights inherited from the original network, as in network slimming30) and AutoPruner.35) In what follows, this work shows that the effect of sparsely trained weights on the NEU-DET dataset is quite different from what is commonly assumed in the conventional network pruning pipeline. In light of this surprising observation, a novel pruning pipeline is presented in which the pruned structure can be obtained directly from the weights pre-trained on ImageNet, so that end-to-end pruning of the detection network can be achieved. Specifically, the weights at different sparse training phases (at an interval of 30 epochs) are saved and then used as the initial weights of the network to explore whether the sparse training procedure is crucial to the final pruned structure.
From Fig. 7, it can be seen that whether the pruned weights are pre-trained on ImageNet or obtained after sparse training, the resulting network structures are nearly identical regardless of the pruning rate. Therefore, two conclusions can be drawn from this phenomenon. First, the weights pre-trained on the large-scale ImageNet benchmark, together with their sparsity, are more representative and have strong generalization ability. Second, sparse training on the NEU-DET dataset has no effect on the final pruned network structure, whereas the weights pre-trained on ImageNet directly determine it. Therefore, sparse training is removed in our work, and this pruning approach brings three benefits. First, high channel compression ratios can be achieved without considering the channel mismatch in shortcuts between the backbone network and the neck caused by pruning, an issue that is seldom addressed in previous channel pruning solutions. Second, the task of pruning the detection network is simplified to pruning the classification network, which is conducive to the realization of end-to-end pruning, i.e., training the pruned detection network from scratch and getting rid of the full pruning pipeline. Third, this approach does not introduce additional parameters or indexes, so it greatly facilitates training and inference acceleration.
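To make this comparison concrete, one simple way to quantify how similar two pruned structures are is to measure the overlap of their kept-channel sets, as sketched below. The plan format follows the channel_keep_plan sketch given earlier and is our own illustration, not necessarily the procedure used to produce Fig. 7.

```python
def structure_overlap(plan_a, plan_b):
    """Mean Jaccard overlap between the kept-channel sets of two pruning plans,
    e.g., one derived from ImageNet pre-trained weights and one from weights
    after sparse training on NEU-DET."""
    ratios = []
    for name in plan_a:
        a = set(plan_a[name].tolist())
        b = set(plan_b[name].tolist())
        ratios.append(len(a & b) / max(len(a | b), 1))
    return sum(ratios) / max(len(ratios), 1)
```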
Extensive experiments are also conducted to explore the relationship between the learning rate and the pruning rate. Specifically, the learning rate under a specific pruning rate is gradually increased to visualize the convergence of the training loss. It can be seen from Fig. 8(a) that as the learning rate increases, the benefit brought by loss convergence first increases gradually and then decreases sharply. Moreover, as the pruning rate increases (see Figs. 8(b)–8(c)), the benefit from increasing the learning rate turns into a sharp increase followed by a gradual decrease. This phenomenon indicates that models with higher pruning rates require a larger learning rate to converge well. For example, the pruning rates of 0.34, 0.43, and 0.55 correspond to best learning rate magnifications of 1.5×, 1.6×, and 2.2×, respectively. In Fig. 8(d), the potential relationship between the learning rate and the pruning rate is regarded as a 2D point (p, n), where p is the pruning rate and n is the magnification of the initial learning rate, and the three best values under each pruning rate have been chosen in this work. As shown in Fig. 8(d), as the pruning rate increases, the corresponding optimal learning rate also increases. Therefore, a curve is fitted to approximate this distribution, which is defined as LS in this work to obtain accuracy compensation.
In this work, it is found in practice that pruned models with higher compression ratios need a longer training schedule to converge well. Therefore, Eq. (6) is also adopted to obtain the corresponding number of training epochs, which amounts to a computation budget similar to that of full-model training. Empirically, this training scheme is essential to improve the performance of the pruned models. As shown in Fig. 9 (right), the loss curve of training with LS is closer to that of the full network, which effectively compensates for the accuracy loss caused by pruning.
Table 5 also demonstrates that the benefits of the LS increase as the pruning rate increases. For example, the performance gain it brings is 0.9% mAP at a 3× compression ratio and 2.4% mAP at a 10.8× compression ratio. Therefore, LS is critical for the pruned models with a large pruning rate to achieve similar performance to the full network in this work.
Compression Ratio | mAP (w/o LS) | mAP (w/ LS) | Δ |
---|---|---|---|
1.0× | 79.9% | 79.9% | +0.0% |
3.0× | 79.0% | 79.9% | +0.9% |
4.3× | 78.1% | 79.7% | +1.6% |
10.8× | 76.8% | 79.2% | +2.4% |
This work proposes RDDPA, a real-time defect detection framework that can be deployed on a single general-purpose device via a novel and straightforward pruning scheme, which addresses the issues of the traditional three-stage pruning pipeline and realizes end-to-end pruning in a real sense. Extensive experiments have demonstrated that our method is concise and efficient, and can realize real-time defect detection on a single low-end GPU with high accuracy by removing a huge number of redundant weights.
Although our models have generally achieved good results on the NEU-DET dataset, the mAP of the best model only reaches 79.9%, so there is still much room for improvement. As future work, one direction is to improve the detection accuracy using data augmentation technology or stronger backbone networks. Another direction is to perform the defect segmentation task, which can produce more precise defect boundaries. Additionally, extending RDDPA to other industrial application scenarios would be a promising research topic.
This work was supported by the National Natural Science Foundation of China (Nos. 62172004, 62072002, and 61872004), the Anhui Province Collaborative Innovation Project (Nos. GXXT-2022-050 and GXXT-2022-053), and the Educational Commission of Anhui Province (No. 2022AH050336).