2024, Vol. 64, No. 6, pp. 1019–1028
Real-time object detectors deployed on general-purpose graphics processing units (GPUs) or embedded devices allow their mass usage in industrial applications at an affordable cost. However, it is difficult for existing state-of-the-art object detectors to meet the requirements of high accuracy and low inference latency simultaneously on general-purpose devices in industrial applications. In this work, we propose RDDPA, a fast and accurate defect detection framework. RDDPA adopts a novel end-to-end pruning scheme, which can prune the detection network from scratch and achieve real-time detection on general-purpose devices. Additionally, we have developed a new training scheme to minimize the accuracy loss associated with the pruning process. Experimental results on a standard steel surface defect dataset indicate that our model achieves 79.2% mAP (mean Average Precision) at 103.7 FPS (Frames Per Second) on a single mid-end Titan X GPU as well as 40.1 FPS on a single low-end GTX 960M GPU, and outperforms the state-of-the-art defect detectors by about a 20× speedup with comparable or higher accuracy.
Currently, there is a plethora of outstanding research in the field of hot-rolled strip surface defect recognition, with steadily increasing accuracy.1,2,3) However, a simple classification algorithm can neither locate a defect precisely nor judge its size, which is not conducive to the statistical analysis of defects in the factory. Object detection, one of the major tasks in computer vision and pattern recognition, has been widely applied in industry with the great breakthrough of convolutional neural networks (CNNs), especially for automated defect inspection (ADI).4,5,6) Current state-of-the-art object detectors have achieved high accuracy on benchmark datasets when large-scale backbone networks are applied.7,8,9) However, detectors built on large networks usually incur high computational cost. For instance, EfficientDet-D7x has more than 300 billion floating-point operations (FLOPs) and 560 MB of storage space, so it is unaffordable to deploy such models on platforms with limited resources such as personal computers, embedded devices, and mobile devices.10) To address this issue, lightweight object detection frameworks such as YOLO-Lite and SSD-Lite have been developed to improve detection efficiency.11,12) However, reducing the detection cost in this way leads to a significant decrease in accuracy.
In recent years, model compression techniques, especially the weight pruning method, have been widely considered one of the most effective ways to reduce computational cost, memory footprint, and storage intensity without sacrificing too much accuracy.13,14,15) By removing a huge number of redundant weights, a model with a smaller scale and lower energy consumption can be efficiently generated. However, traditional model pruning algorithms mostly serve classification tasks and generally consist of a three-stage pipeline, i.e., training with sparsity, pruning, and fine-tuning. This strategy typically involves a cumbersome and time-consuming weight optimization procedure, especially for object detection.16)
In this work, a new one-stage detector is proposed, which provides an optimal trade-off between detection accuracy and speed on the detection task. Specifically, we explore the potential relationships between the pruning structure and the weights, the pruning rate, and the training scheme, respectively, and show that the pruning structure of the object detection network can be directly obtained from the weights pre-trained on ImageNet.17) Furthermore, a straightforward and effective pruning scheme is designed to overcome the key challenge of traditional channel pruning, namely its cumbersome and time-consuming three-stage weight optimization process. To the best of our knowledge, this is the first time that real end-to-end pruning has been achieved for an object detection network, and the pruned network is easier to deploy on resource-constrained platforms in terms of model size, runtime memory, and computational operations. In summary, the main contributions of this work are as follows:
1) A new one-stage defect detector has been designed, which achieves high accuracy on the defect detection task.
2) A concise and efficient end-to-end model pruning pipeline has been proposed, which can greatly compress network parameters for real-time detection.
3) Our proposed model can achieve high detection accuracy and speed simultaneously with only 0.6 M weights and 7.1 MB storage footprint.
Generally, CNN-based object detectors can be roughly divided into two categories: two-stage and one-stage detectors.7,18,19)
2.1.1. Two-stage Object Detectors
The two-stage object detector utilizes a region proposal network (RPN) to extract regions of interest (ROIs), and then uses an R-CNN head to perform fine-grained classification and bounding-box regression on each object. Faster R-CNN is the first anchor-based object detector of this kind and achieves high performance on challenging public benchmark datasets.7) Hao et al. introduced deformable convolution and a balanced feature pyramid structure into Faster R-CNN and reported 80.5% mAP on the NEU-DET dataset.20) He et al. added a multilevel-feature fusion network (MFN) module and realized a high detection accuracy of 82.3% mAP on the same dataset.21) Although high accuracy was achieved, the two-stage detection procedure also brought high computational cost and inference latency.
2.1.2. One-stage Object Detectors
A one-stage object detector eliminates the RPN and directly regresses all candidate anchor boxes, e.g., SSD, YOLO, RetinaNet, and ATSS.18,19,22,23) Therefore, these methods can achieve a better trade-off between detection accuracy and speed. Cheng et al. employed RetinaNet with different channel attention and adaptive spatial feature fusion modules on the NEU-DET dataset, and reported 79.1% mAP at 12.3 FPS.8) Recently, the anchor-free strategy has received extensive attention because it eliminates the anchor mechanism in one-stage detectors and further accelerates inference, e.g., FCOS, CenterNet, and TTFNet.9,24,25) However, most one-stage object detectors are largely designed for the recommended high-end hardware systems and suffer from high inference latency on general-purpose GPUs. To address this issue, lightweight one-stage object detection networks such as YOLO-Lite and SSD-Lite have been proposed because they can be easily deployed on low-end GPUs and even mobile devices.11,12) Nevertheless, these works typically come at the expense of detection accuracy.
2.2. CNN-based Model Pruning
Most CNN-based model pruning algorithms can be divided into three categories, i.e., unstructured, pattern-based, and structured pruning.
2.2.1. Unstructured Pruning
The unstructured pruning method prunes weights at any position in the 2-D weight matrix, as shown in Fig. 1(a). These algorithms can reduce the amount of actual computation, and a high pruning rate can be achieved due to their high flexibility.26) However, extra indexes must be introduced in the convolution kernel to record the positions of non-zero weights, which reduces the inference speed in GPU/CPU implementations and leads to irregularities in the 2-D weight matrix, resulting in poor hardware parallelism.27,28)
2.2.2. Pattern-based Pruning
The pattern-based pruning approach analyzes the locations of important weights in each original 3×3 convolution kernel and assigns specific patterns from a predefined fixed-size library, as illustrated in Fig. 1(b). Additionally, the connection pruning strategy removes entire kernels and the corresponding connections to obtain a higher global compression ratio. Nevertheless, pattern-based pruning algorithms also suffer from some challenges, which greatly limit their application in certain scenarios. Compiler support is necessary to achieve high inference acceleration because the irregularity in the 2-D weight matrix is not completely eliminated.29) Furthermore, the design is specifically tailored to 3×3 convolution kernels, which limits its applicability to other network layers.28,29)
2.2.3. Structured Pruning
The structured pruning method has been widely adopted as an effective model compression technique in recent years because of its ease of deployment on general-purpose GPUs/CPUs. Figure 1(c) illustrates the process of removing certain channels at layer k and filtering out the corresponding entire columns or rows of the weight matrix in layer k+1. This pruning strategy directly reduces the network width and preserves the regular shape of the weight matrix. As a result, it is hardware-friendly and can leverage high hardware parallelism to accelerate inference on GPU/CPU implementations. However, the pruned structure usually suffers a noticeable accuracy loss due to the coarse-grained removal of entire filters/channels.30) Therefore, compared to the other pruning algorithms mentioned above, fine-tuning is more important for structured pruning to compensate for accuracy.31)
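For concreteness, the sketch below illustrates how structured channel pruning propagates through a pair of stacked convolution layers in PyTorch: dropping output channels of layer k also drops the matching input channels of layer k+1. It is a minimal illustration under our own assumptions (the function name is ours, and the convolutions are assumed to carry no bias, as is usual when followed by BN), not any particular published implementation.

```python
import torch
import torch.nn as nn

def prune_conv_pair(conv_k, bn_k, conv_next, keep_idx):
    """Keep only the output channels of conv_k listed in keep_idx, shrink the
    following BN layer accordingly, and drop the matching input channels of
    conv_next (the column/row removal in layer k+1 described above)."""
    new_k = nn.Conv2d(conv_k.in_channels, len(keep_idx), conv_k.kernel_size,
                      stride=conv_k.stride, padding=conv_k.padding, bias=False)
    new_k.weight.data = conv_k.weight.data[keep_idx].clone()

    new_bn = nn.BatchNorm2d(len(keep_idx))
    new_bn.weight.data = bn_k.weight.data[keep_idx].clone()
    new_bn.bias.data = bn_k.bias.data[keep_idx].clone()
    new_bn.running_mean.data = bn_k.running_mean.data[keep_idx].clone()
    new_bn.running_var.data = bn_k.running_var.data[keep_idx].clone()

    new_next = nn.Conv2d(len(keep_idx), conv_next.out_channels, conv_next.kernel_size,
                         stride=conv_next.stride, padding=conv_next.padding, bias=False)
    new_next.weight.data = conv_next.weight.data[:, keep_idx].clone()
    return new_k, new_bn, new_next
```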
2.3. Motivation
As mentioned above, most state-of-the-art object detectors either use large-scale CNNs to gain accuracy at the cost of speed, or use lightweight CNNs to gain speed at the cost of accuracy. Therefore, it is difficult to meet the stringent requirements of accuracy and latency simultaneously for general industrial applications on general-purpose devices. Structured pruning seems to be a desirable choice because it can better utilize hardware parallelism to achieve higher inference speed. Nevertheless, the typical three-stage pipeline is a tedious and time-consuming weight optimization procedure for object detection because it has to traverse normal training, training with sparsity, and fine-tuning. To address the key issues mentioned above, three main objectives are proposed:
Objective 1: A defect detector that can get a high detection accuracy is required.
Objective 2: A concise yet efficient end-to-end model compression method is needed.
Objective 3: A new training scheme is needed to mitigate the accuracy loss.
To achieve Objective 1, an anchor-free one-stage detector, RDDPA, which differs from traditional methods based on manually designed anchors, is proposed to meet the high requirements of speed and accuracy simultaneously.
The overall architecture of RDDPA is shown in Fig. 2, which mainly includes three parts: Backbone, Neck, and Head. First, the input defect image is convolved multiple times by the Backbone to extract features. Second, in order to obtain a high-resolution heatmap with stronger features, the final feature map extracted by the Backbone undergoes deformable convolution and up-sampling in the Neck. In addition, the three feature maps of different scales obtained after up-sampling are fused with the corresponding-scale feature maps extracted by the Backbone, so that the network can acquire more accurate location information and semantic understanding, as illustrated by the blue arrows in Fig. 2. Finally, the fused feature map is forwarded to the detection head for center-point positioning and size regression, and the category and location of the defect are output.
Center-point localization is primarily utilized to predict the location and classification of object defects. To locate the center-point, a Gaussian kernel is adopted to generate the ground-truth heatmap so that the network produces higher activation near the object center, while pixel values far from the center decay toward 0. Subsequently, the offset of the object center is predicted to recover the discretization error caused by the output stride, thereby completing the positioning of the object center-point. Specifically, given an input image I ∈ R^{W×H×3}, the network predicts a heatmap Ŷ ∈ [0, 1]^{(W/R)×(H/R)×C}, where R is the output stride and C is the number of defect categories, and the heatmap is trained with a penalty-reduced pixel-wise focal loss:

Lhm = −(1/N) Σ_{x,y,c} { (1 − Ŷ_xyc)^α log(Ŷ_xyc),               if Y_xyc = 1
                         (1 − Y_xyc)^β (Ŷ_xyc)^α log(1 − Ŷ_xyc),  otherwise        (1)

where N is the number of defect instances in the image, Y is the ground-truth heatmap generated by the Gaussian kernel, Ŷ is the predicted heatmap, and α and β are the hyper-parameters of the focal loss.
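As a concrete illustration of this heatmap formulation, the PyTorch sketch below generates a Gaussian ground-truth peak and evaluates a penalty-reduced focal loss of the form of Eq. (1). It is a minimal sketch under our own assumptions (a per-object sigma supplied by the caller, and α = 2, β = 4 as default hyper-parameters), not the authors' released code.

```python
import torch

def draw_gaussian(heatmap, cx, cy, sigma):
    """Splat a 2-D Gaussian peak centred at heatmap location (cx, cy);
    values far from the centre decay toward 0, as described above."""
    h, w = heatmap.shape
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    heatmap.copy_(torch.max(heatmap, g))   # keep the maximum when nearby peaks overlap
    return heatmap

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss over the centre heatmap (Eq. (1) form).
    pred, gt: tensors of shape (C, H, W) with values in [0, 1]."""
    pos = gt.eq(1.0).float()
    neg = 1.0 - pos
    pos_loss = ((1.0 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = ((1.0 - gt) ** beta) * (pred ** alpha) * torch.log(1.0 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)     # N: number of annotated defect centres
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```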
To recover the discretization error caused by the pixel position due to the output stride, an offset will be added for each center point:24)
Loff = (1/N) Σ_p SmoothL1( Ô_p̃ − (p/R − p̃) )        (2)

where the smooth L1 loss function is adopted for training, Ô denotes the predicted offset map, p is the ground-truth center point of a defect instance, p̃ = ⌊p/R⌋ is its position on the low-resolution heatmap, and R is the output stride.
Size regression is mainly used to calculate the position of the detection box. RDDPA directly predicts the height and width of the object, and then computes the bounding-box position according to the located center-point of the defect. For the annotation box of object k with coordinates (x1(k), y1(k), x2(k), y2(k)), RDDPA also outputs a size feature map of size (W/R)×(H/R)×2 and regresses the object size s_k = (x2(k) − x1(k), y2(k) − y1(k)) at the center point:

Lsize = (1/N) Σ_k | Ŝ_pk − s_k |        (3)

where Ŝ_pk is the size predicted at the center point of object k.
The detection of instances is completed by minimizing the multi-task loss function which is defined as:
Ldet = Lhm + Loff + λLsize        (4)
where λ is used to balance the Lhm, Loff, and Lsize terms.
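The snippet below sketches how the two regression terms can be computed and combined with the heatmap loss into Eq. (4). The tensor layout (a binary mask marking ground-truth centre locations), the use of smooth L1 for the offset head and plain L1 for the size head, and λ = 0.1 (the value used later in Section 4) are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def regression_losses(pred_off, pred_size, gt_off, gt_size, centre_mask):
    """pred_off, pred_size, gt_off, gt_size: (B, 2, H, W); centre_mask: (B, 1, H, W)
    with 1 at ground-truth centre locations. Losses are averaged over the N centres."""
    num = centre_mask.sum().clamp(min=1.0)
    l_off = F.smooth_l1_loss(pred_off * centre_mask, gt_off * centre_mask,
                             reduction='sum') / num            # Eq. (2) form
    l_size = F.l1_loss(pred_size * centre_mask, gt_size * centre_mask,
                       reduction='sum') / num                  # Eq. (3) form
    return l_off, l_size

def detection_loss(l_hm, l_off, l_size, lam=0.1):
    """Multi-task objective of Eq. (4); lam plays the role of λ."""
    return l_hm + l_off + lam * l_size
```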
In the inference phase, a 3×3 max-pooling operation is utilized to select the locations with the highest scores within their 8-neighborhoods as the center points of instances in each image. Suppose the predicted center point, offset, and size of instance k are (x̂_k, ŷ_k), (δx̂_k, δŷ_k), and (ŵ_k, ĥ_k), respectively; the corresponding bounding box is then obtained as

( x̂_k + δx̂_k − ŵ_k/2,  ŷ_k + δŷ_k − ĥ_k/2,  x̂_k + δx̂_k + ŵ_k/2,  ŷ_k + δŷ_k + ĥ_k/2 )        (5)

where k represents a predicted defect instance. Additionally, non-maximum suppression (NMS), which filters out overlapping boxes, is adopted in this work.
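A minimal decoding sketch of this inference step is given below. The top-k selection, the stride value, and the assumption that offsets and sizes are predicted in heatmap units are ours and may differ from the actual implementation.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, offset, size, top_k=100, stride=4):
    """heatmap: (C, H, W) after sigmoid; offset, size: (2, H, W).
    Returns per-detection scores, classes, and boxes in input-image pixels (Eq. (5) form)."""
    # 3x3 max-pooling keeps only locations that are the maximum of their 8-neighbourhood.
    peaks = F.max_pool2d(heatmap.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    heatmap = heatmap * (peaks == heatmap).float()

    C, H, W = heatmap.shape
    scores, inds = heatmap.reshape(-1).topk(top_k)
    classes = torch.div(inds, H * W, rounding_mode='floor')
    ys = torch.div(inds % (H * W), W, rounding_mode='floor').float()
    xs = (inds % W).float()

    cx = xs + offset[0, ys.long(), xs.long()]      # centre x plus predicted offset
    cy = ys + offset[1, ys.long(), xs.long()]
    w = size[0, ys.long(), xs.long()]
    h = size[1, ys.long(), xs.long()]
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1) * stride
    return scores, classes, boxes
```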
3.2. Pruning the RDDPA from Scratch

3.2.1. End-to-end Pruning
To achieve Objective 2, a typical structured channel pruning algorithm is utilized, in which the importance of each channel is identified according to the scaling factor of the network's pre-trained weights in the BN layer. The channels with smaller scaling factors, which cannot generate effective output activations, are pruned, and the remaining connections constitute the detection backbone network, as shown in Fig. 3. Specifically, our end-to-end pruning pipeline can be summarized as a 4-step algorithm, as shown in Table 1.
Process | 4-Step End-to-end Pruning Algorithm |
---|---|
Input | Structure of Tiny-Darknet Ma, and its weights Wa pre-trained on ImageNet. |
Step 1 | Scaling factors in the BN layers of Wa are sorted in descending order of their L1 magnitude. |
Step 2 | Set a global pruning threshold which is defined as the value in the p-percent index of the total scaling factors. |
Step 3 | Prune channels below the threshold value, obtaining the network structure Mb and weights Wb. |
Step 4 | Perform channel regularization on Mb and Wb, obtaining structure Mc and weights Wc. |
Output | Combine Mc and Wc as the final model for training. |
In Step 4, channel regularization means that the number of channels after pruning is adjusted to a multiple of 16 to facilitate binary operations. Although this operation sacrifices a small amount of pruning rate, it is hardware-friendly and can achieve high hardware parallelism to facilitate acceleration. For example, a convolution layer with a channel number of ‘13’ or ‘15’ wastes part of the hardware memory in the inference phase. Besides, only the Tiny-Darknet backbone is pruned, and the channels in the detector neck are then adaptively adjusted according to the number of channels in the pruned backbone layers, which addresses the issue of channel mismatch across shortcuts.
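The following sketch mirrors the four steps of Table 1 in PyTorch, including the channel regularization that rounds each layer's surviving channel count to a multiple of 16. The helper name and the rounding-up choice are illustrative assumptions, and rebuilding the actual conv/BN layers from the resulting plan is omitted for brevity.

```python
import torch
import torch.nn as nn

def channel_keep_plan(model, prune_rate, round_to=16):
    """Steps 1-3 of Table 1 plus the channel regularization of Step 4:
    rank all BN scaling factors globally, threshold at the p-percent index,
    then round each layer's surviving channel count to a multiple of `round_to`."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = gammas.sort().values[int(prune_rate * (gammas.numel() - 1))]

    plan = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            keep = int((m.weight.data.abs() > threshold).sum().item())
            keep = max(round_to, -(-keep // round_to) * round_to)   # round up to a multiple of 16
            keep = min(keep, m.weight.numel())
            # keep the channels with the largest scaling factors
            plan[name] = torch.argsort(m.weight.data.abs(), descending=True)[:keep]
    return plan
```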
3.2.2. Our Solution to Mitigate Accuracy Loss
Although end-to-end training of the pruned object detection network from scratch has been achieved, the resulting sparse network architecture is difficult to train. Previous work has shown that the sparser the model, the slower the learning and the lower the test accuracy.26) Traditional pruning methods usually adopt a tedious and time-consuming three-stage weight optimization pipeline to overcome these problems, but seldom consider whether the pruned model is suited to the original key hyperparameters, such as the learning rate, the learning rate decay schedule, and the number of training epochs. Previous works have only found that the pruned network can obtain performance compensation as long as the number of training epochs is increased within a reasonable range, so they simply double the number of base training epochs.13,16) To address the issues mentioned above and achieve Objective 3, this work proposes a linear strategy (LS) for accuracy compensation, which indicates that sparser models require a larger learning rate, a longer learning rate decay schedule, and more training epochs to converge. The formula is expressed as follows:
zout = zin / (1 − p)        (6)
where zin represents the input, such as the initial learning rate, p is the global pruning rate, and zout indicates the output.
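A short sketch of how the LS of Eq. (6) can be applied to the base training schedule is shown below. The pruning rate of 0.42 is a hypothetical example value, and the base settings are the ones reported for the unpruned model in Section 4.

```python
def linear_strategy(z_in, p):
    """LS of Eq. (6): scale a base hyper-parameter by 1 / (1 - p) for pruning rate p."""
    return z_in / (1.0 - p)

# Hypothetical example: scaling the unpruned model's schedule for a pruned model.
base_lr, base_epochs, base_milestones = 2.5e-4, 230, (160, 210)
p = 0.42                                    # assumed global channel pruning rate
lr = linear_strategy(base_lr, p)            # larger initial learning rate
epochs = round(linear_strategy(base_epochs, p))
milestones = [round(linear_strategy(m, p)) for m in base_milestones]
```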
To evaluate the effectiveness of our proposed pruning pipeline and RDDPA framework, extensive experiments were conducted on the NEU-DET dataset.33) Experimental results show that our pruning pipeline is more concise and efficient than other representative pruning algorithms, and our proposed RDDPA framework can achieve real-time detection on general-purpose GPUs with comparable or even higher accuracy compared with the state-of-the-art methods.
4.1. NEU-DET Database
The NEU-DET database is a benchmark dataset which contains six kinds of typical surface defects of hot-rolled steel strip as shown in Fig. 4, i.e., crazing (Cr), inclusion (In), patches (Pa), pitted surface (PS), rolled-in scale (RS), and scratches (Sc).33) Each category contains 300 grayscale images; 70% of them are randomly chosen for training, and the rest are used for evaluation in this work.
Our models are trained on a single high-end Tesla V100 GPU and tested on a mid-end Titan X GPU and a low-end GTX 960M GPU, using PyTorch 1.3, CUDA 10.1, and cuDNN v9. Herein, the input size of the network is set to 384×384, the batch size is 32, and the loss scalar λ for Lsize is 0.1. In the training phase, random flipping, scaling, and shifting are applied for data augmentation. For the unpruned model, the Adam optimizer with an initial learning rate of 2.5×10−4 is adopted; the learning rate is decayed by a factor of 10 at the 160th and 210th epochs, and the total number of epochs is 230. In the inference phase, a 1.65× input scale with the original resolution kept, and flip testing are applied throughout the experiments.
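A minimal training-loop sketch matching these settings is given below; build_rddpa and train_loader are hypothetical placeholders, and the convention that the model returns its combined loss is an assumption rather than the authors' actual interface.

```python
import torch

model = build_rddpa()                      # hypothetical constructor for the detector
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[160, 210], gamma=0.1)

for epoch in range(230):
    for images, targets in train_loader:   # 384x384 inputs, batch size 32, flip/scale/shift aug
        loss = model(images, targets)      # assumed to return Lhm + Loff + 0.1 * Lsize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```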
4.3. Evaluation of Our End-to-end Pruning Algorithm
The performance of our proposed end-to-end pruning algorithm is evaluated by the area under the precision-recall curve (mAP), inference speed, and model complexity metrics, i.e., floating-point operations (FLOPs), multiply-add operations (Madd), memory usage (Memory), storage space (Storage), and frames per second (FPS). For the pruned models, both the performance tested on a low-end GTX 960M GPU and the performance gain (+) or loss (−) relative to the full model are shown at different compression ratios.
Obviously, the performance of the model varies with the compression ratio. Figure 5 shows that a higher compression ratio can better leverage hardware parallelism to accelerate inference regardless of the GPU used. However, it can be seen from Figs. 5(a) and 5(b) that when the weight compression ratio exceeds 10.8×, the accuracy loss increases sharply for only a slight increase in speed. Even on the low-end GTX 960M GPU, a higher ratio still yields better acceleration but loses too much detection accuracy. Therefore, the model with the 10.8× compression ratio is considered the best, because it achieves accuracy similar to that of the unpruned model while maintaining a high inference speed.
Table 2 shows that, compared with the full model, i.e., RDDPA 1.0×, our compression ratio can reach 3× without any accuracy loss; the weights, FLOPs, Madd, memory usage, and storage space of the model are reduced to 2.2 M, 4.4 G, 8.6 G, 152.5 MB, and 25.5 MB respectively, and the inference speed is increased by 5.4 FPS on a single low-end GTX 960M GPU. The inference speed of our 4.3× compression ratio model is 30.2 FPS with only a 0.2% mAP loss, which satisfies real-time detection on a single low-end GPU. Moreover, our 10.8× compression ratio model achieves 79.2% mAP at 40.1 FPS with only a 0.7% mAP loss and an 18.5 FPS increase in speed, and the model contains only 0.6 M weights and requires 96.8 MB memory usage and 7.1 MB storage space. Therefore, the resource occupation is greatly reduced, which makes it usable for applications in which real-time detection is needed.
Model | Params | FLOPs | Madd | Memory | Storage | FPS | mAP (%) |
---|---|---|---|---|---|---|---|
RDDPA 1.0× | 6.7 M | 6.2 G | 12.3 G | 160.5 MB | 76.2 MB | 21.6 | 79.9 |
RDDPA 3.0× | 2.2 M | 4.4 G | 8.6 G | 152.5 MB | 25.5 MB | 27.0/+5.4 | 79.9/+0.0 |
RDDPA 4.3× | 1.6 M | 3.5 G | 6.9 G | 119.4 MB | 17.9 MB | 30.2/+8.6 | 79.7/−0.2 |
RDDPA 10.8× | 0.6 M | 1.8 G | 3.6 G | 96.8 MB | 7.1 MB | 40.1/+18.5 | 79.2/−0.7 |
In what follows, our end-to-end pruning approach is compared with other representative structured pruning methods at the same compression ratio, including network slimming (NS)30) and rethinking the value of network pruning (Rethink).13) NS imposes L1 sparsity on the channel scaling factors during sparse training, then prunes the channels with smaller scaling factors, and finally mitigates the accuracy loss through fine-tuning. In this work, we follow the NS method to sparsely train for 230 epochs on the NEU-DET dataset and fine-tune the pruned model for 150 epochs with a learning rate of 2.5×10−4. Rethink indicates that fine-tuning the pruned model with inherited weights is not better than training it from scratch.13) Therefore, the pre-trained weights that achieve 79.9% mAP on the NEU-DET dataset are pruned directly, and two training strategies from the Rethink method, namely Scratch-E and Scratch-B, are adopted in this work. Scratch-E denotes training the pruned models with the same parameter settings, while Scratch-B denotes doubling the number of training epochs. The same pruning strategy is adopted for both of these methods, i.e., the layers associated with shortcuts are not pruned. Meanwhile, in order to maintain the accuracy of the model while reducing the number of parameters, NS and Rethink can only compress the number of parameters by a factor of three. Thus, for the sake of fairness, a compression ratio of 3.0× is also used to assess the effectiveness of our proposed model.
Table 3 shows that, compared to the NS and Rethink methods, our method achieves 79.9% mAP at the 3× compression ratio with a relatively small training budget (395 epochs), obtaining the same accuracy as the unpruned full model. Although NS requires only 380 epochs, its detection accuracy is 3.8% lower than that of our method, and additional sparse training and fine-tuning stages are required. Moreover, our method implements end-to-end pruning in a real sense for object detection, without requiring pre-trained weights, sparse training, or fine-tuning on the NEU-DET dataset.
Approach | Compression Ratio | Pretrained (ImageNet) | Pretrained (NEU‑DET) | Sparsity | Fine‑tuned | Epochs | mAP (%) | ΔAcc (%) |
---|---|---|---|---|---|---|---|---|
NS | 3.0× | ✓ | ✗ | ✓ | ✓ | 380 | 76.1 | −3.8 |
Rethink | 3.0× | ✓ | ✓ | ✗ | ✗ | 460 | 77.1 | −2.8 |
Rethink* | 3.0× | ✓ | ✓ | ✗ | ✗ | 690 | 77.7 | −2.2 |
Ours | 3.0× | ✓ | ✗ | ✗ | ✗ | 395 | 79.9 | +0.0 |
Note that Rethink and Rethink* correspond to the Scratch-E and Scratch-B training strategies, respectively.
To verify the effectiveness of our method, several representative two-stage object detectors, e.g., Faster R-CNN,7) Cascade R-CNN,34) and DDN,21) and one-stage object detectors, e.g., ATSS,23) FCOS,9) SSD,18) CenterNet,24) DEA_RetinaNet,8) and RDN36) have been compared with our proposed RDDPA framework on the same GPUs.
As shown in Table 4, RDDPA 10.8× achieves 79.2% mAP at 103.7 FPS on the Titan X GPU and 40.1 FPS on the GTX 960M GPU with only 0.6 M weights and 1.8 GFLOPs, which is a higher mAP than the representative two-stage detectors Cascade R-CNN and Faster R-CNN but lower than DDN-ResNet50. However, the inference speed of our RDDPA on a single low-end or mid-end GPU is much faster than Cascade R-CNN, Faster R-CNN, and DDN. On the mid-end Titan X GPU, RDDPA runs nearly 10× faster than Cascade R-CNN and DDN-ResNet50, and about 5× faster than Faster R-CNN and DDN-ResNet34. On the low-end GTX 960M GPU, RDDPA runs 22× faster than DDN-ResNet50 with comparable performance. Compared with the well-known one-stage detectors, RDDPA outperforms all of them in accuracy, and its inference speed is much faster than ATSS, FCOS, SSD, CenterNet, and RDN (about 6.7×, 5×, 3.2×, 2.5×, and 1.6× speedup on the Titan X GPU and 18.2×, 12.5×, 9.1×, 5.2×, and 2.5× speedup on the GTX 960M GPU, respectively).
Approach | Backbone | Params | FLOPs | FPS* | FPS | mAP (%) |
---|---|---|---|---|---|---|
two-stage: | ||||||
Cascade R-CNN | ResNet50 | 68.9 M | 350.1 G | 11.5 | 2.2 | 74.3 |
Faster R-CNN | ResNet50 | 41.2 M | 322.3 G | 20.7 | 4.2 | 77.5 |
DDN | ResNet34 | 28.2 M | – | 17.1 | 3.3 | 74.8 |
DDN | ResNet50 | 97.0 M | – | 11.0 | 1.8 | 82.3 |
one-stage: | ||||||
ATSS | ResNet50 | 32.1 M | – | 20.0 | 3.0 | 63.4 |
ATSS | ResNet101 | 50.0 M | – | 15.4 | 2.2 | 67.8 |
FCOS | ResNet50 | 31.9 M | 315.0 G | 20.0 | 3.2 | 71.3 |
SSD300 | VGG16 | 24.4 M | 30.7 G | 32.0 | 4.4 | 74.8 |
CenterNet | DLA34 | 19.7 M | 17.8 G | 41.5 | 7.7 | 77.1 |
DEA_RetinaNet | ResNet50 | 42.2 M | 105.3 G | – | – | 79.1 |
RDN | ResNet18-dsf | 24.0 M | 6.89 G | 64.0 | 15.9 | 80.0 |
Ours: | ||||||
RDDPA 1.0× | Tiny-Darknet | 6.7 M | 6.2 G | 85.6 | 21.6 | 79.9 |
RDDPA 3.0× | Tiny-Darknet | 2.2 M | 4.4 G | 92.9 | 27.0 | 79.9 |
RDDPA 4.3× | Tiny-Darknet | 1.6 M | 3.5 G | 98.0 | 30.2 | 79.7 |
RDDPA 10.8× | Tiny-Darknet | 0.6 M | 1.8 G | 103.7 | 40.1 | 79.2 |
Note that FPS* indicates frames per second tested on a mid-end Titan X GPU while FPS is tested on a low-end GTX 960M GPU.
Moreover, it can be seen from Table 4 and Fig. 6 that only a small number of one-stage detectors, i.e., FCOS, SSD, CenterNet, and RDN, can perform real-time detection on a mid-end GPU. However, the detection accuracy of our RDDPA is much higher than that of CenterNet, SSD, and FCOS. Compared with RDN, our proposed RDDPA 1.0× reduces the Params and FLOPs by 17.3 M and 0.69 G, respectively, while incurring only a minimal 0.1% loss in mAP. In contrast to RDDPA 10.8×, although RDN demonstrates a 0.8% improvement in mAP, its Params are 40 times larger and its FLOPs are 5.09 G higher. On a low-end GPU, only our RDDPA 4.3× and RDDPA 10.8× can realize real-time defect detection with high detection performance on the NEU-DET dataset. In other words, the huge inspection cost has been effectively reduced in this work.
The traditional CNN-based pruning pipeline contains three training stages. It is generally deemed that training a sparse model on the benchmark dataset is particularly important for pruning because it provides highly representative weights inherited from the original network, as in network slimming30) and AutoPruner.35) In what follows, this work shows that the effect of sparsely trained weights on the NEU-DET dataset is quite different from what is commonly assumed in the conventional network pruning pipeline. In light of this surprising observation, a novel pruning pipeline is presented in which the pruned structure can be obtained directly from the weights pre-trained on ImageNet, so that end-to-end pruning of the detection network can be achieved. Specifically, the weights at different sparse training phases (at an interval of 30 epochs) are saved and then used as the initial weights of the network to explore whether the sparse training procedure is crucial to the final pruned structure.
From Fig. 7, it can be seen that whether the pruned weights are pre-trained on ImageNet or obtained after sparse training, the resulting network structures are nearly identical regardless of the pruning rate. Therefore, two conclusions can be drawn from this phenomenon. First, the weights pre-trained on the large-scale ImageNet benchmark, together with their sparsity, are more representative and have strong generalization ability. Second, sparse training on the NEU-DET dataset has no effect on the final pruned network structure, whereas the weights pre-trained on ImageNet directly determine it. Therefore, sparse training is removed in our work, and this pruning approach brings three benefits. First, high channel compression ratios can be achieved without considering the channel mismatch in shortcuts between the backbone network and the neck caused by pruning, an issue that is seldom addressed in previous channel pruning solutions. Second, the task of pruning the detection network is simplified to pruning the classification network, which is conducive to the realization of end-to-end pruning, i.e., training the pruned detection network from scratch and getting rid of the full pruning pipeline. Third, this approach does not introduce additional parameters or indexes, so it greatly facilitates training and inference acceleration.
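To make this comparison concrete, one simple way to quantify how similar two pruned structures are is to measure the overlap of their kept-channel sets, as sketched below. The plan format follows the channel_keep_plan sketch given earlier and is our own illustration, not necessarily the procedure used to produce Fig. 7.

```python
def structure_overlap(plan_a, plan_b):
    """Mean Jaccard overlap between the kept-channel sets of two pruning plans,
    e.g., one derived from ImageNet pre-trained weights and one from weights
    after sparse training on NEU-DET."""
    ratios = []
    for name in plan_a:
        a = set(plan_a[name].tolist())
        b = set(plan_b[name].tolist())
        ratios.append(len(a & b) / max(len(a | b), 1))
    return sum(ratios) / max(len(ratios), 1)
```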
Extensive experiments are also conducted to explore the relationship between the learning rate and the pruning rate. Specifically, the learning rate under a specific pruning rate is gradually increased to visualize the convergence of the training loss. It can be seen from Fig. 8(a) that as the learning rate increases, the benefit brought by loss convergence first increases gradually and then decreases sharply. Moreover, as the pruning rate increases (see Figs. 8(b)–8(c)), the benefit from increasing the learning rate turns into a sharp increase followed by a gradual decrease. This phenomenon indicates that models with higher pruning rates require a larger learning rate to converge well. For example, the pruning rates of 0.34, 0.43, and 0.55 correspond to best learning rate magnifications of 1.5×, 1.6×, and 2.2×, respectively. In Fig. 8(d), the potential relationship between the learning rate and the pruning rate is regarded as a 2D point (p, n), where p is the pruning rate and n is the magnification of the initial learning rate, and the three best values under each pruning rate have been chosen in this work. As shown in Fig. 8(d), as the pruning rate increases, the corresponding optimal learning rate also increases. Therefore, a curve is fitted to approximate this distribution, which is defined as LS in this work to obtain accuracy compensation.
In this work, it is found in practice that pruned models with higher compression ratios need a longer training schedule to converge well. Therefore, Eq. (6) is also adopted to obtain the corresponding number of training epochs, which amounts to a computation budget similar to that of full-model training. Empirically, this training scheme is essential to improve the performance of the pruned models. As shown in Fig. 9 (right), the loss curve of training with LS is closer to that of the full network, which effectively compensates for the accuracy loss caused by pruning.
Table 5 also demonstrates that the benefits of the LS increase as the pruning rate increases. For example, the performance gain it brings is 0.9% mAP at a 3× compression ratio and 2.4% mAP at a 10.8× compression ratio. Therefore, LS is critical for the pruned models with a large pruning rate to achieve similar performance to the full network in this work.
Compression Ratio | mAP (w/o LS) | mAP (w/ LS) | Δ |
---|---|---|---|
1.0× | 79.9% | 79.9% | +0.0% |
3.0× | 79.0% | 79.9% | +0.9% |
4.3× | 78.1% | 79.7% | +1.6% |
10.8× | 76.8% | 79.2% | +2.4% |
This work proposes RDDPA, a real-time defect detection framework that can be deployed on a single general-purpose device via a novel and straightforward pruning scheme, which addresses the issues of the traditional three-stage pruning pipeline and realizes end-to-end pruning in a real sense. Extensive experiments have demonstrated that our method is concise and efficient, and can realize real-time defect detection on a single low-end GPU with high accuracy by removing a huge number of redundant weights.
Although our models have generally achieved good results on the NEU-DET dataset, the mAP of the best model only reaches 79.9%, so there is still much room for improvement. As future work, one direction is to improve the detection accuracy using data augmentation technology or stronger backbone networks. Another direction is to perform the defect segmentation task, which can produce more precise defect boundaries. Additionally, extending RDDPA to other industrial application scenarios would be a promising research topic.
This work was supported by the National Natural Science Foundation of China (Nos. 62172004, 62072002, and 61872004), the Anhui Province Collaborative Innovation Project (Nos. GXXT-2022-050 and GXXT-2022-053), and the Educational Commission of Anhui Province (No. 2022AH050336).