ISIJ International
Online ISSN : 1347-5460
Print ISSN : 0915-1559
ISSN-L : 0915-1559
Regular Article
SOKD: A Soft Optimization Knowledge Distillation Scheme for Surface Defects Identification of Hot-Rolled Strip
Wenyan Wang, Zheng Ren, Cheng Wang, Kun Lu, Tao Tao, Xuejuan Pan, Bing Wang

2025 Volume 65 Issue 1 Pages 104-110

Abstract

Surface defects of hot-rolled strip are a significant factor affecting the performance of strip products. In recent years, convolutional neural networks (CNNs) have been extensively used in strip surface defect recognition to ensure product quality. However, existing CNN-based methods face the challenges of high complexity, difficult deployment, and slow inference. Accordingly, this work proposes a soft optimization knowledge distillation (SOKD) scheme that distills the large ResNet-152 model to extract a compact strip surface defect recognition model. The SOKD scheme utilizes Kullback-Leibler (KL) divergence to minimize the error between the soft probability distributions of the student network and the teacher network, and gradually reduces the weight of the "Soft loss" during the training process. This operation significantly relaxes the constraints that the prior knowledge of the teacher network imposes on the student network in the original KD, thereby improving the recognition performance of the model. Additionally, SOKD is applicable to most CNNs for identifying surface defects of hot-rolled strip. Experimental results on the NEU-CLS dataset show that SOKD outperforms state-of-the-art methods.

1. Introduction

Hot-rolled strip has been widely used in automotive, electrical, chemical and other fields, and its surface defects have a great influence on the quality of strip steel. Therefore, it is necessary to automatically detect the surface defects of hot-rolled strip to reduce economic losses.

Currently, surface defect recognition technology is mainly based on machine vision, which can be divided into traditional machine learning methods and deep learning methods. Traditional machine learning methods are mainly composed of a feature extraction module and a classifier, and a great deal of meaningful work along these lines has been proposed.1,2,3) Gong et al. proposed a support vector hyper-spheres classifier, which enhances classification efficiency by constructing an independent hypersphere for each type of defect. Additionally, the classifier effectively mitigates the impact of noise on defect recognition by introducing the pinball loss and weighting the samples within each class.4) Liu et al. combined the MB-LBP feature with a weighted voting cascade classifier to achieve higher recognition accuracy for steel plate surface defects.5) Ghorai et al. selected a three-level Haar feature set and classified it with an SVM, which alleviated the problems of large surface area, variation in appearance, and rare occurrence of defects in defect recognition.6) However, powerful features usually require extensive experiments and manual selection, such as the Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), and wavelet decomposition.7,8,9) The commonly used classifiers in machine learning include Bayesian classifiers, K-nearest Neighbors, and SVM, along with other statistical methods, which generally face challenges such as low classification accuracy and poor robustness. Therefore, while these works have achieved high recognition accuracy by combining machine learning algorithms with various hand-designed defect features, these methods often involve complex processing procedures and lack robustness, making it challenging to meet the requirements of real-time recognition.

In recent years, benefiting from the development of deep learning, the challenges of manually designed features and poor robustness in traditional machine learning methods are gradually being addressed. In particular, CNNs learn features autonomously from the data, thereby significantly enhancing the efficiency and accuracy of feature extraction. Consequently, CNNs are increasingly used for automated identification of surface defects on hot-rolled strips.10,11,12,13,14,15) Feng et al. integrated the Convolutional Block Attention Module (CBAM) into the ResNet50 network and achieved an accuracy of 94.11% on a surface defect classification dataset.16) Wan et al. proposed a maximum average feature extraction module based on the VGG19 network, which reduced the difficulty of recognizing high-brightness defect images.17) Although CNN-based methods have continuously improved the accuracy of surface defect recognition, their network structures have become increasingly complex and contain more parameter redundancy, which is not conducive to practical industrial applications. To improve this situation, researchers have proposed lightweight networks for identifying surface defects. Hu et al. achieved faster recognition using a lightweight MobileNetV2 network.18) Fu et al. constructed a multi-scale pooling CNN based on SqueezeNet and achieved 130 FPS on the NEU surface defect dataset.19) Such lightweight CNNs remove some of the redundant parameters in their structure, thereby improving inference speed to a certain extent. However, this reduction in parameters often leads to a loss in recognition accuracy, and it does not completely eliminate unnecessary parameters. Moreover, these lightweight networks generally rely on specially designed convolution structures to reduce network complexity, which have poor compatibility in actual deployment.20,21,22)

To address this issue, this work proposes a soft optimization knowledge distillation (SOKD) scheme that transfers valuable knowledge from a large model to a small model. Specifically, a highly complex CNN model (the Teacher) is first trained to achieve powerful performance on the surface defect recognition task. Next, a compact substructure (the Student) is extracted from the Teacher. Moreover, to get rid of the constraints of the prior knowledge in the Teacher, a softened optimization scheme is introduced.

2. Methodology

2.1. Soft Optimization Knowledge Distillation

Unlike the conventional CNN training strategy, SOKD consists of two main components: self-learning of the student network and knowledge transfer from the teacher network. The overall structure of SOKD is illustrated in Fig. 1. Through self-learning, the student network minimizes the difference between the probability distribution P produced by forward propagation and the one-hot label, while the teacher network transfers knowledge to the student network. Finally, the predictions of these two branches are fused to improve the accuracy of defect identification. In addition, the parameters of the teacher network are frozen, while those of the student network are updated during the training process.

Fig. 1. Knowledge distillation structure. (Online version in color.)

During self-learning, given a defective image and its corresponding one-hot encoded label as input, the student network generates a predicted probability distribution over the defect categories through forward propagation. The cross-entropy loss function is then used to iteratively optimize the parameters of the student network. In short, the optimization goal of student self-learning is to minimize the cross-entropy loss, which this work defines as the "Hard loss":

  
$$L_{\mathrm{HARD}}(S_\theta)=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log\big(p_{S_\theta}(x_i)\big)\qquad(1)$$

where M and N represent the number of categories and the total number of samples, respectively; y_ij is the true label of input x_i, and p_{S_θ}(x_i) denotes the output probability distribution of the student network S_θ. The knowledge transfer of the teacher network refers to a powerful model imparting its prior knowledge to a small model. The teacher network is commonly more powerful than the student network, so it can guide the student to performance superior to self-learning alone. However, the output of the teacher network is close to a one-hot encoding, which produces a large probability gap between the true category and the other categories and therefore provides little additional information to the student network. To solve this problem, Hinton et al. introduced a temperature factor, denoted "T", into the Softmax function. This factor amplifies the predicted values of the other categories, thereby providing additional category information for the student network during self-learning.23) The probability distribution of the model output becomes "softer" as "T" increases, and this soft probability distribution tells the student network that "this is the class with the greatest probability, but the input is also somewhat similar to certain classes and not at all similar to others." Accordingly, as long as the value of "T" lies within a suitable range, a set of relatively suitable soft probability distributions can be generated. The formula is defined as follows:

  
$$q_{T_\theta}(x_i)=\frac{\exp(z_i/T)}{\sum_{k}\exp(z_k/T)}\qquad(2)$$

where z_i is the logit of the i-th class and T is the temperature factor. The predicted class distribution becomes softer as the temperature increases. Since the KL divergence can measure the difference between the soft probability distributions of the teacher network and the student network, this work adopts it to minimize the error between these two soft probability distributions, which is called the "Soft loss". The Soft loss is defined as follows:

  
$$L_{\mathrm{SOFT}}(S_\theta)=-\frac{1}{N}\sum_{i=1}^{N} q_{T_\theta}(x_i)\log\big(p_{S_\theta}(x_i)\big)+\frac{1}{N}\sum_{i=1}^{N} q_{T_\theta}(x_i)\log\big(q_{T_\theta}(x_i)\big)\qquad(3)$$

Since q_{T_θ}(x_i) is known, the second term of Eq. (3) is a constant. Consequently, the overall objective function is:

  
$$L(S_\theta,x_i)=\alpha T^{2}L_{\mathrm{SOFT}}(S_\theta)+(1-\alpha)L_{\mathrm{HARD}}(S_\theta)\qquad(4)$$

where α and T² weight the contributions of L_SOFT(S_θ) and L_HARD(S_θ). In the original KD, the teacher network generally employs a high weight α to guide the student network, under the assumption that this yields the best recognition result for the student model.
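To make Eqs. (1)-(4) concrete, the standard KD objective can be expressed in PyTorch roughly as follows. This is an illustrative sketch only; the function name kd_loss and the default values of T and α are our own assumptions rather than the authors' implementation.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard knowledge-distillation objective, Eqs. (1)-(4).

    student_logits, teacher_logits: [N, M] raw network outputs.
    labels: [N] integer class labels.
    T: temperature factor that softens both distributions.
    alpha: weight of the soft loss; (1 - alpha) weights the hard loss.
    """
    # "Hard loss": cross-entropy between student predictions and labels (Eq. (1)).
    hard = F.cross_entropy(student_logits, labels)

    # "Soft loss": KL divergence between the temperature-softened teacher
    # and student distributions (Eqs. (2)-(3)).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    )

    # Overall objective (Eq. (4)); T^2 keeps the gradient scale of the
    # soft term comparable to that of the hard term.
    return alpha * (T ** 2) * soft + (1.0 - alpha) * hard
```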

However, the idea that "student networks can always learn better under the continuous guidance of teachers" does not match the actual results. To overcome this issue, this work proposes a soft optimization scheme that gradually removes the prior-knowledge constraints of the teacher network. Specifically, the value of α is gradually decreased throughout the training process. Therefore, the student network focuses on learning from the teacher network during the first half of training and gradually gets rid of the constraints of the teacher's prior knowledge in the later stage of training, i.e., it pays more attention to reducing the "Hard loss".
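The text does not specify a closed-form schedule for α; a simple linear decay such as the following sketch would realize the described behaviour (soft_alpha and its default values are hypothetical, chosen only for illustration). The per-epoch weight would then be passed to the distillation loss, e.g. alpha = soft_alpha(epoch, total_epochs).

```python
def soft_alpha(epoch, total_epochs, alpha_start=0.9, alpha_end=0.0):
    """Linearly decay the soft-loss weight alpha over training (SOKD idea).

    Early epochs: alpha is close to alpha_start, so the student mainly
    follows the teacher's softened outputs. Late epochs: alpha approaches
    alpha_end, so the student focuses on minimizing the "Hard loss".
    """
    frac = epoch / max(total_epochs - 1, 1)
    return alpha_start + (alpha_end - alpha_start) * frac
```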

2.2. Structures of Teacher and Student Networks

The teacher network is typically chosen to be a network that achieves excellent recognition accuracy, so that it can effectively guide the learning of the student network. Therefore, this work utilizes the large ResNet-152 as the teacher network and initializes it with ImageNet pre-trained weights during the training process. As shown in Table 1, the teacher network contains 50 bottleneck residual blocks and obtains 83.7% top-1 accuracy and 96.7% top-5 accuracy on the ImageNet validation set. In this work, the last layer of the teacher network is fine-tuned for the recognition of hot-rolled strip surface defects.

Table 1. Structure and parameters of teacher network.

Layer name | Output size | Convolution block
Conv1 | 112×112 | 7×7, 64
Conv2_x | 56×56 | 3×3 max pool; [1×1, 64; 3×3, 64; 1×1, 256] × 3
Conv3_x | 48×48 | [1×1, 128; 3×3, 128; 1×1, 512] × 8
Conv4_x | 24×24 | [1×1, 256; 3×3, 256; 1×1, 1024] × 36
Conv5_x | 12×12 | [1×1, 512; 3×3, 512; 1×1, 2048] × 3
FC | 1×1 | Average pool, 6-d, Softmax
FLOPs | 11.3×10⁹ |
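As described above, only the final fully connected layer of the pre-trained ResNet-152 teacher is replaced and fine-tuned for the six defect classes. A minimal torchvision sketch of this setup is given below; the variable names and the freezing strategy shown here are our own assumptions.

```python
import torch.nn as nn
from torchvision import models

# Teacher: ResNet-152 initialized with ImageNet pre-trained weights.
teacher = models.resnet152(pretrained=True)

# Freeze the backbone so that only the new classification head is trained.
for param in teacher.parameters():
    param.requires_grad = False

# Replace the 1000-class ImageNet head with a 6-class defect head;
# this new layer is the part fine-tuned on the defect data.
teacher.fc = nn.Linear(teacher.fc.in_features, 6)
```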

In order to compress the model parameters as much as possible, the structure of the student network is designed to resemble that of the teacher network. It consists of bottleneck residual blocks, which effectively reduce both training and inference time. The structure of the student network is shown in Fig. 2(a), where a Unit denotes a basic bottleneck residual block. The relationship between the number of Units and the Parameters and floating-point operations (FLOPs) is illustrated in Fig. 2(b): as the number of units in the student network increases, the parameters and FLOPs also increase. When the number of units reaches 5, the parameters of the student network increase sharply from 100 M to around 500 M. Consequently, to train an accurate and compact model, this work adopts 1, 2, 3, and 4 bottleneck residual units as the student network, respectively.
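For illustration, a student built from a small stem, a configurable number of bottleneck residual units, and a classification head could be sketched as follows. The channel widths, stem design, and class names here are assumptions; the authors' exact student configuration is specified in the text only through its parameter and FLOP counts.

```python
import torch.nn as nn

class BottleneckUnit(nn.Module):
    """A 1x1 -> 3x3 -> 1x1 bottleneck residual block (ResNet style)."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut so the residual addition matches in shape.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x) + self.shortcut(x))

class StudentNet(nn.Module):
    """Compact student: stem + N bottleneck units + classifier head."""
    def __init__(self, num_units=1, num_classes=6):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(16), nn.ReLU(inplace=True),
        )
        units, ch = [], 16
        for _ in range(num_units):
            units.append(BottleneckUnit(ch, ch, ch * 2, stride=2))
            ch *= 2
        self.units = nn.Sequential(*units)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, num_classes))

    def forward(self, x):
        return self.head(self.units(self.stem(x)))
```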

Fig. 2. Structure and parameter curve of student network. (Online version in color.)

3. Experiments and Results

3.1. Dataset

This work conducts experiments on the NEU-CLS dataset to evaluate the effectiveness of SOKD. The dataset contains six types of defects: crazing, inclusion, patches, scratches, pitted surface, and rolled-in scale.24) Some examples are illustrated in Fig. 3. Each defect category includes 300 images, and the resolution of each image is 200 × 200 pixels.
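Assuming the images are organized into one folder per defect class, a typical loading pipeline with torchvision could look like the following sketch; the directory layout, train/test split, and absence of data augmentation are our own assumptions rather than details given by the authors.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# NEU-CLS images are 200x200; they are resized here to the 224x224
# network input size used in Section 3.2.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Assumes one sub-folder per defect class (crazing, inclusion, patches,
# pitted_surface, rolled-in_scale, scratches) under NEU-CLS/train.
train_set = datasets.ImageFolder("NEU-CLS/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
```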

Fig. 3. Defect images of different categories in the NEU-CLS dataset. (a) crazing, (b) inclusion, (c) patches, (d) pitted surface, (e) rolled-in scale, (f) scratches.

3.2. Implementation Details

All experiments in this work are performed on the same machine; the GPU used for training and testing is an NVIDIA Tesla V100, and the software platform consists of PyTorch 1.3, CUDA 10.1, and cuDNN v9. The input image size and batch size are set to 224×224 and 32, respectively. In addition, to prevent the network from suffering gradient explosion during the initial training stage, which would cause the model's performance to plateau until the learning rate drops, this work adopts the cosine annealing learning rate decay scheme.25)

According to Fig. 4, the traditional step decay scheme requires setting a fixed, relatively large learning rate in the initial training stage and then reducing it by a factor of x after training for a period of time. This approach enables the model to learn rapidly in the first half of training and more smoothly in the second half to prevent overfitting. However, this scheme may cause gradient explosion in the early stage of training, and in the middle stage the performance remains stable but does not improve. Conversely, the cosine annealing learning rate decay scheme with warm-up starts from a smaller learning rate in the initial stage of training and returns to the initial learning rate once learning has stabilized, after which the rate decays along a cosine curve. This work reports the mean and standard deviation of five independent experiments as the experimental results to verify the effectiveness and robustness of the model.

Fig. 4. Comparison of preheated cosine annealing and traditional stepped learning rate decay scheme. (Online version in color.)
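The warm-up plus cosine-annealing behaviour compared in Fig. 4 can be reproduced, for example, with a per-epoch multiplier passed to PyTorch's LambdaLR scheduler. The warm-up length, base learning rate, optimizer, and total number of epochs below are assumptions, not values reported by the authors.

```python
import math
import torch

# Placeholder model standing in for the student network being trained.
model = torch.nn.Linear(10, 6)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

warmup_epochs, total_epochs = 5, 100  # assumed values

def lr_factor(epoch):
    """Linear warm-up to the base learning rate, then cosine decay."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
# Call scheduler.step() once per epoch after the optimizer updates.
```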

3.3. Evaluation Metrics

This work primarily focuses on the multi-class classification of surface defects in hot-rolled strips, so the evaluation indicators need to measure both the generalization ability and the recognition performance of the model. Accordingly, this work uses Precision, Recall, F1-Score, and Accuracy as evaluation metrics. These indicators are defined as follows:

  
$$\mathrm{Accuracy}=\frac{TP+TN}{TP+FP+FN+TN}\qquad(5)$$

  
$$\mathrm{Precision}=\frac{TP}{TP+FP}\qquad(6)$$

  
$$\mathrm{Recall}=\frac{TP}{TP+FN}\qquad(7)$$

  
$$F1=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}\qquad(8)$$

where TP denotes the number of positive samples correctly predicted as positive, FP the number of negative samples incorrectly predicted as positive, TN the number of negative samples correctly predicted as negative, and FN the number of positive samples incorrectly predicted as negative.
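For the six-class task, these indicators are typically computed per class and then averaged; scikit-learn's macro-averaged versions provide the same quantities. The snippet below is a sketch with placeholder arrays, and the macro averaging mode is an assumption, since the averaging used by the authors is not stated.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true, y_pred: integer class indices collected over the test set
# (tiny placeholder lists here; in practice they come from the model).
y_true = [0, 1, 2, 2, 5, 3]
y_pred = [0, 1, 2, 4, 5, 3]

accuracy = accuracy_score(y_true, y_pred)                    # Eq. (5)
precision, recall, f1, _ = precision_recall_fscore_support(  # Eqs. (6)-(8)
    y_true, y_pred, average="macro", zero_division=0)
print(f"Acc {accuracy:.4f}  P {precision:.4f}  R {recall:.4f}  F1 {f1:.4f}")
```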

3.4. Experiment Results and Analysis

In this work, 1, 2, 3, and 4 bottleneck residual units are used as the student network (referred to as Unit-1, Unit-2, Unit-3, and Unit-4), respectively. As shown in Table 2, the Teacher model achieves 100% Accuracy, Precision, Recall, and F1 score. However, the Teacher model has as many as 58.18 M parameters, its computation reaches 11820 M FLOPs, and its inference speed is only 22 FPS. Compared with the Teacher model, Unit-1 achieves the same level of accuracy while its parameters are only 0.03 M, its FLOPs are only 194.7 M, and its inference speed reaches 200 FPS. Although Unit-2, Unit-3, and Unit-4 also achieve 100% recognition performance, their other metrics (parameters, FLOPs, and FPS) are worse than those of Unit-1. Specifically, the parameters and FLOPs of Unit-2, -3, and -4 increase continuously as residual structures are added, especially for Unit-4 with 5.16 M parameters and 997.93 M FLOPs. Meanwhile, the inference speed of these units gradually slows down, and the inference speed of Unit-4 is only 114 FPS. Therefore, the student network with only one bottleneck residual unit offers the best overall performance.

Table 2. Evaluation results of soft optimization knowledge distillation method.

Networks | Params (M) | FLOPs (M) | FPS | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Teacher | 58.18 | 11820 | 22 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
Unit-1 | 0.03 | 194.70 | 200 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
Unit-2 | 0.27 | 462.44 | 160 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
Unit-3 | 1.24 | 730.19 | 133 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
Unit-4 | 5.16 | 997.93 | 114 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00

In deep learning, visualizing the features learned by certain layers of the network helps researchers understand whether the model extracts important information. However, the spatial features learned by deep networks are generally high-dimensional and not directly observable. Accordingly, t-distributed stochastic neighbor embedding (t-SNE) is applied to project the high-dimensional data onto a two-dimensional plane.26) To inspect whether the model has effectively learned the defect information, this work visualizes the features of the last convolutional layer of Unit-1 with t-SNE. Figure 5 illustrates that defect features within the same class are closely clustered together, while different categories are far apart, indicating that the model exhibits small intra-class divergence and large inter-class divergence.

Fig. 5. t-SNE visualization of extracted features. (Online version in color.)
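Such a plot can be produced by collecting the last-convolutional-layer activations and projecting them with scikit-learn's TSNE. The sketch below uses a random placeholder feature array; the perplexity value and plotting details are assumptions rather than the authors' settings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# features: [num_samples, feature_dim] last-conv-layer activations,
# labels:   [num_samples] defect class indices (random placeholders here).
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 256))
labels = rng.integers(0, 6, size=300)

# Project the high-dimensional features onto a 2-D plane.
embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of last-conv-layer features (Unit-1)")
plt.show()
```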

3.5. SOKD Extension

The above results demonstrate that the method proposed in this work can effectively extract the most simplified substructure from the complex teacher network. To further verify the applicability and robustness of SOKD, this work extends the method to other well-known CNNs while keeping the teacher network unchanged: SqueezeNet,27) DenseNet,28) MixNet,29) EfficientNet,30) GhostNet,31) MobileNetV2,32) MobileNetV3,33) and ShuffleNet,21) extracting a basic unit from each of these CNNs as the student network. The results are shown in Table 3: SqueezeUnit, DenseUnit, MobileNetV3Unit, and Bottleneck all achieve 100% recognition performance with much smaller Params and FLOPs. Moreover, the basic units of the remaining networks achieve more than 99% recognition performance, and their parameters and FLOPs are far smaller than those of the teacher network. Therefore, the SOKD proposed in this work is effective and scalable.

Table 3. Results of the soft optimized knowledge distillation method extended to other classification networks.

Networks | Params (M) | FLOPs (M) | FPS | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Teacher | 58.18 | 11820 | 22 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
SqueezeUnit | 0.03 | 194.71 | 98 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
DenseUnit | 0.19 | 692.04 | 196 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
MixUnit | 0.03 | 129.01 | 189 | 99.59 ± 0.08 | 99.60 ± 0.08 | 99.59 ± 0.08 | 99.59 ± 0.08
EffiInvResUnit | 0.03 | 141.86 | 186 | 99.59 ± 0.21 | 99.60 ± 0.22 | 99.59 ± 0.21 | 99.59 ± 0.21
GhostUnit | 0.03 | 141.00 | 150 | 99.77 ± 0.07 | 99.78 ± 0.07 | 99.77 ± 0.07 | 99.77 ± 0.07
MobileNetV2Unit | 0.09 | 244.46 | 212 | 99.96 ± 0.08 | 99.96 ± 0.07 | 99.96 ± 0.08 | 99.96 ± 0.08
MobileNetV3Unit | 0.06 | 172.82 | 182 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
BasicBlock | 0.24 | 301.96 | 208 | 99.85 ± 0.08 | 99.85 ± 0.07 | 99.85 ± 0.08 | 99.85 ± 0.08
Bottleneck | 0.03 | 177.02 | 200 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00
ShuffleV2Unit | 0.02 | 142.45 | 184 | 99.48 ± 0.18 | 99.49 ± 0.19 | 99.48 ± 0.18 | 99.48 ± 0.18

4. Discussion

4.1. Comparison with Other Advanced Defect Identification Methods

This work conducts controlled experiments on the NEU-CLS dataset by comparing SOKD with seven state-of-the-art methods: 1) VGG19;17) 2) MobileNetV2-TARGAN;18) 3) ResNet-101;34) 4) DarkNet-53;34) 5) SCN;34) 6) ResNet50+CBAM+FcaNet;17) 7) CNN.35) The experimental results are shown in Table 4: the Bottleneck trained with the SOKD method achieves the highest recognition performance (Accuracy, Precision, Recall, and F1-Score), and its Params are 17× lower than those of the CNN with the fewest Params. Although the FLOPs of Bottleneck are 6× higher than those of MobileNetV2-TARGAN, its Params are 31× lower, and the accuracy, precision, recall, and F1 score are improved by 1.75%, 1.5%, 1.9%, and 1.7%, respectively. In summary, the SOKD method significantly improves the performance of the hot-rolled strip surface defect recognition task, making it easier to deploy in the actual production process.

Table 4. Comparison of the proposed method with other state-of-the-art methods.

Methods | Params (M) | FLOPs (M) | FPS | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
VGG19 | 20.04 | 19573.96 | – | 97.62 | 97.86 | 97.84 | 97.84
MobileNetV2-TARGAN | 2.23 | 26.04 | – | 98.25 | 98.50 | 98.10 | 98.30
ResNet-101 | 42.51 | 7600 | 27 | 98.84 | – | – | –
DarkNet-53 | 40.56 | 7118 | 39 | 99.01 | – | – | –
SCN | – | – | 35 | 99.61 | – | – | –
ResNet50+CBAM+FcaNet | 26.04 | 4114 | 31 | 93.87 | 94.35 | 87.33 | 88.71
CNN | 1.25 | – | – | 99.05 | – | – | –
Teacher (ours) | 58.18 | 11820 | 22 | 100 | 100 | 100 | 100
Bottleneck (ours) | 0.07 | 177.02 | 200 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00

4.2. Visualization of the Training Process

To illustrate that the performance of the student model is limited by the prior knowledge of the teacher model in the original KD, this work plots the curves of the "Soft loss" and "Hard loss" during network training. As shown in Fig. 6, the blue solid line (BS) represents the soft loss between the teacher network and the student network, and the blue dotted line (BD) shows the hard loss between the student network and the actual labels. When the epoch is less than 50, the BD decreases rapidly under the guidance of the teacher network, which indicates that the network converges rapidly. However, the student network excessively focuses on imitating the output of the teacher network while neglecting its own self-learning in the second half of training. Specifically, it pays too much attention to minimizing the "Soft loss" and ignores the "Hard loss", which limits the further learning of the student network. Figure 6 clearly demonstrates that the introduction of SOKD significantly reduces the "Hard loss", as represented by the red dashed line. In this way, the student network learns quickly under the guidance of the teacher network during the initial stage of training, and gradually gets rid of the constraints from the teacher's prior knowledge in the second half of training.

Fig. 6. Training loss visualization curve. (Online version in color.)

4.3. Ablation Study

This work conducts ablation studies on the NEU-CLS dataset to investigate the impact of various factors on the model. As summarized in Table 5, the baseline refers to a student network consisting of only one residual unit, whose evaluation metrics are below 90% without the guidance of the teacher network. After introducing KD, the Accuracy, Precision, Recall, and F1 score of the student model improve by 10.6%, 10.68%, 10.6%, and 10.79%, respectively. After implementing the cosine annealing learning rate decay scheme, the Accuracy, Precision, Recall, and F1 score of the student model each improve by a further 0.18%. Furthermore, applying SOKD enables the student network to achieve 100% recognition performance.

Table 5. Ablation experimental results.

Methods | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Baseline | 88.92 ± 0.15 | 88.85 ± 0.14 | 88.92 ± 0.15 | 88.73 ± 0.16
+ KD | 99.52 ± 0.10 | 99.53 ± 0.10 | 99.52 ± 0.10 | 99.52 ± 0.10
+ Cosine | 99.70 ± 0.10 | 99.71 ± 0.10 | 99.70 ± 0.10 | 99.70 ± 0.10
+ SOKD | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00 | 100 ± 0.00

5. Conclusion

Knowledge distillation (KD) is a widely recognized technique for model compression. In this work, we discussed the influence of the teacher model's prior knowledge on the student model during the training phase, and proposed a soft optimization knowledge distillation (SOKD) scheme. SOKD frees the student model from the constraints imposed by the teacher model's prior knowledge and can be applied to a wide range of convolutional neural networks. The effectiveness of SOKD in overcoming the limitations of prior knowledge is demonstrated by the visualized loss curves. Ultimately, after introducing SOKD training, the student model achieves state-of-the-art performance in recognizing surface defects on hot-rolled strips.

Acknowledgments

This work is financially supported by the Anhui Province Collaborative Innovation Project (Nos. GXXT-2022-053, GXXT-2022-050, GXXT-2023-021), National Natural Science Foundation of China (Nos. 62172004, 62072002, and 61872004), Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling (No. GJZZX2021KF02), Educational Commission of Anhui Province (No. 2022AH050336) and Open Fund of Anhui Engineering Research Center for Intelligent Applications and Security of Industrial Internet (No. IASII24-08).

References
 
© 2025 The Iron and Steel Institute of Japan.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs license.
https://creativecommons.org/licenses/by-nc-nd/4.0/