ISIJ International
Online ISSN : 1347-5460
Print ISSN : 0915-1559
ISSN-L : 0915-1559
Regular Article
Resformer-Unet: A U-shaped Framework Combining ResNet and Transformer for Segmentation of Strip Steel Surface Defects
Kun Lu, Wenyan Wang, Xuejuan Pan, Yuming Zhou, Zhaoquan Chen, Yuan Zhao, Bing Wang

2024 Volume 64 Issue 1 Pages 67-75

Abstract

Identifying surface defects is an essential task in the hot-rolling process. Various computer vision-based classification and detection methods have achieved superior results in recognizing surface defects. However, defects typically exhibit irregular shapes and large intra-class differences, so classification and detection alone cannot accurately identify the precise locations and boundaries of defects. To address this issue, this work proposes a U-shaped encoder-decoder framework called Resformer-Unet, which can effectively detect surface defects of hot-rolled strip at the pixel level. In this framework, a Convolutional Neural Network (CNN) and a Transformer work in parallel to extract multi-scale features from the image, which enhances the ability of the network to capture both global and local information. Additionally, feature coupling modules are employed to fuse multi-scale features, thereby compensating for the information loss that occurs during down-sampling. On the SD-saliency-900 dataset for strip steel surface defect segmentation, Resformer-Unet achieves a mean Dice Similarity Coefficient (DSC) of 89.96% and an average Hausdorff Distance (HD) of 12.03%. These results outperform those of several advanced methods.

1. Introduction

Hot-rolled strip is a crucial raw material in both production and daily life. However, the surface of strip steel is prone to quality defects such as cracks, patches, and scratches due to limitations in the production process and environment. These defects not only damage the appearance of the strip but also cause alterations in its physical and chemical properties. Consequently, it is necessary to automatically identify the surface defects of the hot-rolled strip to ensure its quality.

Benefiting from advances in machine learning and computer vision, the automatic detection of surface defects in strip steel is continuously improving. Luo et al. applied selectively dominant local binary patterns and a nearest neighbor classifier to identify defects.1) Feng et al. utilized ResNet50 in combination with FcaNet and an attention module to recognize defects.2) Fu et al. proposed a multi-scale pooling CNN to accomplish high-accuracy classification of steel surface defects.3) Li et al. designed an attention layer and applied it to a shallow convolutional neural network to enhance the anti-noise ability of the model in identifying strip surface defects.4) Wang et al. proposed an improved deep learning model to address the issue of poor defect classification accuracy when only a small number of labeled samples is available.5) Although these methods improve the accuracy and efficiency of identifying surface defects on strips, they still have certain limitations. For instance, classification alone cannot determine the exact location of defects, nor can it handle situations where multiple types of defects are present in a single image. Object detection is gradually addressing the challenges of identifying and locating various surface defects. Wang et al. developed a modular encoding and decoding network with a lightweight architecture to address this problem.6) Li et al. proposed a novel method for detecting surface defects on steel strips by embedding an attention mechanism into the backbone network of YOLOv4.7) Chen et al. addressed the problem of complex and irregular defect distributions by proposing a rapid detection network that integrates deformable convolution and attention.8) Wei et al. enhanced the Faster R-CNN network by introducing weighted region-of-interest pooling and deformable convolution modules to improve its ability to detect small and irregular defects.9) While detection algorithms can identify multiple types of defects in a single image, they usually only locate a rough rectangular region around each defect and cannot accurately recognize defect boundaries. Furthermore, capturing the irregular shapes of defects in industrial production poses a significant challenge. Accordingly, it is essential to develop an effective algorithm for segmenting surface defects on strips.

Since U-Net was proposed in 2015, numerous studies have demonstrated the effectiveness of the U-shaped symmetrical encoder-decoder structure for image segmentation tasks.10) Res-UNET integrates skip connections and a weighted attention mechanism into the U-Net architecture to address the issue of target information loss caused by the light source.11) The Dense-Unet model enhances the learning and generalization capabilities of the U-Net structure by incorporating dense convolution.12) DSUNet improves the segmentation performance for surface defects in hot-rolled strips by replacing the traditional convolution layers with depth-wise separable convolutions.13) In recent years, other U-shaped approaches to image segmentation, such as UNet++, UNet3+, and SegNet, have also demonstrated remarkable performance.14,15,16) These methods achieve outstanding results not only because of the simplicity and strength of the U-shaped structure, but also because CNNs collect powerful local features as image representations through stacked network layers.17,18,19) However, CNN-based methods have limitations in capturing global features and representing long-distance relationships between visual elements due to the local nature of convolution.20) Expanding the receptive field of the network with large convolution kernels and multi-resolution pyramids can mitigate this to a certain extent.21,22,23,24) Nevertheless, this approach inevitably increases the computational cost, and the benefit is limited.

Recently, the Transformer model has proven effective in natural language processing tasks by introducing a multi-head self-attention mechanism that enables the extraction and memorization of long-distance information.25) As a result, it has been introduced into the domain of computer vision. The Vision Transformer (ViT) has shown superior performance compared to CNN models with similar numbers of parameters on multiple image classification benchmarks, particularly when the amount of data is sufficiently large.26) PVT overcomes the challenge of adapting the Transformer model to diverse dense prediction tasks.27) The Swin Transformer introduces an innovative sliding window mechanism that enables the model to learn cross-window information.28) Swin-Unet proposes a pure Transformer architecture, similar to U-Net, that utilizes the Swin Transformer block as the fundamental unit for medical image segmentation tasks.20) Although Transformers are capable of acquiring global feature representations that incorporate spatial transformations and long-distance dependencies, they often overlook local feature details, which can reduce the distinguishability between background and foreground.29) Therefore, enhancing the ability of a model to simultaneously extract both local and global information from an image remains a challenging yet promising task.

In this work, we propose a dual-branch network, Resformer-Unet, which combines the strengths of CNNs and Transformers to enhance the model's ability to learn feature representations of strip surface defects. Resformer-Unet employs a powerful image feature extractor in which a CNN and a Transformer extract multi-scale features from the input image in parallel. Furthermore, the feature representations learned by the model are fully utilized through multiple feature coupling modules. Finally, according to the requirements of the actual segmentation task, a loss function composed of cross-entropy loss and Dice loss is designed. The performance of Resformer-Unet is evaluated on the SD-saliency-900 dataset, and the experimental results indicate that our proposed model is capable of accurately segmenting strip steel surface defects.30)

2. Method

2.1. Overall Network Structure

The overall network structure of Resformer-Unet is illustrated in Fig. 1; it comprises an encoder, a decoder, a bottleneck, and feature coupling modules. The encoder is primarily used to extract features from the input images, while the decoder restores the feature maps to the resolution of the input image. The bottleneck consists of a single Transformer block to accelerate network convergence. The feature coupling modules act as bridges between the encoder and decoder output features, attending to both the shallow texture features and the deep semantic features of the image to compensate for the spatial information loss caused by down-sampling.

Fig. 1. The overall architecture of Resformer-Unet.

2.2. Encoder

As shown in Fig. 1 (Encoder), the encoder is composed of a CNN and Transformer branch. Specifically, the CNN branch follows the skip connection structure of ResNet18 and employs three Resblocks to extract features of multiple scales.31) Each Resblock consists of two layers of 3×3 convolutions with the same number of channels, as shown in Fig. 2(a). This design enhances the capability of feature learning while maintaining the feature dimension.
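
For concreteness, the following is a minimal PyTorch sketch of such a Resblock, assuming the standard ResNet18 basic-block layout (the batch normalization and ReLU placement are assumptions; the paper does not specify them):

```python
import torch
import torch.nn as nn

class Resblock(nn.Module):
    """Illustrative Resblock: two 3x3 convolutions with an unchanged channel
    count and an identity skip connection, in the style of the ResNet18 basic
    block. A reconstruction for illustration, not the authors' released code."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # skip connection keeps the dimension
```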

Fig. 2. The structure of the Encoder consists of two branches: (a) a CNN branch, (b) a Transformer branch.

As another branch of the encoder, the Transformer is primarily designed to compensate for the insufficiency of the CNN in global feature extraction. According to Fig. 1 (Encoder), this branch is composed of three Transblocks. Specifically, each Transblock includes one Patch Merging layer and two successive Swin Transformer blocks, as depicted in Fig. 2(b). The Patch Merging layer performs feature splicing and dimensionality reduction, down-sampling the input features while increasing the feature dimension. After each Patch Merging layer, the resolution of the features is halved, while the feature dimension is doubled.20) As shown in Fig. 3(b), a Swin Transformer block consists of a LayerNorm (LN) layer, residual connections, a multi-head self-attention (MSA) module, and an MLP. The two successive Swin Transformer blocks utilize distinct approaches to compute self-attention: the window-based MSA module (W-MSA) and the shifted window-based MSA module (SW-MSA). These blocks are primarily used for feature learning without changing the input resolution and dimension. The calculation formulas are as follows:

  
$$\hat{z}^{l} = \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1} \tag{1}$$

$$z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l} \tag{2}$$

$$\hat{z}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l} \tag{3}$$

$$z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1} \tag{4}$$

where $\hat{z}^{l}$ and $\hat{z}^{l+1}$ denote the output features of the W-MSA and SW-MSA modules, respectively, LN refers to the LayerNorm layer, and $z^{l}$ and $z^{l+1}$ represent the outputs of the two Swin Transformer blocks, respectively.
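
As an illustration, the residual and normalization structure of Eqs. (1)–(4) can be orchestrated in PyTorch as follows; the w_msa, sw_msa, and MLP sub-modules are assumed to come from an existing Swin Transformer implementation:

```python
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Sketch of Eqs. (1)-(4): a W-MSA block followed by an SW-MSA block, each
    with pre-LayerNorm, a residual connection, and an MLP. The attention and
    MLP sub-modules are injected, since their internals are defined elsewhere."""

    def __init__(self, dim: int, w_msa: nn.Module, sw_msa: nn.Module,
                 mlp1: nn.Module, mlp2: nn.Module):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.w_msa, self.sw_msa = w_msa, sw_msa
        self.mlp1, self.mlp2 = mlp1, mlp2

    def forward(self, z):
        z_hat = self.w_msa(self.norm1(z)) + z      # Eq. (1)
        z = self.mlp1(self.norm2(z_hat)) + z_hat   # Eq. (2)
        z_hat = self.sw_msa(self.norm3(z)) + z     # Eq. (3)
        z = self.mlp2(self.norm4(z_hat)) + z_hat   # Eq. (4)
        return z
```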

Fig. 3. The architecture of Swin Transformer block. (a) MLP, (b) Two successive Swin Transformer blocks, (c) Multi-head self-attention module.

The MSA structure, illustrated in Fig. 3(c), includes two Linear layers and a Dropout layer, and outputs the result through Softmax. The computing formula is similar to that of previous work:25,32)

  
$$A(Q, K, V) = \text{Softmax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V \tag{5}$$

where Q, K, and V represent the query, key, and value vectors, respectively, d denotes the dimension of the query or key vector, and B is a learnable relative position bias. Q, K, and V are obtained by multiplying the input embedding vector with learnable weight matrices. To consider all positions in the input sequence simultaneously, the self-attention mechanism calculates a weighted sum for each input position. Among these variables, Q determines the focus of the self-attention mechanism, K represents the key information of each position in the input sequence, and V contains the information relevant to each position. Specifically, as shown in Formula (5), the dot product of Q and K is computed and then scaled; scaling prevents the dot product from becoming too large, which would harm the stability of the subsequent Softmax function. The scaled results are transformed into a probability distribution by the Softmax function, which ensures that the attention weights sum to 1. Finally, the attention weights output by Softmax are multiplied by V, and all positions are weighted and summed to obtain the weighted representation of each position with respect to the current position. Figure 3(a) illustrates the structure of the MLP layer, which comprises a Linear layer, a GELU (Gaussian Error Linear Unit) activation function, and a Dropout layer. Compared with other commonly used activation functions, such as ReLU and sigmoid, GELU exhibits smoother nonlinear characteristics, which contributes to improving model performance.33) The Dropout layer introduces randomness during training, making the network more robust.
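
For reference, Formula (5) reduces to a few lines of PyTorch; the tensor shapes below are illustrative assumptions for window-based attention:

```python
import math
import torch

def window_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     rel_pos_bias: torch.Tensor) -> torch.Tensor:
    """Sketch of Formula (5): scaled dot-product attention with a learnable
    relative position bias B. Assumed shapes: q, k, v are
    (num_windows, num_heads, N, d) and rel_pos_bias is (num_heads, N, N)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # QK^T / sqrt(d)
    scores = scores + rel_pos_bias                   # add the bias B
    attn = scores.softmax(dim=-1)                    # weights sum to 1 per query
    return attn @ v                                  # weighted sum over positions
```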

2.3. Decoder

The decoder architecture consists of two up-sampling branches: Nearest Neighbor Interpolation (NNI) and Transformer, as illustrated in Fig. 1 (Decoder). Concretely, the NNI branch contains three Nearestup modules, as depicted in Fig. 4(a). Given the encoder features, the resolution of the feature map is increased using the nearest neighbor interpolation algorithm, followed by dimension reduction through a 3×3 convolution.
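
A possible PyTorch realization of one Nearestup module is sketched below; halving the channel dimension with the 3×3 convolution is an assumption that mirrors the Transup behavior described next:

```python
import torch.nn as nn

class Nearestup(nn.Module):
    """Illustrative Nearestup module: nearest neighbor interpolation doubles
    the spatial resolution, then a 3x3 convolution reduces the channel
    dimension (assumed here to halve it)."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.reduce = nn.Conv2d(in_channels, in_channels // 2, kernel_size=3, padding=1)

    def forward(self, x):
        return self.reduce(self.up(x))
```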

Fig. 4. The architecture of the decoder consists of two branches: (a) Nearest neighbor interpolation up-sampling branch, (b) Transformer up-sampling branch.

Similarly, the Transformer branch includes three Transup modules, as shown in Fig. 1 (Decoder). Each Transup module comprises one Patch Expanding layer and two successive Swin Transformer blocks, as shown in Fig. 4(b). After each Transup module, the resolution of the feature map is doubled, while the feature dimension is halved.20)

2.4. Feature Coupling Module

To extract more comprehensive features from input images and make full use of the feature information, the network includes several feature coupling modules. Firstly, the multi-scale features extracted by the CNN and Transformer branches are superimposed in the encoder. This increases the amount of information describing the image without changing the feature dimension, enhancing the network's ability to recognize various types of defects. Secondly, the feature information obtained by the NNI branch is incorporated into the Transformer branch of the decoder, enriching the feature information after up-sampling. The most crucial step is to concatenate the multi-scale features extracted by the encoder with the corresponding up-sampled features in the decoder through skip connections, which compensates for the loss of feature information during down-sampling. This method not only utilizes the semantic information from feature maps of different scales but also considers both local and global features of the image, thereby obtaining competitive segmentation results.
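
Two of these fusion steps are sketched below under stated assumptions: the encoder fusion is modeled as element-wise addition (consistent with superimposing features without changing the dimension), and the 1×1 convolution used to restore the channel count after the skip-connection concatenation is a hypothetical choice:

```python
import torch
import torch.nn as nn

def couple_encoder_features(cnn_feat: torch.Tensor, trans_feat: torch.Tensor) -> torch.Tensor:
    """Encoder fusion sketch: element-wise addition leaves the feature
    dimension unchanged. Assumes the two branches output matching shapes."""
    return cnn_feat + trans_feat

class SkipFusion(nn.Module):
    """Skip-connection fusion sketch: concatenate an encoder feature with the
    corresponding decoder feature along channels, then restore the channel
    count. The 1x1 projection is an assumption for illustration."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([enc_feat, dec_feat], dim=1))
```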

2.5. Loss Function

The loss function designed in this work is calculated as follows:

  
$$\text{Loss} = \omega L_{ce} + (1 - \omega) L_{dice} \tag{6}$$

where $L_{ce}$ represents the cross-entropy loss used to identify defects of different categories, and $L_{dice}$ denotes the Dice loss, which measures the similarity between the segmented area and the label mask to assist the model in predicting defects. Whereas $L_{ce}$ computes a classification loss for each pixel individually, $L_{dice}$ places greater emphasis on mining the foreground region during training. Accordingly, this work increases the proportion of $L_{dice}$ in the total loss function: ω is set to 0.3, giving a weight ratio of 3:7 between $L_{ce}$ and $L_{dice}$.
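
A minimal PyTorch sketch of Eq. (6) with ω = 0.3 is given below; the soft Dice formulation and the smoothing term eps are common conventions assumed here, not details taken from the paper:

```python
import torch
import torch.nn.functional as F

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss over one-hot targets; a common formulation assumed here.
    logits: (B, C, H, W) raw scores; target: (B, H, W) integer class labels."""
    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.size(1)).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def total_loss(logits: torch.Tensor, target: torch.Tensor, omega: float = 0.3) -> torch.Tensor:
    """Eq. (6): Loss = omega * L_ce + (1 - omega) * L_dice, with omega = 0.3."""
    return omega * F.cross_entropy(logits, target) + (1 - omega) * dice_loss(logits, target)
```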

3. Results

3.1. Datasets

The SD-saliency-900 dataset is a standardized, high-quality database for hot-rolled strip surface defect segmentation.30) Figure 5 exemplifies the three categories of raw defects in the dataset, i.e., inclusions, patches, and scratches. The resolution of each raw image is 200×200, and each class contains 300 images with pixel-level labels. To evaluate the usability and generality of Resformer-Unet, 720 images are randomly selected as the training dataset, and the remaining 180 images serve as the testing dataset. The training dataset comprises 240 inclusion images, 240 patch images, and 240 scratch images. Moreover, the training samples are randomly flipped or rotated to increase their diversity.
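
As an illustration of this augmentation, the sketch below applies the same random flip or rotation jointly to an image and its pixel-level label so the pair stays aligned; the specific transforms and probabilities are assumptions, as the paper does not detail them:

```python
import random
from torchvision.transforms import functional as TF

def augment(image, mask):
    """Joint random flip/rotation for an image and its segmentation mask.
    Rotation angles are multiples of 90 degrees so the square frame is kept."""
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    if random.random() < 0.5:
        image, mask = TF.vflip(image), TF.vflip(mask)
    angle = random.choice([0, 90, 180, 270])
    if angle:
        image, mask = TF.rotate(image, angle), TF.rotate(mask, angle)
    return image, mask
```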

Fig. 5. The samples of various types of defects: (a) inclusions, (b) patches, (c) scratches.

3.2. Implementation Details

This work conducted experiments using the PyTorch deep learning framework on a platform consisting of an Intel(R) Core(TM) i7-6700 CPU, an NVIDIA RTX 3090 GPU, and the Windows 10 operating system. The input image size was set to 224×224 pixels with a learning rate of 0.001, and cosine annealing with warmup was employed to schedule the learning rate, as illustrated in Fig. 6(a). To ensure sufficient training, we set the batch size to 24 and the maximum number of epochs to 100. During backpropagation, the network parameters were updated using the AdamW optimizer.34) The training loss steadily decreased and gradually converged by around 2500 steps, as presented in Fig. 6(b).
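
The optimizer and learning-rate schedule described above could be set up as in the sketch below; the warmup length is an assumption, since the paper does not report it:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer_and_scheduler(model, total_steps: int, warmup_steps: int = 100):
    """AdamW with linear warmup followed by cosine annealing (lr = 0.001),
    matching the setup described above; warmup_steps is a hypothetical value."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:  # linear warmup phase
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

    return optimizer, LambdaLR(optimizer, lr_lambda)
```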

Fig. 6. Network training process: (a) Learning rate change curve, (b) Total loss change curve. (Online version in color.)

3.3. Evaluation Metrics

To evaluate the segmentation performance of Resformer-Unet, two metrics are used in this work, i.e., the Dice-Similarity Coefficient (DSC) and the average Hausdorff Distance (HD). The DSC measures the similarity between the segmentation result produced by the model and the label.35) This metric mainly focuses on the internal pixels within the defect area, and a higher DSC value indicates that the defect area predicted by the model is closer to the actual marked area.36) DSC is defined as follows:

  
$$DSC = \frac{2|X \cap Y|}{|X| + |Y|} \tag{7}$$

where X is the predicted segmentation result, Y is the ground truth label, and |X| and |Y| represent the number of pixels in X and Y, respectively. The value of DSC ranges from 0 to 1, where a value of 1 indicates a perfect match between the predicted segmentation result and the ground truth label.
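
Eq. (7) translates directly into a few lines of NumPy for a single binary class mask; treating two empty masks as a perfect match is an assumption:

```python
import numpy as np

def dsc(pred: np.ndarray, label: np.ndarray) -> float:
    """Eq. (7): DSC = 2|X ∩ Y| / (|X| + |Y|) for binary masks."""
    pred, label = pred.astype(bool), label.astype(bool)
    denom = pred.sum() + label.sum()
    if denom == 0:  # both masks empty: define as a perfect match
        return 1.0
    return 2.0 * np.logical_and(pred, label).sum() / denom
```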

HD measures the distance between the pixel set predicted by the model and that of the corresponding label,37,38) and is defined as follows:

  
$$HD = 95\% \times \max\left(h(X, Y), h(Y, X)\right) \tag{8}$$

where X represents the set of pixels predicted by the model, Y denotes the ground truth pixel set corresponding to X, and h(X, Y) represents the one-way Hausdorff distance from X to Y, which is defined as follows:

  
$$h(X, Y) = \max_{a \in X}\left\{\min_{b \in Y} \|a - b\|\right\} \tag{9}$$

Finally, the larger of the two one-way distances is multiplied by 95% to suppress the influence of outliers and keep the overall value stable.
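
A sketch of Eqs. (8)–(9) using SciPy's directed Hausdorff distance is shown below. It follows the text literally in treating the 95% as a scaling factor; note that many toolkits instead take the 95th percentile of the boundary distances:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hd_95(pred_pts: np.ndarray, label_pts: np.ndarray) -> float:
    """Eqs. (8)-(9): the larger one-way Hausdorff distance scaled by 95%.
    pred_pts and label_pts are (N, 2) arrays of defect pixel coordinates."""
    h_xy = directed_hausdorff(pred_pts, label_pts)[0]  # h(X, Y), Eq. (9)
    h_yx = directed_hausdorff(label_pts, pred_pts)[0]  # h(Y, X)
    return 0.95 * max(h_xy, h_yx)                      # Eq. (8)
```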

3.4. Performance of Resformer-Unet

The performance of Resformer-Unet on the SD-saliency-900 dataset is illustrated in Fig. 7. Figure 7(a) shows the ground truth of the dataset, with inclusions, patches, and scratches displayed from left to right, and Fig. 7(b) shows the corresponding predictions of our model. It can be observed that our model effectively segments most of the surface defects.

Fig. 7. Segmentation effect. (a) Ground Truth, (b) Prediction results of Resformer-Unet.

The specific experimental results are presented in Table 1, which summarizes the DSC and HD for the three categories of defects and the background. The mean DSC over all categories is 89.96%, and the mean HD is 12.03%. The background category achieves the highest DSC of 98.59% and the lowest HD of 4.95%, indicating that our model can effectively distinguish background areas from defect regions. Among the three types of defects, patches exhibit the highest DSC of 91.43% and the lowest HD of 11.49%. In contrast, inclusions have the lowest DSC and the worst HD. These results suggest that Resformer-Unet recognizes patch defects best, while inclusion defects are the most challenging to identify.

Table 1. Segmentation results of Resformer-Unet on SD-saliency-900.

Classes | DSC (%) ↑ | HD (%) ↓
Background | 98.59 | 4.95
Inclusion | 82.17 | 20.17
Patches | 91.43 | 11.49
Scratches | 87.67 | 11.53
Mean | 89.96 | 12.03

3.5. Comparison with Other Methods

This work conducted the main experiments on the SD-saliency-900 dataset by comparing Resformer-Unet with eight previous state-of-the-art models: 1) U-Net;10) 2) ENet;39) 3) PSPNet;24) 4) SegNet;16) 5) Attention U-Net;40) 6) UNet++;14) 7) Swin-Unet;20) 8) TransUNet.41)

The experimental results presented in Table 2 demonstrate that the proposed Resformer-Unet achieves the highest segmentation accuracy, with a mean DSC of 89.96% and a mean HD of 12.03% on the SD-saliency-900 dataset. Compared to the CNN-based methods U-Net, ENet, PSPNet, SegNet, Attention U-Net, and UNet++, our model improves the mean DSC by 2.31%, 2.41%, 1.67%, 1.38%, 1.3%, and 1.88%, respectively, and reduces the mean HD by 2.14%, 2.42%, 1.81%, 3.21%, 0.9%, and 2.31%, respectively. It is worth noting that PSPNet has multiple versions; this work implements the version based on the ResNet50 backbone. Although Resformer-Unet does not show a significant improvement in mean DSC over the purely Transformer-based Swin-Unet, it reduces the mean HD by 2.36%, indicating that our model obtains competitive boundary predictions. Furthermore, although TransUNet combines a Transformer with a CNN to strengthen features, its encoder does not take into account the local connections between adjacent blocks. Consequently, Resformer-Unet outperforms TransUNet in both the mean DSC and HD metrics. To compare the different models more intuitively, this work visualizes the segmentation results of all methods for the three types of defects, as shown in Fig. 8. The results demonstrate that the segmentation output of Resformer-Unet is the closest to the ground truth label.

Table 2. Comparison with different methods on SD-saliency-900.

Methods | Mean DSC (%) ↑ | Mean HD (%) ↓
U-Net10) | 87.65 | 14.17
ENet39) | 87.55 | 14.45
PSPNet (ResNet50)24) | 88.29 | 13.84
SegNet16) | 88.58 | 15.24
Attention U-Net40) | 88.66 | 12.93
UNet++14) | 88.08 | 14.34
Swin-Unet20) | 89.60 | 14.39
TransUNet41) | 89.17 | 13.86
Resformer-Unet | 89.96 | 12.03

Fig. 8. The segmentation results obtained from various approaches. (a) From top to bottom: GroundTruth, Resformer-Unet, U-Net, ENet, and PSPNet. (b) From top to bottom: SegNet, Attention U-Net, UNet++, Swin-Unet, and TransUNet.

3.6. Ablation Study

To explore the impact of different design choices on the performance of Resformer-Unet, this work conducted ablation studies on the same dataset. The controlled variables include the up-sampling method, the number of skip connections, and the use of pretrained weights.

3.6.1. Up-sampling Methods

The performance of the model is significantly affected by the up-sampling strategy used in the decoder. According to the results presented in Table 3, the decoder with nearest neighbor interpolation demonstrates the best segmentation performance. Its mean DSC is 1.76% and 0.45% higher than those of transposed convolution and bilinear interpolation, respectively. Conversely, the mean HD of transposed convolution and bilinear interpolation is 2.81% and 2.09% higher, respectively, than that of nearest neighbor interpolation. Therefore, nearest neighbor interpolation is the most suitable up-sampling approach among the three methods.

Table 3. The effect of up-sampling on model accuracy.

Up-sampling | Mean DSC (%) ↑ | Mean HD (%) ↓
Transposed convolution | 88.20 | 14.84
Bilinear interpolation | 89.51 | 14.12
Nearest neighbor interpolation | 89.96 | 12.03

3.6.2. Skip Connections

The skip connection is a crucial feature fusion link in the feature coupling module of Resformer-Unet. It aggregates the multi-scale features extracted by the encoder and compensates for the loss of information caused by down-sampling. The scale of the fused features is determined by the number of skip connections. The experimental results presented in Table 4 show that the mean DSC improves steadily as the number of skip connections increases, and that with three skip connections the model attains the best values on both metrics, indicating that the performance of Resformer-Unet improves as more feature scales from the encoder and decoder are fused.

Table 4. The effect of the number of skip connections on model accuracy.

Skip connections | Mean DSC (%) ↑ | Mean HD (%) ↓
0 | 89.13 | 12.99
1 | 89.68 | 13.20
2 | 89.84 | 12.32
3 | 89.96 | 12.03

3.6.3. Pretrained Weights

Pre-training is a widely recognized strategy for training deep learning models.42) In this work, several experimental scenarios were set up to assess the impact of pre-training: 1) not using any pre-trained weights; 2) using ResNet18 weights pre-trained on ImageNet; 3) using Swin Transformer weights pre-trained on ImageNet; and 4) using both sets of pre-trained weights. According to the experimental results in Table 5, fine-tuning proved effective for both the convolutional and Transformer modules. Specifically, with ResNet18 pre-trained weights the model improved by 0.54% in mean DSC and 1.07% in mean HD. The Transformer pre-trained weights brought even more significant improvements, with gains of 1.27% in mean DSC and 3.52% in mean HD. Interestingly, loading both sets of pre-trained weights simultaneously did not yield further gains, as shown in Table 5.

Table 5. The effect of pretrained weights on model accuracy.

Pretrained weights | Mean DSC (%) ↑ | Mean HD (%) ↓
None | 88.69 | 15.55
ResNet18 | 89.23 | 14.48
Transformer | 89.96 | 12.03
Transformer + ResNet18 | 88.02 | 14.84

4. Conclusions

In this work, we propose a network architecture based on a U-shaped encoder-decoder for the segmentation of surface defects on hot-rolled strips. To effectively capture both local features and long-distance semantic interactions, a powerful feature extractor combining the dual structure of ResNet and Swin Transformer is introduced. Additionally, multiple feature coupling modules are applied to compensate for the information loss during the down-sampling process. Extensive experiments conducted on the SD-saliency-900 dataset demonstrate that the Resformer-Unet model achieves superior performance in the task of hot-rolled strip surface defect segmentation.

However, limited by the complexity of the Transformer structure, Resformer-Unet still faces challenges in practical applications. Furthermore, our model currently handles only three types of surface defects due to the scarcity of labeled segmentation samples. In future work, we will therefore further optimize the network structure and collect more data to accelerate its deployment in industrial scenarios.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Nos. 62172004, 61672035, and 61872004), Educational Commission of Anhui Province (No. KJ2019ZD05), Anhui University of Technology Youth Foundation (QZ202207).

References
 
© 2024 The Iron and Steel Institute of Japan.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs license.
https://creativecommons.org/licenses/by-nc-nd/4.0/