2024 Volume 5 Issue 1 Pages 124-134
The aging of concrete structures such as bridges and tunnels has led to the manifestation of damage, posing a significant problem. Particularly, the detection, evaluation, and documentation of cracks, which are a crucial indicator affecting the rate of deterioration, require an immense amount of time and effort. Consequently, the development of automatic detection methods using machine learning techniques has been pursued. However, the automatic pixel-level detection of cracks from captured images necessitates a large quantity of teacher images labeled at the same pixel level, which are costly to produce. Creating these images is not straightforward and has been a barrier to the practical implementation of image analysis methods. In response, this study developed a technique for detecting cracks at the pixel level while reducing the cost of creating teacher data, utilizing the attention mechanism. Additionally, the accuracy of this method was evaluated using captured images, confirming its equivalence to existing detection methods in terms of precision. This paper is the English translation of the authors’ previous work [Izumi and Chun, (2021). "Crack detection using deep learning with attention mechanisms" Artificial Intelligence and Data Science, 2(J2), 545-555. (in Japanese)].
In recent years, the manifestation of damage due to the aging of infrastructure such as bridges and tunnels has become a significant concern. For instance, in Japan, following the 2012 ceiling collapse of the Sasago Tunnel on the Chuo Expressway, a mandatory regular inspection every five years was instituted for bridges and tunnels. However, the number of concrete structures requiring inspection is immense, and there is a shortage of skilled inspection technicians. For the critical inspection of crack detection, the current practice involves inspectors visually checking and sketching the cracks, a process that is highly inefficient and time-consuming. Consequently, there has been an increasing amount of research into the automatic detection of cracks from photographs of concrete surfaces to improve the efficiency of crack inspections.
Cha et al.1) have developed a method that reduces the impact of varying photography conditions by processing the images, dividing them into small regions, and then using deep learning models to determine whether these regions contain cracks, thereby enabling the identification of the general areas where cracks are present. Yokoyama et al.2) have conducted multi-class classification, including the presence of cracks, multiple cracks, and construction joints. Furthermore, using social networking services, they developed and published a detector that can be used by anyone with an internet connection. Additionally, we have advanced research, as detailed in the references (3) to (8), on the precise detection of cracks at the pixel level using deep learning models, and on further improving the accuracy. Thus, research on the automatic detection of cracks using machine learning methods is actively being pursued, with numerous other studies also being conducted in this field.
There are three primary types of teacher data used for crack detection, chosen based on the specific task to be performed by AI. Fig. 1 illustrates the types of teacher data corresponding to each task and their associated creation costs. For instance, in studies like that of Cha et al.1), where original photographs are divided into small regions for individual assessment, a classification dataset as shown on the left of Fig. 1 is used. While this approach keeps the cost of creating teacher data low, it can only identify the approximate areas where cracks are present in the image. Alternatively, when using deep learning-based object detection models to identify crack regions, it is necessary to provide teacher data in the form of coordinates outlining the target, as depicted in the center of Fig. 1 2). Although this method is less costly than specifying cracks at the pixel level, it may not adequately address the need for more detailed information in infrastructure inspections and diagnostics, such as crack length, width, or shape. Therefore, research has mainly been conducted on methods that learn and detect from pixel-level crack annotations, as shown on the right of Fig. 1 3),7). However, creating such teacher images requires a significant amount of time and effort. While some datasets, including those of the authors6),9), have been published, their effectiveness often diminishes with variations in photography environments and structural targets. Consequently, to achieve high accuracy, it becomes necessary to create new datasets suited to specific environments, which requires a substantial effort and is a barrier to the practical application of image analysis methods.
In this study, we have developed a crack detection method using a model that incorporates the attention mechanism into deep learning, a technique widely used in previous methods. The attention mechanism, primarily contributing to accuracy improvements in the field of natural language processing, is also gaining attention in image processing. By integrating this into a model that performs classification only, and visualizing the output of the attention mechanism during detection, we have developed a method that can identify the location and shape of cracks. This approach eliminates the need for pixel-level teacher data that has previously been required, allowing for pixel-level detection using only the classification dataset, which is less costly to create.
The crack detection method proposed in this study is illustrated in Fig. 2. We propose three types of crack detection method. Detection method [1] is a crack detection model that incorporates an attention mechanism into a deep learning model. It utilizes a classification dataset only and performs crack detection at the pixel level. Detection method [2] applies a pixel classification model to the output of detection method [1]. In this method, in addition to the classification dataset, a small amount of data, where crack locations are identified at the pixel level, is used to achieve high accuracy in pixel-level crack detection. Detection method [3] proposes potential crack-containing regions of the input image by performing pooling processing on the output of the detection method [1] and then conducts pixel classification within those regions. Each method will be explained in detail below. However, for parts previously proposed, only a summary will be provided.
(1) Image preprocessing
Preprocessing is performed to reduce the effects of shooting conditions. Initially, grayscale conversion of the captured image was done using the NTSC weighted average method. Subsequently, to mitigate the influence of light and shadow differences due to the shooting environment, as well as minor impurities on the concrete surface, correction was carried out using a median filter, as also utilized by Chun et al. (2021)6). The correction formula is shown in equation (1):
Here, i, j represent the pixel positions, Img(i,j) is the corrected image, ImgB(i,j) is the grayscale converted original image, bmax is the maximum brightness value of the original image, and ImgM(i,j) is the image after median filter processing. The filter size of the median filter used here was 41×41, similar to Chun et al. (2021)6). The processed image is shown in Fig. 3. This correction clearly reduces the effects of shadows and impurities. The following analysis will be conducted on this image.
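The preprocessing steps described above can be sketched as follows. The NTSC weights (0.299, 0.587, 0.114) are the standard ones. The correction formula, however, is not reproduced in this text, so the subtractive form Img = ImgB - ImgM + bmax used below is an assumption made purely for illustration and may differ from equation (1). The function names are hypothetical, and the pure-NumPy median filter is meant for small demonstration images rather than full-resolution photographs.

```python
import numpy as np

def to_gray_ntsc(rgb):
    """Grayscale conversion by the NTSC weighted average of R, G, B."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3].astype(np.float64) @ weights

def median_filter(img, size):
    """Median filter with edge padding; `size` must be odd."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    return np.median(windows, axis=(-2, -1))

def correct_background(gray, size=41):
    """ASSUMED correction: Img = ImgB - ImgM + bmax, clipped to [0, 255].

    The exact form of equation (1) is not available here; this subtractive
    variant is only one plausible median-based background correction.
    """
    background = median_filter(gray, size)   # ImgM: slowly varying lighting
    b_max = gray.max()                       # bmax: brightest original pixel
    return np.clip(gray - background + b_max, 0, 255)

# Small demonstration: a bright surface with one dark, crack-like column.
demo_gray = np.full((9, 9), 200.0)
demo_gray[:, 4] = 50.0
demo_corrected = correct_background(demo_gray, size=5)
```

In this toy example the thin dark column is narrower than the filter window, so the median image contains only the background level and the dark column survives the correction unchanged.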
(2) Detection method [1]: Detection using a deep learning model with an attention mechanism
As explained in the previous chapter, earlier machine learning and deep learning-based crack detection methods had to be trained on image datasets annotated at the pixel level in order to detect the location and shape of cracks. However, constructing such training datasets was extremely laborious. Therefore, in this study, we train a deep learning model with an attention mechanism on a dataset that is easier to create, one that only labels the presence or absence of cracks in each image. This approach enables the automatic extraction of crack features. The convolutional layer output, weighted by the attention mechanism learned in this way, is then used to detect the position and shape of the cracks.
a) Attention mechanism
The attention mechanism is a technique widely used in natural language processing, which learns and applies weights to focus on certain parts of the input data, thereby effectively conveying relevant information to later stages of the network. Recently, it has been introduced into image recognition models, contributing to improved accuracy. There are various types of attention mechanisms, and attention itself does not refer to a specific model or network. The attention mechanism gained prominence when the Transformer model (Vaswani et al. 2017)10) was proposed in natural language processing, a field until then dominated by RNN-based models. The Transformer uses neither a Recurrent Neural Network (RNN) nor a Convolutional Neural Network (CNN), relying instead on attention mechanisms and fully connected layers. It demonstrated higher accuracy and speed in machine translation tasks compared to traditional RNN-based models, showcasing the utility of the attention mechanism in natural language processing. Subsequently, in the field of image processing, it has been reported that incorporating attention mechanisms into CNNs significantly improves the accuracy of image classification at the cost of only a slight increase in computational load11).
In image recognition, there are two main types of attention mechanism. The first involves branching the output in the middle of a CNN, inputting one branch into a convolutional layer of about two layers, compressing the output between 0 and 1 through activation by the Sigmoid curve, and then applying it to the other branch’s output to learn which pixels to focus on most. The second type converts the output of the convolutional layer into a vector using Global Average Pooling, and then uses fully connected layers to learn the weights applied to the CNN filters. This is known as the SE block11). In the model constructed in this study, this SE block is used. Therefore, the next section will explain only this SE block. Additionally, models that incorporate the SE block into CNNs are known as SENet.
b) SE block
As explained in the previous section, the SE block is an attention mechanism for the filters of the convolutional layer, utilized when incorporated into a deep learning model. This attention mechanism can be easily integrated into existing networks, applying weights to the convolutional filters, thus enhancing the model’s expressiveness without significantly increasing the computational load.
The computational process inside the SE block is shown in Fig. 4. Here, H and W are the height and width of the input tensor, respectively, and C is the number of channels. In the SE block, the output from the preceding convolutional layer is first subjected to Global Average Pooling to produce a vector whose length equals the number of convolutional filters. This vector is then compressed by the first fully connected layer and restored to the filter count by the second fully connected layer. The compression rate of the first layer, denoted as r, is a hyperparameter. Finally, the output is mapped into the range between 0 and 1 by the Sigmoid function, which brings smaller inputs closer to 0 and larger inputs closer to 1. The vector generated by the SE block is then multiplied, filter by filter, by the output of the original convolutional layer. This operation applies values close to 1 to filters that should be focused on and values near 0 to those that should not, so that significant filters maintain their output while insignificant ones nearly vanish before the subsequent classifier. This weighting of convolutional filters is known to improve the model’s accuracy. In this study, we constructed the model with C set to 16 and r to 0.0625.
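The SE block computation described above can be sketched in NumPy as a forward pass only. The weights here are random placeholders, not trained values, and the ReLU between the two fully connected layers follows the original SE block design; the text above does not state which activation was used, so that choice is an assumption. With C = 16 and r = 0.0625 the compressed vector has length C × r = 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(feature_map, w1, b1, w2, b2):
    """Forward pass of an SE block on a feature map of shape (H, W, C)."""
    squeezed = feature_map.mean(axis=(0, 1))       # Global Average Pooling -> (C,)
    hidden = np.maximum(squeezed @ w1 + b1, 0.0)   # first FC layer, ReLU (assumed)
    weights = sigmoid(hidden @ w2 + b2)            # second FC layer -> (C,) in [0, 1]
    return feature_map * weights, weights          # reweight each filter's output

# Placeholder example with the sizes given in the text.
rng = np.random.default_rng(0)
C, r = 16, 0.0625
hidden_units = int(C * r)                          # 16 * 0.0625 = 1
x = rng.normal(size=(8, 8, C))
w1, b1 = rng.normal(size=(C, hidden_units)), np.zeros(hidden_units)
w2, b2 = rng.normal(size=(hidden_units, C)), np.zeros(C)
weighted, channel_weights = se_block(x, w1, b1, w2, b2)
```

The per-filter weights broadcast over the spatial dimensions, so every pixel of a given filter's output is scaled by the same factor.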
c) Model development
The structure and detection flow of the SENet constructed in this study are shown in the first and second tiers of Fig. 5. For building this model, we used Keras, a deep learning library for the programming language Python, with TensorFlow as the backend engine. The model’s structure begins with a convolutional layer, after which the output is branched, with one branch fed into the SE block. Then, the output of the SE block, a vector whose length equals the filter count, is multiplied by the other branch’s output for weighting (shown in the attention part of Fig. 5). This is then connected to the subsequent classifier comprising CNN and fully connected layers. The layer structure is detailed in Table 1.
The constructed model receives an image as input and outputs a size-2 vector, representing the probabilities of containing or not containing cracks. By training with a classification dataset to determine whether the input image contains cracks, the model automatically acquires the necessary features for classification.
The crack detection method using this trained model is shown in the second tier of Fig. 5. During detection, the attention part of the trained SENet is used to obtain the output for the test image. Since this output has as many channels as the number of filters in the first convolutional layer, they are all summed to consolidate into a single channel. In the post-processing part, normalization is performed using the minimum and maximum values, converting the output into a matrix with a minimum value of 0 and a maximum of 1. Then, a binarization process is applied with a threshold of 0.5. The resulting binary matrix is the final detection result, where 1 represents a crack and 0 represents no crack in each pixel of the input image.
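The post-processing steps above (channel summation, min-max normalization, and binarization at 0.5) can be sketched as follows. The function name is hypothetical, and two details are assumptions: the comparison is taken as "greater than or equal to" the threshold, and a constant input is mapped to an all-zero map to avoid division by zero.

```python
import numpy as np

def attention_to_mask(attention_output, threshold=0.5):
    """Convert an (H, W, C) attention-part output into a binary crack mask."""
    summed = attention_output.sum(axis=-1)         # consolidate all channels
    lo, hi = summed.min(), summed.max()
    if hi == lo:                                   # flat response (assumed handling)
        normalized = np.zeros_like(summed, dtype=np.float64)
    else:
        normalized = (summed - lo) / (hi - lo)     # min 0, max 1
    mask = (normalized >= threshold).astype(np.uint8)  # 1 = crack, 0 = no crack
    return normalized, mask

# Toy input: one strongly responding pixel in a 4 x 4 map with 2 channels.
toy = np.zeros((4, 4, 2))
toy[1, 2, :] = 1.0
toy_normalized, toy_mask = attention_to_mask(toy)
```

The normalized map returned alongside the mask is the quantity that detection methods [2] and [3] take as their starting point.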
This method allows for the detection of crack locations and shapes using only parts of the layers of the SENet trained on the classification dataset. Furthermore, as the model learns features automatically, there is no need for manual setting as in traditional machine learning. However, since it does not classify at the pixel level precisely and sums up the reactions of the convolutional layers, the detected area is slightly broader than the actual crack.
(3) Detection method [2]: Detection using a pixel classification model with the output of the attention part
As described in the previous section, we constructed a model capable of detecting cracks by training only with a classification dataset using SENet. This approach allows for a significant reduction in dataset creation costs. However, there is a tendency for this model to detect slightly larger areas than the actual cracks, making it unsuitable for precise calculations such as crack width. Therefore, we construct a very small pixel classification model that classifies at the pixel level using the normalized output of the SENet’s attention part developed in the previous section.
The flow of detection and the structure of the pixel-level classification model are shown in the third tier of Fig. 5. Initially, the output for the input image is obtained using the trained SENet’s attention part from the previous section. This output is normalized so that its minimum value is 0 and the maximum is 1, and it then serves as the input for the pixel classification model. The constructed model has three convolutional layers and outputs pixel-level classified images after activation by the Sigmoid function. Although this model requires pixel-classified teacher images for training, it has only 577 learning parameters, making it a very small model that can be trained with a limited amount of data. In this study, we used only 10% of the captured images for training. The reason this small-parameter model can detect effectively is that it uses normalized images as input. Traditional semantic segmentation models using convolutional operations require many convolutional layers to compute a broad range of information to determine whether an object is the target of detection. However, by using the output of SENet as the input, learning only the specific pixel characteristics and edge shapes of cracks is possible, which in turn allows for high-accuracy classification even with approximately three convolutional layers.
With this model, more detailed detection is possible with limited training data, ensuring sufficient accuracy for calculations such as crack measurement. Furthermore, due to the small number of parameters, not only does it require a small amount of training data, but it can also be operated with minimal computational resources in practical applications.
(4) Detection method [3]: Detection using a pixel classification model with a region proposal image
In this section, we propose a method of using the trained SENet and a Pooling Layer to propose crack regions, and then classify these at the pixel level using the proposed regions as input. The method of region proposal is shown in Fig. 6. First, the output of the trained SENet’s attention part is summed up into a single channel, then normalized using the maximum and minimum values. This is fed into a Max Pooling Layer for maximum value pooling and then restored to the size of the input image using an UpSampling Layer. In post-processing, binarization with a threshold of 0.5 is performed, and the resulting image, when multiplied by the input image, displays only the surroundings of the cracks.
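The region proposal procedure above can be sketched in NumPy. The pooling window size is not given in the text, so the value `k=8` is a hypothetical placeholder; nearest-neighbour upsampling is assumed, and the image sides are assumed to be divisible by `k`.

```python
import numpy as np

def max_pool(x, k):
    """Non-overlapping k x k max pooling; H and W must be divisible by k."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

def upsample(x, k):
    """Nearest-neighbour upsampling by an integer factor k."""
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def propose_regions(normalized, image, k=8, threshold=0.5):
    """Mask `image` so that only blocks around strong responses remain."""
    pooled = max_pool(normalized, k)               # spread each peak over its block
    restored = upsample(pooled, k)                 # back to the input image size
    region = (restored >= threshold).astype(image.dtype)
    return region * image                          # keep crack surroundings only

# Toy example: one strong response keeps its whole 4 x 4 block of the image.
toy_response = np.zeros((8, 8))
toy_response[1, 1] = 0.9
toy_image = np.full((8, 8), 7.0)
toy_proposal = propose_regions(toy_response, toy_image, k=4)
```

Because max pooling propagates a single strong pixel to its entire block, the proposal keeps some context around each response, which is the surrounding region the pixel classification model then examines.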
The region proposal image obtained in this way is used for pixel-by-pixel classification using the model constructed in the previous section. Training is conducted using only 10% of the captured images, similar to the model in the previous section.
(5) Model training
This section presents the training of the deep learning models developed in the previous sections, using the dataset created for this study. The creation of the dataset and the training of each model are described in order.
a) Dataset creation
First, we created a dataset using images preprocessed as described in (1). Initially, 100 captured images of size 3456px × 5184px were randomly divided into 80 for SENet training, 10 for pixel classification model training, and 10 for testing. Furthermore, the 80 images for SENet training were divided into 70 for training data used for model parameter updates and 10 for validation data to check the generalization performance. Similarly, the 10 images for pixel classification model training were split into 9 for training data and 1 for validation data. The images were then subdivided into smaller regions of 256px × 256px to standardize their size, as shown in Fig. 7. To augment the training data, adjacent sub-images were extracted with an overlap of 128px.
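The subdivision with overlap can be sketched as a sliding window with a tile size of 256 px and a stride of 128 px. How the image border is handled is not stated in the text; the sketch below assumes that windows which do not fit entirely inside the image are dropped, and the function name is hypothetical.

```python
import numpy as np

def tile_image(img, tile=256, stride=128):
    """Cut an image into overlapping square tiles (partial tiles are dropped)."""
    h, w = img.shape[:2]
    tiles = []
    for top in range(0, h - tile + 1, stride):
        for left in range(0, w - tile + 1, stride):
            tiles.append(img[top:top + tile, left:left + tile])
    return tiles

# Tile one image of the size used in the study.
full_image = np.zeros((3456, 5184), dtype=np.uint8)
full_tiles = tile_image(full_image)
```

Under this drop-the-remainder assumption, a 3456 × 5184 image yields 26 × 39 = 1014 tiles, since (3456 - 256) / 128 + 1 = 26 and the integer part of (5184 - 256) / 128, plus one, is 39.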
The subdivided images were classified into those containing cracks and those not for SENet training, and the images for pixel classification model training were annotated for crack regions at the pixel level. The classification of the dataset for SENet training is shown in Table 2. Since cracks occupy a smaller area than the rest in the images before subdivision, the classified data becomes imbalanced. To address this, downsampling was performed by randomly deleting images from the class with more data to eliminate this imbalance in the training data.
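The downsampling step can be sketched as random selection of an equal number of samples from each class. The function name, the fixed random seed, and the example class sizes are illustrative assumptions, not values from the study.

```python
import numpy as np

def balance_by_downsampling(labels, seed=0):
    """Return indices that keep the same number of samples in every class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()                          # size of the smallest class
    kept = [
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))

# Illustrative imbalance: 90 tiles without cracks (0) and 10 with cracks (1).
example_labels = np.array([0] * 90 + [1] * 10)
kept_indices = balance_by_downsampling(example_labels)
```

Only the training split would be balanced this way; validation and test data keep their natural class ratio so that the reported metrics reflect realistic conditions.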
b) SENet training
SENet was trained on the image classification dataset created for this purpose in the previous item.
During training, the batch size, the number of images used for each weight update, was set to 32, and the training was conducted over multiple epochs until convergence, with one epoch defined as one cycle through the training data. After each epoch, the model’s training loss and validation loss were calculated, and these were used to adjust the learning rate during training. As the model’s generalization performance improves, both training and validation losses decrease. However, since validation data is not used during training, excessive epoch iterations can lead to overfitting, where the model becomes overly adapted to the training data only. To prevent this, the training was terminated when the validation loss did not decrease for 30 consecutive epochs. Additionally, data augmentation was applied during training by flipping the images horizontally and vertically to further increase the training data.
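The stopping rule described above can be sketched in plain Python. The function below replays a list of per-epoch validation losses and reports the epoch at which training would stop together with the epoch that achieved the lowest loss. The function name and the example loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=30):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Shortened example with patience 3 instead of 30.
stop_epoch, best_epoch = early_stopping_epoch([1.0, 0.8, 0.9, 0.95, 0.85], patience=3)
```

In this shortened example the loss last improves at epoch 2, so training stops at epoch 5 and the weights from epoch 2 would be the ones kept.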
For training, the Nvidia Tesla K80 GPU was used. In this environment, the duration per epoch ranged from 92 to 95 seconds. The transition of losses for each data type per epoch during training is shown in Fig. 8. The validation loss decreased over epochs, indicating that the training was progressing normally.
c) Training of the pixel classification model using SENet’s detection results
For training the pixel classification model using SENet’s detection results, we used the 10% of the captured images set aside for this purpose in item a), with their pixel-level annotations as teacher data.
Before training, the dataset’s input images were processed using SENet to generate normalized images. This process produced images masked around the cracks, which then replaced the original input images in the training dataset.
During training, the batch size was set to 32, and the training was conducted with an upper limit of 100 epochs. The training was set to end when the loss on the validation data did not decrease for 30 consecutive epochs. Data augmentation by flipping the input images horizontally and vertically was also performed to increase the training data. For training, the Nvidia GeForce 1080 Ti GPU was used. In this environment, the duration per epoch ranged from 32 to 35 seconds. Fig. 9 shows the transition of losses. The losses for the training and validation data decreased over epochs, indicating that the training was progressing normally.
d) Training of the pixel classification model using region proposal images
The training of the pixel classification model using region proposal images employed the same dataset as described in the previous section.
Before training, region proposals were made using SENet for the input images of the dataset. This process created images masked only around the cracks, which then replaced the original input images in the training dataset.
The training settings and environment were the same as in the previous section, with the duration per epoch ranging from 33 to 35 seconds. Fig. 10 shows the transition of losses. Here, there is a sharp decrease in loss around the 25th epoch, which is due to the parameter search moving into a range with a different local optimum and is not a result of the training settings. The training ended at the upper limit of 100 epochs, but the low loss values and gradual decrease suggest that the model training was sufficiently conducted.
In the previous chapter, we described three crack detection methods proposed in this study. Additionally, we conducted training for the models constructed for each method and presented the transition of losses. This chapter demonstrates the detection results for test images not used in training for each proposed method and validates their accuracy. For inference on the test images, the model from the epoch with the lowest loss in each method was used.
(1) Results by detection method [1]
The results obtained using detection method [1] are shown in the second tier of Fig. 11. It can be observed that the location and shape of the cracks in the input images are accurately detected regardless of their orientation, whether horizontal, vertical, or diagonal. This indicates that even when the deep learning model with the SE block is trained solely on a classification dataset, it is effective for crack detection.
However, as observed in the two images on the right side of the figure, some cracks remain undetected, especially those with narrower widths. The undetected cracks tend to have a smaller difference in brightness value from their surroundings and appear faint. As a result, the response of the convolutional layers is weak, and they fall below the threshold for binarization, leading to non-detection. Inspection of the normalized results of such images reveals the presence of cracks, suggesting that enhancing the contrast of the input images could be a potential improvement. Moreover, adjusting the filter size or increasing the number of filters in the convolutional layers placed before the SE block could extract more features, although it may slightly increase computational load, making it possible to detect even faint cracks.
(2) Results by detection method [2]
The normalized images generated and the detection results by the pixel classification model are shown in the third and fifth tiers of Fig. 11. This model also accurately detects cracks in all orientations, whether horizontal, vertical, or diagonal. While the normalized images initially contain many small false detection areas, their number is reduced by incorporating shape features through the small model. Compared to the results of detection method [1], the boundaries of cracks are detected more precisely. Moreover, as seen in the rightmost column, some areas that SENet alone could not detect are now detectable after further processing with the normalized images through the model.
However, there were instances, as in the central column of the figure, where false detection areas smaller than those detected by SENet increased. This is probably because areas that strongly responded in the normalized images also strongly responded in the model’s convolutions. Future improvements could include considering geometric features used in previous studies and adjusting the size of the convolution filters.
(3) Results by detection method [3]
The region proposal images generated by SENet and the detection results of the pixel classification model are shown in the fourth and sixth tiers of Fig. 11. This model, too, accurately detects cracks in all directions. While the region proposal stage includes patterns similar to cracks and holes on the concrete surface, the detection results manage to reduce these. However, undetected instances, as seen in the two rightmost images, were also observed. For extremely wide cracks, like the second from the right, the proposed region became internal to the crack, failing to convolve features such as edges, leading to non-detection. In the case of images with thin cracks, as in the rightmost image, detection could have been possible if normalized images capturing crack features with SENet, as in detection method [2], were used as input. However, detection method [3] requires learning new high-dimensional features of cracks from the proposed regions. Therefore, as the training data had fewer images with such thin cracks, detection was not possible. Future improvements could be made by increasing the number of such images in the training data and adjusting the number and size of convolution filters in the pixel classification model.
(4) Accuracy evaluation
There are many metrics to evaluate classification results in machine learning. In this study, we calculated commonly used metrics: accuracy, precision, recall, and F1 score. Additionally, we computed the Intersection over Union (IoU), a metric frequently used in object detection tasks, for accuracy evaluation.
We compared the actual crack locations with the detection results and calculated True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN) for each pixel, as defined in Table 3. Each evaluation metric was calculated using formulas (2) to (6).
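The five metrics can be computed from the per-pixel counts as in the sketch below, using their standard definitions: accuracy = (TP + TN) / (TP + FP + FN + TN), precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2 × precision × recall / (precision + recall), and IoU = TP / (TP + FP + FN). Returning 0 when a denominator is zero is an assumed convention, and the function name is hypothetical.

```python
import numpy as np

def pixel_metrics(pred, truth):
    """Pixel-wise accuracy, precision, recall, F1 and IoU for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))     # crack predicted, crack present
    fp = int(np.sum(pred & ~truth))    # crack predicted, no crack present
    fn = int(np.sum(~pred & truth))    # crack missed
    tn = int(np.sum(~pred & ~truth))   # background correctly rejected

    def ratio(num, den):
        return num / den if den else 0.0   # assumed convention for 0/0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "iou": ratio(tp, tp + fp + fn),
    }

# Toy example with exactly one pixel of each outcome.
toy_scores = pixel_metrics([[1, 1, 0, 0]], [[1, 0, 1, 0]])
```

Because cracks occupy only a small fraction of the pixels, TN dominates the counts and accuracy stays high even when many crack pixels are missed, which is why precision, recall, F1, and IoU are the more informative figures for this task.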
Table 4 presents the results calculated using the detection results of each model for the test data and the actual crack locations (abbreviated as Method [1], Method [2], and Method [3] for Detection Methods [1], [2], and [3], respectively). For comparison, detection using Mask R-CNN12), a widely known instance segmentation model capable of high-precision multi-class region detection, was also performed. For its training, we used ResNet101 13) as the base model and a dataset annotated for pixel classification from 10 captured images as teacher data. Mask R-CNN, while highly accurate, requires extensive learning parameters and computational resources for training, with over 40 million parameters in the ResNet101 part alone, as mentioned in Chapter 1.
The metrics are discussed below. Firstly, the accuracy of each method was very high, over 0.98. The precision was highest for Method [1] using SENet, indicating fewer false detections in the predicted values than the existing methods. The recall was not as high for any method compared to existing methods, probably due to the proposed methods failing to detect extremely wide or narrow cracks in the test data. However, most cracks were partially detected, suggesting significant improvements could be made by enhancing the contrast in image preprocessing or adjusting the convolutional layers before the SE block. The F1 scores were 0.40, 0.48, and 0.25 for Methods [1], [2], and [3], respectively. While inferior to Mask R-CNN, Method [2] improved the results by further refining the classification of Method [1] results, and considering the difference in learning parameters and datasets, these are sufficiently high values. As for IoU, Method [2], with only around 240,000 parameters including the SENet part, achieved similar results to Mask R-CNN with 40 million parameters in the base model alone. This indicates that Method [2] is superior among the proposed methods.
Comparing Method [1] and Method [3], the values of Method [3] are generally lower due to the reasons mentioned earlier: SENet could only detect parts of extremely wide cracks, and the proposed regions in the region proposal images only showed the internal parts of the cracks, losing necessary features for detection. Improving SENet’s accuracy could simultaneously improve this.
From these results, it can be said that the methods proposed in this study are capable of accurately detecting cracks, though there is room for improvement.
In this study, we constructed a deep learning model using an attention mechanism and proposed a method capable of detecting cracks with training only on image classification tasks. Additionally, we achieved more detailed crack detection by analyzing the regions deemed as cracks with a very small network. This approach enabled the detection of the location and shape of cracks without the need for pixel-level classified teacher images, which were previously essential. Consequently, this significantly reduced the cost of creating teacher images. Therefore, it is believed that the cost associated with fine-tuning when the detection target changes, which is a traditional issue, can also be reduced. Moreover, for more detailed crack detection, only a small amount of pixel-level classified teacher data is required, and since the network structure has far fewer parameters compared to traditional methods, it allows for reduced computational resources for learning and inference, making field implementation easier.
As for future challenges, as mentioned earlier, there are cases where noise increases, and false detection of P-con traces occurs. The noise issue can be addressed by considering the geometric features used in previous studies and adjusting the size of the convolution filters. False detection of P-con traces could potentially be eliminated using a model specifically detecting P-con traces alone, and we are also considering utilizing previously reported methods. Additionally, the application of these methods to other types of damage, such as corrosion detection14)-17), is promising.
Furthermore, combining the crack detection method with 3D model construction techniques can lead to the development of digital twins. We have already initiated proposals for such methods in references (18) and (19). We believe that simpler methods, like those presented in this study, are highly compatible with these combinations. Additionally, with the rapid advancements in language models, such as large language models, the integration of attention mechanism-based models is promising for the verbalization of damage in civil structures20)-22). We are currently exploring research in this area, leveraging the synergies between these advanced language models and structural damage detection methodologies.