Intelligence, Informatics and Infrastructure
Online ISSN: 2758-5816
Automatic detection of concrete floating and delamination by analyzing thermal images through self-supervised learning
Sota Kawanowa, Shogo Hayashi, Takayuki Okatani, Kang-Jun Liu, Pang-jo Chun

2023 Volume 4 Issue 3 Pages 21-30

Abstract

Infrared methods, which can remotely detect internal damage by capturing thermal images, often miss damaged areas when the judgment is left to humans. Additionally, although there have been efforts to introduce automatic detection based on convolutional neural networks into infrared inspection, such methods have not reached a sufficient level of precision because of a lack of supervised training data. Hence, in this study, we focus on self-supervised learning, which makes it possible to realize a high degree of accuracy even when supervised labels are scarce. We present an example of introducing self-supervised learning into the infrared method and validate its effectiveness. This paper is an English translation of the authors' previous work [Kawanowa, S. et al. (2022). "Automatic detection of inner defects of concrete by analyzing thermal images using self-supervised learning." Artificial Intelligence and Data Science, 3(J2), 47-55. (in Japanese)].

1. INTRODUCTION

Internal damage in concrete structures, such as floating and delamination, may cause concrete fragments to peel off if it is not discovered at an early stage. Currently, the main inspection method for floating and delamination is the hammering inspection; however, several points requiring improvement have been raised, such as the need for traffic restrictions and aerial work platforms, the time required, restrictions on the scope of the inspection, and the fact that inspection results currently differ greatly depending on the experience and sensibility of the engineer. Infrared inspection is a method that resolves these issues: infrared thermography is used to visualize the temperature distribution on a concrete surface and narrow down the floating/delamination candidate areas (Fig. 1). However, damaged areas are still often overlooked when the judgment is made by humans1).

Various initiatives have been taken to automatically detect damaged areas in thermal images using supervised learning1)2), and other studies have attempted to utilize convolutional neural networks (CNNs), which have been widely used in damage detection research in recent years2)-8). Kawanishi et al.9) cropped candidate damaged areas from thermal images in advance and trained a classifier to judge whether each candidate area was "requires inspection" or "other." In their results, of the 620 cases that should have been judged as "requires inspection," 596 were overlooked, and even when the threshold was set to the safe side, overlooked cases could not be eliminated completely. In our research as well, we initially attempted to verify the accuracy of automatic detection of damaged areas using deep learning. However, likely owing to a lack of supervised data, sufficient accuracy could not be obtained with conventional supervised learning. To create floating/delamination supervised data, in addition to capturing thermal images, it is necessary to inspect the candidate damaged areas and link these candidates to the inspection results, which requires a significant amount of time and cost. For this reason, there is a chronic lack of supervised data.

Hence, in this study, we considered the introduction of self-supervised learning. Specifically, we prepared a small number of unlabeled images that included damaged areas and a large number of unlabeled images that did not, and we prelearned part of a CNN on them using a self-supervised learning method. Next, the prelearned CNN was transferred to the conventional supervised learning method, and its detection accuracy was verified. The detection accuracy was compared with that of a CNN without prelearning and with that of a CNN prelearned by the conventional method, which confirmed the effectiveness of prelearning using self-supervised learning.

2. SELF-SUPERVISED LEARNING

(1) Overview

Self-supervised learning is a new method that promises to reduce the significant amount of time and cost required for creating supervised labels. Although the principle behind it is not yet fully understood10), in recent years there have been cases in which it has surpassed the accuracy of supervised learning. If such a method is effective with infrared images, it will be of great use, as large amounts of supervised data will no longer be required.

Self-supervised learning consists of a technique called contrastive learning combined with conventional supervised learning. In contrastive learning, large volumes of images without supervised labels are used to prelearn part of a CNN; this is explained in detail in the next section. When the prelearned CNN is then transferred to conventional supervised learning, a high level of accuracy can be expected even with a small volume of supervised data.

(2) Contrastive learning

The flow of contrastive learning is shown in Fig. 210). From each image in the dataset, two images are generated through data augmentation. The two images generated from a particular image are used as the query and the positive example, and the images generated from all other images are used as negative examples. Learning is performed per query, and the goal is to transform positive examples into feature vectors that are similar to the query and negative examples into feature vectors that are dissimilar to the query; this corresponds to the "pretext task" in Fig. 2. The transformation into feature vectors is performed by an encoder and an MLP (multilayer perceptron) layer.

The encoder in this transformation corresponds to a CNN. The similarity between feature vectors is generally evaluated using cosine similarity, shown in Equation (1):

$$\mathrm{sim}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} \qquad (1)$$

where $A$ is the query feature vector and $B$ is the key (positive- or negative-example) feature vector. Using these similarities, the loss function represented by Equation (2) is calculated:

$$L_q = -\log \frac{\exp(\mathrm{sim}(q, k_+)/\tau)}{\exp(\mathrm{sim}(q, k_+)/\tau) + \sum_{i=1}^{K} \exp(\mathrm{sim}(q, k_i)/\tau)} \qquad (2)$$

where $q$ is the query, $k_+$ is the positive example, $k_i$ ($i = 1, \ldots, K$) are the negative examples, and $\tau$ is a temperature parameter. Once the loss function is calculated, the encoder is updated by error backpropagation. Although various optimization algorithms are used depending on the model, the most frequently used is stochastic gradient descent (SGD).
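To make Equations (1) and (2) concrete, the following is a minimal PyTorch sketch of the contrastive loss for a single query; the function name and default temperature value are assumptions for illustration, not part of the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_negs, tau=0.2):
    """Loss of Eq. (2) for one query (the tau value is an assumption).

    q:      (d,)   query feature vector
    k_pos:  (d,)   positive-example feature vector
    k_negs: (K, d) negative-example feature vectors from the dictionary
    """
    # Normalizing makes every dot product equal the cosine similarity of Eq. (1)
    q = F.normalize(q, dim=0)
    k_pos = F.normalize(k_pos, dim=0)
    k_negs = F.normalize(k_negs, dim=1)

    l_pos = torch.dot(q, k_pos) / tau      # similarity to the positive example
    l_neg = (k_negs @ q) / tau             # similarities to the negative examples
    logits = torch.cat([l_pos.unsqueeze(0), l_neg]).unsqueeze(0)

    # Cross-entropy with target index 0 equals -log(exp(l_pos) / sum of exp(logits))
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```

Minimizing this loss pulls the positive example toward the query and pushes the negative examples away, which is exactly the goal of the pretext task described above.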

Through such learning, the CNN is expected to learn feature representations of the objects in the images. The accuracy of the pretext task itself is not important; the objective is to achieve high accuracy when the CNN is transferred to the downstream task.

(3) MoCov211)

In this study, we used Momentum Contrast version 2 (MoCov2), developed by Facebook AI and released in May 2020, as the contrastive learning method. The flow of learning is shown in Fig. 311). For data augmentation, MoCov2 applies RandomResizedCrop (crops a region of random size and rescales it), ColorJitter (randomly perturbs brightness, contrast, saturation, and hue), RandomGrayscale (converts to grayscale at random), GaussianBlur (applies a random Gaussian blur), and RandomHorizontalFlip (flips the image horizontally at random), and finally ToTensor (converts to a tensor and standardizes it).
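As a reference, the following is a minimal torchvision sketch of this augmentation pipeline, modeled on the publicly available MoCo v2 code; the parameter values (jitter strengths, probabilities, blur range, ImageNet normalization statistics) follow that code and are assumptions with respect to this paper.

```python
import random
from PIL import ImageFilter
from torchvision import transforms

class RandomGaussianBlur:
    """Gaussian blur with a randomly drawn sigma, as in the public MoCo v2 code."""
    def __init__(self, sigma=(0.1, 2.0)):
        self.sigma = sigma

    def __call__(self, img):
        return img.filter(ImageFilter.GaussianBlur(radius=random.uniform(*self.sigma)))

# Two augmented views are generated from each image with this pipeline
moco_v2_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),   # (0.6, 1.0) in this study; see Sec. 4(1)
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([RandomGaussianBlur()], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),                                 # to tensor, scaled to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],       # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```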

For learning, the queries and positive examples generated from the current minibatch and all the negative examples stored in the dictionary (the "queue" in Fig. 3) are used. After each learning step, the batch of feature vectors generated from the current minibatch is stored in the dictionary, and the oldest elements in the dictionary are deleted; the stored elements are used as negative examples in subsequent learning. The encoder consists of the convolutional layers and the fully connected layer of ResNet-50. Additionally, the transformed feature vectors are converted once again by two hidden layers (2,048 dimensions, with ReLU as the activation function) and finally output as a 128-dimensional vector. The optimization algorithm used is SGD.
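A minimal sketch of this encoder head and the queue update is shown below, with the dimensions taken from the text; the variable names and the FIFO bookkeeping are assumptions, following the style of the public MoCo code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Encoder: ResNet-50 whose final fully connected layer is replaced by a
# two-hidden-layer MLP head (2,048-d, ReLU) outputting a 128-d vector
encoder = models.resnet50()
dim_in = encoder.fc.in_features            # 2048 for ResNet-50
encoder.fc = nn.Sequential(nn.Linear(dim_in, 2048), nn.ReLU(), nn.Linear(2048, 128))

# Dictionary of negative examples, held as a fixed-size FIFO queue
# (128-d features x 10,240 entries; the size used in this study, see Sec. 4(1))
queue = nn.functional.normalize(torch.randn(128, 10240), dim=0)
queue_ptr = 0

def dequeue_and_enqueue(keys):
    """Store the newest minibatch of key features and drop the oldest one."""
    global queue_ptr
    batch_size = keys.shape[0]             # queue size must be divisible by this
    queue[:, queue_ptr:queue_ptr + batch_size] = keys.T
    queue_ptr = (queue_ptr + batch_size) % queue.shape[1]
```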

In this study, the query-side encoder learned using MoCov2 is transferred to Faster R-CNN12).

3. ACQUIRING TRAINING DATA

(1) Creating thermal images

In this study, a total of 3,580 thermal images of bridges (512 px × 640 px in size) were used. The photography location was the Takamatsu Expressway in Japan; floor slabs, girders, overhangs, and wall railings were captured, and the photography distance ranged from 2.6 to 49.7 m. A FLIR SC6000 camera was used for the capture; its specifications are shown in Table 1.

In this study, we applied a histogram flattening process to sharpen the images and removed the effects of thermal gradients in the structure. The removal of thermal gradients subtracts the moving average of the temperature distribution from the temperature of the target pixel. With the input image as f(i, j) and the output image as g(i, j), the process is expressed by Equation (3), in which u(i, j) is the moving average of the temperature distribution defined in Equation (4):

$$g(i, j) = f(i, j) - u(i, j) \qquad (3)$$

$$u(i, j) = \frac{1}{(2n+1)^2} \sum_{k=-n}^{n} \sum_{l=-n}^{n} f(i+k, j+l) \qquad (4)$$

In Equation (4), the averaging window n is set to correspond to 300 mm on the structure surface; with a window smaller than 300 mm, small temperature irregularities are emphasized, thereby increasing the likelihood of false positives. Furthermore, every image has a margin of 10-50 px that is considered noise during learning, so 32 px is trimmed from each side of every image. Through this operation, the size of all thermal images becomes 448 px × 576 px.
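A minimal numpy/scipy sketch of Equations (3) and (4) and the margin trim is given below; the conversion of the 300 mm window to a pixel width (which depends on the capture distance) and the border handling are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def remove_thermal_gradient(f, window_px):
    """Eqs. (3)-(4): subtract the moving average of the temperature distribution.

    f:         2-D array of temperatures (one thermal image)
    window_px: odd window width in pixels corresponding to 300 mm on the
               structure surface (distance-dependent; an assumption here)
    """
    u = uniform_filter(f.astype(np.float64), size=window_px, mode="nearest")
    return f - u

def trim_margin(img, margin=32):
    """Trim 32 px from each side: 512 x 640 -> 448 x 576."""
    return img[margin:-margin, margin:-margin]
```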

A specific example of a thermal image of floating/delamination in this study is shown in Fig. 4 along with a visible image.

(2) Creating supervised data

Of the 3,580 original images, 2,007 had supervised labels attached; these included the results of hammering inspections for 3,168 candidate damaged areas in the images and the coordinates of their locations. Among them, 262 images included internal damage, with 360 areas of internal damage in total. All 3,580 images were checked visually in this state, duplicate images were deleted, and additional annotation was performed using labelImg. To eliminate superfluous information during annotation, care was taken to enclose the damaged areas as tightly as possible. Through this process, 381 images containing internal damage were obtained, with supervised labels covering 807 areas of internal damage. These were used as the training data for supervised learning.

(3) Creating input images for contrastive learning use

To effectively learn the features of damaged areas through contrastive learning, it is necessary to prepare as many images as possible that do not include damaged areas. In addition, even if the thermal images generated in process (1) (Creating thermal images) were used in contrastive learning as is, it is unlikely that the features of the damaged areas could be learned well, because the damaged areas are too small relative to the overall image.

Based on the above, we first cropped images of 224 px × 224 px in size, as shown in Fig. 5, from the 381 original images in which damaged areas appear. The pink frames in Fig. 5 represent images with damaged areas, while the blue frames show images without damaged areas. The 224 px × 224 px size was chosen because ResNet, the MoCov2 encoder, randomly crops a 224 px × 224 px area from the input image for training. The supervised labels of the damaged sections were used to position the crops.

Next, from the 2,983 original images in which no damaged section appears, four images of 224 px × 224 px were cropped, as shown in Fig. 6. The cropped areas were randomly determined for each image, and care was taken to ensure that the same location was not included in multiple cropped images.
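A minimal sketch of this cropping step is shown below; the rejection-sampling strategy for avoiding overlaps is an assumption, as the paper does not describe the exact procedure.

```python
import random

def crop_four_patches(img, size=224, max_tries=100):
    """Crop four 224 x 224 patches at random positions without overlap.

    img: 2-D array, e.g., a 448 x 576 preprocessed thermal image
    """
    h, w = img.shape[:2]
    boxes, patches = [], []
    while len(patches) < 4:
        for _ in range(max_tries):
            y, x = random.randint(0, h - size), random.randint(0, w - size)
            # accept only if the new box overlaps none of the chosen boxes
            if all(abs(y - by) >= size or abs(x - bx) >= size for by, bx in boxes):
                boxes.append((y, x))
                patches.append(img[y:y + size, x:x + size])
                break
        else:
            break  # stop early if no non-overlapping position can be found
    return patches
```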

Through the above operations, a total of 13,089 images were obtained: 538 images including damaged areas and 12,551 images not including damaged areas. These images were used as the input images for MoCov2.

4. AUTOMATIC DETECTION BY MOCOV2

(1) Training by MoCov2

Eight GPUs are required to perform learning with MoCov2, so we used AWS in this study. The AMI was the Deep Learning AMI (Ubuntu 18.04), Version 54.0; the instance type was p2.8xlarge, and the conda_pytorch_latest_p37 environment was used. The MoCo13) code is publicly available on GitHub, and MoCov2 can be used simply by adjusting the settings. However, this code was designed under the assumption that learning would be performed on ImageNet, and using it in its original form had several disadvantages for our study. The points that we changed are therefore summarized below.

For data augmentation, two square images are cropped from the input image at random sizes. The default crop scale is [0.2, 1], where the figures specify the range of the crop size relative to the input image from which the random crop is sampled. However, for the CNN to efficiently learn the characteristics of the damaged sections, both the query and the positive example must contain part of the damaged area. Because the damaged areas in this study are small compared with the subjects in ImageNet, we changed the crop scale to [0.6, 1]. Next, the size of the dictionary is 65,536 by default; as the total number of images in our dataset was 13,089, the dictionary size was set to 10,240. In general, the size of the dictionary should be 50%-100% of the whole dataset and a multiple of the batch size (256 in this study).
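The dictionary-size rule of thumb can be checked mechanically; the following small sketch (the function is illustrative, not part of the MoCo code) confirms that 10,240 satisfies both conditions for our dataset.

```python
def valid_dictionary_size(k, n_images, batch_size=256):
    """Rule of thumb from the text: 50-100% of the dataset, multiple of the batch size."""
    return 0.5 * n_images <= k <= n_images and k % batch_size == 0

assert valid_dictionary_size(10240, 13089)   # 10,240 = 40 x 256, and 6,545 <= 10,240 <= 13,089
```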

Because K-fold cross-validation with K = 5 is performed in the downstream task, learning is also performed five times in MoCov2. Each time, we removed from the MoCov2 training data all input images that were cropped from the original images serving as test data in the corresponding downstream task. The number of MoCov2 training images after this operation is shown in Table 2.

Furthermore, in this study, 800 epochs of learning were performed, with the learning-rate scheduler changed to decay at 480 and 640 epochs. Although more epochs in MoCov2 are considered to lead to better learning, considering the preliminary analysis, the transition of the loss function, and the analysis time, we judged 800 epochs to be sufficient. The transition of the loss function for training with dataset 1 in Table 2 is shown in Fig. 7. The final value of the loss function was approximately 4.7, regardless of the dataset used.
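For reference, the step schedule described here can be sketched as follows; this mirrors the learning-rate adjustment of the public MoCo code, with the milestone epochs taken from the text and the 0.1 decay factor assumed from that code's default.

```python
def adjust_learning_rate(base_lr, epoch, schedule=(480, 640)):
    """Multiply the learning rate by 0.1 at each milestone epoch reached."""
    lr = base_lr
    for milestone in schedule:
        if epoch >= milestone:
            lr *= 0.1
    return lr

# Example: with base_lr = 0.03 (the MoCo default, an assumption here),
# the rate becomes 0.003 from epoch 480 and 0.0003 from epoch 640
```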

(2) Transfer to Faster R-CNN

The transfer to Faster R-CNN can also be performed using code from GitHub. In the code used in this study, the learning parameters are extracted from part of the prelearned CNN, and the name of each layer is changed to the name defined by Faster R-CNN. For Faster R-CNN, we used Detectron2, a PyTorch-based object detection library developed by Facebook AI. Additionally, the supervised data were converted in advance to a COCO-format dataset.
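A hedged sketch of this conversion is shown below, modeled on the convert-pretrain-to-detectron2.py script shipped with the public MoCo repository; the key-renaming rules follow that script, while the file names are placeholders.

```python
import pickle
import torch

ckpt = torch.load("checkpoint_0799.pth.tar", map_location="cpu")  # placeholder file name

converted = {}
for name, tensor in ckpt["state_dict"].items():
    # keep only the query-encoder backbone; drop the momentum encoder and MLP head
    if not name.startswith("module.encoder_q.") or ".fc." in name:
        continue
    name = name.replace("module.encoder_q.", "")
    if "layer" not in name:
        name = "stem." + name                            # conv1/bn1 go into Detectron2's stem
    for t in (1, 2, 3, 4):
        name = name.replace(f"layer{t}", f"res{t + 1}")  # layer1 -> res2, etc.
    for t in (1, 2, 3):
        name = name.replace(f"bn{t}", f"conv{t}.norm")
    name = name.replace("downsample.0", "shortcut")
    name = name.replace("downsample.1", "shortcut.norm")
    converted[name] = tensor.numpy()

with open("moco_backbone.pkl", "wb") as f:
    pickle.dump({"model": converted, "__author__": "MOCO",
                 "matching_heuristics": True}, f)
```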

For the transfer, the anchor box areas were changed to 8², 16², 32², and 64² px, determined on the basis of the areas of the supervised labels. For comparison with the MoCov2 detection results, a CNN without prelearning and the R-50.pkl model, a CNN prelearned on ImageNet for use with Faster R-CNN, were also used. Based on preliminary validation, the CNN prelearned with MoCov2 and the CNN without prelearning were trained for 90,000 iterations, and the CNN prelearned for Faster R-CNN was trained for 75,000 iterations; in each iteration, two images were learned. Only the CNN prelearned for Faster R-CNN was limited to 75,000 iterations because over-training at 90,000 iterations resulted in a significant drop in detection accuracy. Regarding the schedule, the default schedule values were scaled in proportion to the maximum number of iterations so that their ratio remained unchanged.
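A hedged Detectron2 sketch of these settings follows; the config keys are Detectron2's standard ones, the anchor sizes and iteration counts come from the text, and the base config file, weight path, and scaled step values are assumptions.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))    # assumed base config
cfg.MODEL.WEIGHTS = "moco_backbone.pkl"                # converted MoCov2 backbone
cfg.MODEL.ANCHOR_GENERATOR.SIZES = [[8, 16, 32, 64]]   # anchor areas 8^2 ... 64^2
cfg.SOLVER.IMS_PER_BATCH = 2                           # two images per iteration
cfg.SOLVER.MAX_ITER = 90000                            # 75,000 for the Faster R-CNN prelearned CNN
cfg.SOLVER.STEPS = (70000, 83333)                      # keeps the default STEPS/MAX_ITER ratio
```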

Google Colaboratory Pro+ was used for the analysis; the allocated GPU was a Tesla V100-SXM2. Learning took approximately 13 h for 75,000 iterations and approximately 16 h for 90,000 iterations. The final loss function for each run was approximately 0.025. Fig. 8 shows an example of the transition of the loss function when training the CNN prelearned with MoCov2.

(3) Accuracy validation

As stated in (1) (Training by MoCov2), K-fold cross-validation with K = 5 is performed. However, as discrepancies can occur between runs on the same dataset, the mean of the results obtained by training on the same dataset three times was used to verify accuracy. The dataset was divided as shown in Table 3 to avoid bias in the number of damaged areas.

Furthermore, average precision (AP) was used to evaluate the models. When calculating AP, precision and recall are computed in descending order of the confidence of the bounding boxes (hereinafter, BBoxes), using Equations (5) and (6):

$$\text{Precision} = \frac{\text{number of correctly detected BBoxes up to this point}}{\text{number of BBoxes up to this point}} \qquad (5)$$

$$\text{Recall} = \frac{\text{number of correctly detected BBoxes up to this point}}{\text{total number of damaged areas}} \qquad (6)$$

Here, "up to this point" refers to the statistics over all BBoxes whose confidence is higher than that of the BBox in focus. In the evaluation conducted in this study, we visually confirmed, one by one, whether each BBox was a correct detection. The definition of AP based on this is shown in Equation (7):

$$AP = \int_0^1 p(r) \, dr \qquad (7)$$

where r expresses recall and p(r) expresses precision as a function of recall. The rectangular approximation shown in Fig. 9 was used for the actual integral calculation.
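A minimal sketch of this AP computation using the rectangular approximation is given below; the data structure assumed for the detections is an illustration, not the paper's implementation.

```python
def average_precision(detections, n_damaged):
    """AP of Eq. (7) by the rectangular approximation of Fig. 9.

    detections: list of (confidence, is_correct) pairs, one per BBox
    n_damaged:  total number of damaged areas in the test data
    """
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    ap, tp, prev_recall = 0.0, 0, 0.0
    for rank, (_, is_correct) in enumerate(detections, start=1):
        tp += int(is_correct)
        precision = tp / rank            # Eq. (5): correct BBoxes / BBoxes up to this point
        recall = tp / n_damaged          # Eq. (6): correct BBoxes / total damaged areas
        ap += precision * (recall - prev_recall)   # rectangle of height p(r), width dr
        prev_recall = recall
    return ap
```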

The correct answer rate and detection rate defined in Equations (8) and (9) were also calculated from the obtained detection results:

$$\text{Correct answer rate} = \frac{\text{number of correctly detected areas}}{\text{total number of detected areas}} \qquad (8)$$

$$\text{Detection rate} = \frac{\text{number of damaged areas detected}}{\text{total number of damaged areas}} \qquad (9)$$

(4) Detection results

The detection results for each dataset are summarized in Tables 4 and 5, and an example of the detection results obtained by the CNN prelearned with MoCov2 is shown in Fig. 10 together with the supervised labels.

(5) Discussion of results

First, a comparison of the MoCov2 detection results and the detection results without prelearning is shown in Fig. 11. The detection results without prelearning are output in green, and the corresponding visible image is also shown. The depiction of the supervised labels and the MoCov2 detection results is the same as in Fig. 10.

Based on Fig. 11, we can say that the damaged areas can be correctly detected by prelearning with MoCov2. Comparing the indicators in Table 5, prelearning with MoCov2 increased the correct answer rate from 56.1% to 69.2% and the detection rate from 51.0% to 53.7%. Additionally, Table 4 shows that the independently calculated AP of MoCov2 was higher by 4.005 points. From the above, it can be said that through learning with MoCov2, the CNN was able to reliably learn the features of the damaged areas.

Further, Fig. 12 shows a comparison of the MoCov2 detection results with the detection results of the CNN prelearned with Faster R-CNN, which are shown in orange.

The detection rate of MoCov2 was only 0.6 points higher than that of the CNN prelearned with Faster R-CNN, which is considered to be more accurate than conventional methods. Conversely, the CNN prelearned with Faster R-CNN had a correct answer rate 0.7 points higher, and its independently calculated AP was higher by only 0.361 points. Compared with the results of Faster R-CNN, MoCov2 produced many incorrect detections; however, with a correct answer rate of 69.2%, there were few wasted detections. No clear difference was found when comparing the detection result images with those of Faster R-CNN, although there were relatively many cases in which MoCov2 mistakenly detected metal and sky areas as damaged areas; an example is shown in Fig. 12.

One cause of the large number of false detections in metal and sky areas may be that the feature representation of damaged areas was not sufficiently acquired during the MoCov2 prelearning stage. However, as the images input to MoCov2 in this study had already undergone the modifications described above, we consider that simply increasing the number of thermal images used for training is necessary for further accuracy improvements.

5. CONCLUSION

In this study, we investigated a method of introducing self-supervised learning into the infrared method. Self-supervised learning was implemented, and comparing its detection accuracy with that of a CNN without prelearning demonstrated the effectiveness of self-supervised learning. We were also able to show that the detection accuracy is comparable to that of a CNN prelearned on ImageNet with supervised learning. Here, the number of base images used for learning with MoCov2 was 3,364, which is exceedingly small compared with the 1 million or more images of ImageNet. Considering this difference, the fact that the model using MoCov2 achieved an accuracy rivaling that of detection with Faster R-CNN prelearning demonstrates the potential of self-supervised learning. Additionally, when implementing self-supervised learning, if the thermal images were used in their original form in contrastive learning, the small size of the damaged areas relative to the image as a whole would make it unlikely for the CNN to learn the features of the damaged areas efficiently. For this reason, we cropped the damaged areas before using them in contrastive learning. Ultimately, however, we consider the ideal situation to be one in which the original thermal images can be used in contrastive learning directly. In that case, it is necessary to capture thermal images in which the damaged section is centered, as large, and as clear as possible.

Issues for future investigation have been summarized below.

(1) Accumulation of training data

As we were only able to use thermal images from a limited area of Shikoku in this study, we could not confirm the general applicability of the method to other areas of Japan. The implementation of CNNs in the field of inspection is currently attracting attention; therefore, it is necessary to verify how much the detection accuracy improves as a more diverse range of supervised data is accumulated. Unfortunately, the creation of supervised data requires detailed inspections of candidate damaged areas, and the candidate areas must be accurately matched to the inspection results, which is very time consuming and costly.

(2) Accumulation of thermal image data

In contrast to supervised data, thermal image data can be accumulated simply by capturing more images. Additional testing is required to determine how much accuracy improvement can be expected when the number of thermal images used for contrastive learning is increased. Resolving this issue promises to bring about great improvements in detection accuracy.

(3) Changes in the method of capturing thermal images

Thermal images have a lower resolution than visible images. In addition, floating and delamination include many small defects and temperature differences that do not show up clearly in thermal images, and the features of such hard-to-detect damaged areas cannot be sufficiently learned by a CNN even with self-supervised learning. Moving forward, to improve the detection accuracy for these types of damaged areas, innovations in the method of capturing thermal images, such as placing the damaged areas in the center and standardizing the capture distance and angle, are required in addition to capturing thermal images with a higher-resolution camera.

(4) Integration with 3D models

Since heat transfer is a three-dimensional phenomenon, fusion with a three-dimensional model is inherently meaningful. It is therefore important to integrate the method developed in this study with BIM models in the future. The authors have already conducted research on fusing the analysis results of exterior images with 3D models14)-16), and this knowledge can be utilized.

ACKNOWLEDGMENT:

This research is supported by JSPS KAKENHI Grant Number 21H01417.

References
 
© 2023 Japan Society of Civil Engineers