2023 Volume 4 Issue 2 Pages 35-49
In recent years, unmanned aerial vehicles (UAVs) have been increasingly used in civil engineering and infrastructure maintenance because of their potential to detect cracks in asphalt pavement, especially in riparian areas. Vegetation growth in riparian areas can benefit infrastructure, e.g. by consolidating embankments, but it can also cause cracks and potholes to form. Local authorities should therefore manage vegetation growth on riparian asphalt pavement proactively to avoid these adverse effects. Monitoring before maintenance operations is also time-consuming, so an efficient approach to estimating the amount of work over a large-scale area is needed. UAVs assisted by computer vision algorithms, such as the You Only Look Once version 7 (YOLOv7) object detection model, have shown great potential in detecting and segmenting cracks in riparian road asphalt pavement. This approach can not only locate the cracks but also segment them as instances with pixel (px)-based sizes. This study provides three models derived from divided datasets: one custom dataset with several bounding box sizes (i.e., 20-, 30-, and 100-px), and two public datasets annotated by asphalt pavement surface damage type and by instance, respectively. The resulting inferences were compared with the true labels on a mesh basis and achieved around 90% accuracy (i.e., Recall and F1).
Riparian crack monitoring involves observing and recording changes in cracks. This practice is crucial for assessing the stability of riparian structures and identifying potential hazards to human safety and the environment. Considering the crack transformation process shown in Fig.1, the grass growth in step d is the most remarkable feature of riparian-area cracks. Thus, to avoid the pothole generation in step e (not included in this study), locating the cracks early and extracting them on a mesh basis in steps b and d is also necessary.
The crack survey is time-consuming and requires professional expertise. To date, vehicle-platformed AI-based crack monitoring methods for asphalt-paved roads have been developed and applied in practical crack-based road management1), 2), 3). Although these technologies can help administrators improve maintenance and management, some asphalt-paved roads in riparian areas are too dangerous or difficult for vehicles and personnel to enter, and there is no general standard for crack detection. Given these limitations, the integration of drones with computer vision algorithms has seen increasing usage for infrastructure inspections in civil engineering4), 5), 6).
This study presents a method for crack detection in riparian road asphalt pavement using a drone equipped with a digital camera for custom data collection, the You Only Look Once version 7 (YOLOv7) object detection model7) for data analysis, and public datasets for data supplementation. The YOLOv7 model, a state-of-the-art object detection model, was trained in this study on datasets of asphalt-paved or concrete cracks. The trained models were then used to detect cracks in new images of riparian road asphalt pavement collected by the drone.
The experimental results show that the combination of drone technology and the YOLOv7 model has the potential to enhance the efficiency and accuracy of crack detection in riparian road asphalt pavement. However, some cracks could not be identified by the trained models. This is mainly related to the similarity of target sizes between the input data (i.e., train/valid data) and the test data, the contrast between targets and background, and period-dependent brightness. Future research should aim to improve the robustness and accuracy of the models on these points.
(1) Study site
Fig.2 illustrates the study site, which is located upstream on the Chikuma River, a state-controlled first-class river in Japan that flows through Niigata and Nagano Prefectures into the Sea of Japan. At this site, the riparian asphalt pavements suffer from deformation caused by cracks in the road. To understand the detailed situation, one asphalt-paved road section was selected as the study site of this work.
(2) Drone-related parameters determination
In order to assess the cracks in the drone-derived images simply by visual inspection, a relatively small Ground Sample Distance (i.e., 2.5 cm GSD at around 100 m flight height) was chosen as the drone-related parameter in this study. As shown in Fig.3, the camera lens was set vertical to the ground. The device was used to collect the custom train/valid dataset and the images for accuracy assessment.
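The relationship between flight height and GSD follows from the camera geometry. As a minimal sketch (the sensor width and focal length below are hypothetical placeholders, not the actual camera specification), a flight height of roughly 100 m can yield a GSD near 2.5 cm/px:

```python
def ground_sample_distance(sensor_width_mm, focal_length_mm,
                           altitude_m, image_width_px):
    """GSD in cm/px: ground width covered by a single pixel.

    The ground swath width equals (sensor / focal length) * altitude;
    dividing by the pixel count across the sensor gives the GSD.
    """
    ground_width_cm = (sensor_width_mm / focal_length_mm) * altitude_m * 100.0
    return ground_width_cm / image_width_px


# Hypothetical example: a 23.5 mm sensor with a 16 mm lens at 100 m,
# producing 6000-px-wide images, gives roughly 2.45 cm/px.
print(ground_sample_distance(23.5, 16.0, 100.0, 6000))
```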
(3) Objects of this work
Considering the crack situations shown in Fig.4 for accuracy assessment, several species of cracks were chosen as the objects of this study, including alligator, lateral, and longitudinal cracks.
In object detection models, if the objects for inference differ too much from the objects in the training dataset, recognition (i.e., localization and classification) is always difficult. In this study, to improve accuracy on these objects, additional supplementation of the dataset was necessary.
However, generating an annotated asphalt-paved crack dataset covering several weather, light, and road situations is time-consuming. Thus, instead of annotating additional images, solving the data-shortage problem with published datasets was considered. Moreover, an individual-crack-based method is not sufficient for assessing all cracks; it has limitations especially when facing complex crack species (i.e., alligator cracks). Therefore, this study aims to improve recognition accuracy on these complex cracks.
Furthermore, recognizing crack widths with an object detection approach also has limitations, so a method that can recognize instance-based cracks was considered after crack detection.
(4) Crack datasets in this work
The crack datasets in this work mainly comprise two parts: custom and public. As shown in the upper part of Fig.5, the custom dataset mainly uses uniformed bounding box sizes, which include information on the location and size of the cracks in each image (i.e., the bounding box size-based crack dataset).
The public dataset with several crack types, called "Road Damage Dataset 2020"8) or "RDD 2020" in its publication, has annotations based mainly on crack species (i.e., the crack species-based dataset in this research). The dataset with instance cracks is derived from the "Top Transportation Datasets" project in the Roboflow Universe9) for visualizing detected cracks on concrete using instance segmentation.
The images in RDD 2020 were collected from various road or concrete types, such as highways, city roads, and rural roads, in different weather (i.e., sunny, cloudy, and rainy, as displayed in the middle part of Fig.5), capturing a range of crack species such as linear cracks and alligator cracks. The annotations in the crack species-based dataset include information on the location, size, and species of the cracks in each image.
The crack species-based classes provide a systematic and detailed representation of road damage and its potential impact on road safety. The class names and identifications are mainly based on the road maintenance and repair guidebook 2013 in Japan.
Remarkably, the public dataset was collected with a vehicle-platformed smartphone at a tilted camera angle, across several countries, e.g. Japan and India. In this study, to keep the information in the training and testing datasets as similar as possible, only the images taken in Japan were chosen as the additional supplement for the custom dataset.
The images in the instance segmentation dataset, displayed in the bottom part of Fig.5, are mainly of concrete, with more detailed position points rather than bounding box positions. The images were taken very close to the cracks, so the details can be observed clearly, which makes it much easier for the model to learn the features. Notably, the images for accuracy assessment were taken from 100 m height, where the features differ from those at close distance; thus resolution adjustments were necessary for the instance segmentation inference.
Based on the raw image in Fig.6 (a), as shown in Fig.6 (b), the multi-uniform size-based crack dataset is mainly derived from bounding box sizes (i.e., 20-, 30-, and 100-px), and 100 images were collected from a bird's-eye view using the UAV-platformed digital camera. These 100 images, each 6000 px × 4000 px, were cropped into tiles of the same size, 600 px × 600 px. From the cropped images, 1564 and 224 images containing cracks were selected for training and validation, respectively. Worth mentioning, overlapping bounding boxes with different sizes exist, as in the sample in Fig.6 (b).
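The cropping step above can be sketched as follows. This is a minimal illustration only; how the paper handled the partial strip left over along the 4000-px dimension is not stated, so this sketch simply discards incomplete tiles:

```python
import numpy as np

def tile_image(image, tile=600):
    """Split an H x W image array into non-overlapping tile x tile crops.

    Edge strips that do not fill a complete tile are discarded
    (an assumption; the source does not state how remainders were handled).
    """
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            tiles.append(image[y:y + tile, x:x + tile])
    return tiles


# A 6000 x 4000 px frame yields 10 columns x 6 rows = 60 full 600-px tiles.
frame = np.zeros((4000, 6000), dtype=np.uint8)
print(len(tile_image(frame)))
```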
In contrast, as displayed in Fig.6 (c), the bounding box sizes of the species-based dataset are derived entirely from the actual sizes of each individual crack. The four crack species samples and class names shown in Fig.6 (e) and (f) are the annotation targets. 7744 and 880 images containing cracks were selected for training and validation, respectively.
In Fig.6 (d), besides the bounding box positions as in Fig.6 (b) and (c), the instance segmentation annotations have tighter boundaries around objects and fewer missing detections overall, and provide more precise and detailed information. To put it differently, the annotations for instance segmentation include a class label for each pixel in the image, rather than for the object as a whole, which reduces the effect of the background on accuracy as much as possible. 3717 and 200 images containing cracks were selected for training and validation, respectively.
(5) Model
YOLOv7, shown in Fig.7, is a state-of-the-art object detection model that utilizes a single convolutional neural network to perform object detection and classification on an input image. This model leverages the latest advancements in computer vision and deep learning to deliver accurate and fast object detection results. YOLOv7 employs an anchor-based approach to object detection, utilizing anchor boxes to detect and classify objects within an image.
The model uses multiple parallel layers to process an input image in a hierarchical manner, allowing it to learn fine-grained features and perform object detection with high accuracy. Additionally, YOLOv7 utilizes techniques such as cross-stage partial connections and mosaic data augmentation to enhance its ability to generalize to new data and improve its accuracy on a variety of tasks. YOLOv7 also released an instance segmentation module (i.e., YOLOv7-seg); its data preparation and usage are derived from YOLOv5, and the algorithm is interlinked with the original YOLOv7 object detection weights.
(6) Model-related parameter setting
As shown in Table 1, YOLOv7 and YOLOv7-seg model training depends mainly on the following model parameter settings: batch size, epochs, learning rate, optimizer, and input image size. Batch size refers to the number of images processed at once during training; a larger batch size can result in faster training but requires more memory on the Graphics Processing Unit (GPU). Epochs refers to the number of times the entire dataset is passed through the network during training; a larger number of epochs can improve the accuracy of the model.
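To illustrate how these settings interact (the specific values below are placeholders, not the actual values in Table 1), the number of optimizer updates grows with the epoch count and shrinks with the batch size:

```python
import math

# Placeholder hyperparameters, not the actual Table 1 values.
config = {
    "batch_size": 16,       # images per optimizer update; larger needs more GPU memory
    "epochs": 100,          # full passes over the training set
    "learning_rate": 0.01,  # initial step size for the optimizer
    "optimizer": "SGD",
    "image_size": 640,      # side length of the resized network input
}

def total_iterations(num_train_images, batch_size, epochs):
    """Total optimizer updates over a training run (last batch may be partial)."""
    return math.ceil(num_train_images / batch_size) * epochs


# E.g., 1564 training images with the placeholder settings:
print(total_iterations(1564, config["batch_size"], config["epochs"]))
```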
(7) Mesh-based evaluation method
Broadly speaking, the evaluation metrics for object detection models are mainly based on Precision and Recall. Identifying individual cracks for a fair evaluation is comparatively difficult, and it is challenging to prepare a test dataset in which every individual crack is correctly annotated with a uniformed size. So in this study, instead of evaluating the cracks one by one, the areas with crack detections were considered. The minimum unit for evaluating an area with cracks is one mesh, and the mesh sizes are 10 and 50 px.
As shown in Fig.8, the trained YOLOv7 models derived from the custom and public datasets predict on the raw images of the test dataset separately. The inferred bounding-box parts of the results from the trained models (i.e., size-based and species-based results) are selected without considering the species classes. Then the numbers of meshes with cracks in the union area derived from the inference results are compared with the true label (i.e., judged visually from the UAV photography and marked by the authors). After the comparison shown in Fig.9, seven samples in Fig.10 were selected for observing the cracks missed by the multi-uniform size model. After extracting the 10-px mesh-based crack numbers from the true labels and results in Fig.9 (a~g), the scatter plot in Fig.11 shows the relationship between the crack-based numbers in the true label and the result.
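The mesh counting above can be sketched as follows. This is a minimal illustration under the assumption that boxes are given in pixel coordinates (x1, y1, x2, y2) on a 600-px tile; a mesh cell counts as cracked if any inferred box touches it, which realizes the union over overlapping boxes:

```python
import numpy as np

def boxes_to_mesh(boxes, image_size=600, mesh=10):
    """Mark every mesh cell touched by any predicted bounding box (union area)."""
    n = image_size // mesh
    grid = np.zeros((n, n), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        # Range of cells this box covers; overlapping boxes simply re-mark cells.
        grid[y1 // mesh:(y2 - 1) // mesh + 1,
             x1 // mesh:(x2 - 1) // mesh + 1] = True
    return grid

def crack_mesh_count(boxes, image_size=600, mesh=10):
    """Number of 10-px mesh cells containing a detected crack."""
    return int(boxes_to_mesh(boxes, image_size, mesh).sum())


# Two overlapping boxes: their shared cell is counted only once.
print(crack_mesh_count([(0, 0, 20, 20), (10, 10, 30, 30)]))
```

The same counting applied to the manually marked true label yields the per-image pairs plotted in Fig.11.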
Fig.12 displays the process of transferring the segmentation results derived from the YOLOv7-seg model into the 10- and 50-px mesh-based results. In Fig.13, more details of the results over the whole study site are shown. Remarkably, compared with the YOLOv7 model, the YOLOv7-seg model needs the details to be much tighter to the borders around the targets, so image size enlargement is necessary.
(1) Multi-uniform size-based results (custom dataset, YOLOv7)
The reason for choosing uniform-size bounding boxes to annotate the cracks lies in the randomness of the practical cracks in asphalt pavement. To include all crack sizes, several uniform sizes were applied to the same crack. The inferred bounding boxes overlap each other, and some smaller ones (i.e., 20- and 30-px) are also included within the bigger one (i.e., 100-px). Noteworthy, at some connections between the bigger boxes where no 100-px bounding box exists, as shown in Fig.9 (b), the smaller boxes have covered these blanks. However, Fig.10 also shows some blank parts that need to be improved.
From the results, some crack parts with specific features could not be recognized by the trained model. As shown in Fig.9 (b~e), the cracked asphalt pavement shares a common feature: grass has grown in the blank space within the crack. Because the training/validation dataset lacks similar images with grass grown in the crack crevices (i.e., with-grass images), recognizing these cracks in the test dataset remains difficult.
Another point worth mentioning is that each of Fig.9 (a~g) contains an alligator crack, but not all the alligator cracks were detected individually. This is because of the lack of images with alligator cracks in the training dataset, and it is an issue that needs to be solved to improve the accuracy.
(2) Crack species-based results (public dataset, YOLOv7)
To supplement the custom dataset's lack of robustness, the RDD 2020 dataset was also used for training to infer the cracks in the study site. Several alligator and linear cracks were detected in Fig.9 (a~g). Compared with the multi-uniform size-based results, the crack species-based results can make up for the shortage of with-grass and alligator crack detection ability. However, because the camera angle and GSD used in the RDD 2020 dataset differ too much from those of the test images, the confidence of the recognized targets is all less than 0.3.
In Fig.9 (e), the tools on the grassland were misclassified as D40 (i.e., pothole). This phenomenon shows that the usage of public datasets also has limitations, especially when the inference images differ too much from the training images.
The combination of the multi-uniform size-based and crack species-based results has, to some degree, proved the possibility of supplementing with a dataset of different GSD and camera angle. But to understand the accuracy of the supplemented results, inferred bounding boxes alone are not enough, especially when the true labels for the test dataset are not prepared to a reasonable standard. In this research, for a reasonable assessment, extraction using meshes is necessary.
(3) 10-px mesh-based crack numbers (YOLOv7)
Given the multi-uniform sizes (i.e., 20-, 30-, and 100-px) derived from the custom dataset, a 10-px mesh matches the need for assessing crack detection accuracy, and the annotation labor is also comparatively reasonable. Because crack detection is one of the yearly monitoring activities in the riparian area, roughly locating the cracks and marking the distribution map matter most to the policy makers. On these points, Fig.11 shows that the crack species-supplemented results can detect and locate over 90% of the cracks (i.e., y = 0.983x in Fig.11). However, the result is also very rough and lacks detailed size information.
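The slope in Fig.11 corresponds to a least-squares line forced through the origin. As a minimal sketch (the count pairs below are made up for illustration; the actual per-image crack numbers are those plotted in Fig.11):

```python
import numpy as np

def origin_slope(true_counts, predicted_counts):
    """Least-squares slope a of y = a*x forced through the origin.

    Minimizing sum((y - a*x)^2) over a gives a = sum(x*y) / sum(x*x).
    """
    x = np.asarray(true_counts, dtype=float)
    y = np.asarray(predicted_counts, dtype=float)
    return float((x * y).sum() / (x * x).sum())


# Illustrative counts only: predictions exactly double the labels give slope 2.
print(origin_slope([1, 2, 3], [2, 4, 6]))
```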
(4) Instance-based results (YOLOv7-seg)
Following the need for more detailed crack information, a model using more comprehensive annotations was trained on a public dataset. This public dataset is mainly derived from images taken at close distance to the targets, including but not limited to concrete walls and asphalt pavement. As shown in Fig.12, there are 10 steps for obtaining the 10- and 50-px mesh-based segmentation results on the crack targets.
Cropping the 600 px × 600 px raw images used in Fig.9 into 250 px × 250 px tiles is the first step. Because more detail is needed than for the object detection methods, image enlargement and color dodging are necessary for the inference. Thus the images in step B were enlarged 10 times, from 250 px × 250 px to 2500 px × 2500 px, in step C. The trained YOLOv7-seg model inferred the enlarged images in step D, and the inferred masks were overlaid on the enlarged images for visualization. Noteworthy, the inference results are derived from a 0.0001 confidence threshold.
After steps C and D, the enlarged images were shrunk 10 times back to their original sizes, and the 10-px mesh-based TL (true label) and inference results were made in steps E (red) and F (grey). The area where steps E and F overlap is called TP (i.e., true positive), shown in yellow in step G, meaning those meshes have been correctly identified. To obtain the larger mesh-based results, the 50-px mesh-based TL, inference, and TP were extracted in steps H, I, and J, respectively.
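Steps E to G can be sketched as follows. This is a minimal illustration assuming the true label and the shrunk inference result are boolean pixel masks of the same size, with a mesh cell counted as cracked if any pixel inside it is cracked:

```python
import numpy as np

def mask_to_mesh(mask, mesh=10):
    """Downsample a boolean pixel mask to mesh cells (any cracked pixel -> cracked cell)."""
    h, w = mask.shape
    # Trim any remainder, then view as (rows, mesh, cols, mesh) blocks.
    cells = mask[:h - h % mesh, :w - w % mesh].reshape(
        h // mesh, mesh, w // mesh, mesh)
    return cells.any(axis=(1, 3))

def mesh_tp(label_mask, pred_mask, mesh=10):
    """True-positive cells: cracked in both the true label and the inference."""
    return mask_to_mesh(label_mask, mesh) & mask_to_mesh(pred_mask, mesh)


# Toy 20 x 20 px example: label covers the top half, prediction the left half;
# with a 10-px mesh they agree only in the top-left cell.
label = np.zeros((20, 20), dtype=bool); label[0:10, :] = True
pred = np.zeros((20, 20), dtype=bool); pred[:, 0:10] = True
print(int(mesh_tp(label, pred).sum()))
```

The 50-px results of steps H to J follow from the same functions with mesh=50.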
In Fig.13, an ortho-photograph after color dodging is shown in step C; then, following the flow chart of Fig.12 from steps D to J, the results over the study site are ready for the mesh-based extraction.
(5) 10- and 50-px mesh-based crack numbers (YOLOv7-seg)
Focusing on the results in Fig.13, crack numbers were extracted from the 10- and 50-px mesh-based results. As shown in Table 2, the 10-px mesh-based Recall is over 0.75, meaning only around one quarter of all the labelled cracks were not detected correctly in the 0.25 m ground mesh. If the mesh size is enlarged 5 times, the F1 score increases from 0.64 to 0.88 in the 1.25 m ground mesh. For yearly river monitoring, a 1.25 m ground mesh is detailed enough for road asphalt pavement assessment. Especially considering the long-distance image collection and the high image quality standard needed for crack detection, a 0.025 m GSD currently fits this riparian monitoring mission.
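The Table 2 scores follow directly from the mesh cell counts. As a minimal sketch (the TP/FP/FN counts below are made up for illustration, not the actual Table 2 values):

```python
def mesh_scores(tp, fp, fn):
    """Precision, Recall, and F1 over mesh cells.

    tp: cells cracked in both the true label and the inference;
    fp: cells cracked only in the inference;
    fn: cells cracked only in the true label.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: a 0.75 recall paired with modest precision
# yields an F1 near the 10-px level reported in Table 2.
print(mesh_scores(75, 60, 25))
```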
Derived from the 50-px cracked mesh distributions, the policy makers can also make asphalt pavement renewal plans without time-consuming vehicle driving for data collection, while the distribution mapping retains a believable Recall (i.e., 0.95).
(6) Discussion
From the above results, the abilities of the YOLOv7 and YOLOv7-seg algorithms to solve the object detection and instance segmentation tasks have each been well proved. Crack numbers located with YOLOv7 in the 10-px mesh were counted 90% correctly, meaning that very few cracks go unnoticed. Based on the detailed annotations, the YOLOv7-seg algorithm in the 50-px mesh detected the mesh-based cracks with a 0.88 F1 score. Both approaches prove that the cracks can be located and detected to a reasonable standard. However, these methods also show a limitation of riparian crack monitoring: the quality of the training/validation dataset needs to be comparably high in both images and annotations, which takes a lot of time.
Generally speaking, the trained models in this research achieved comparably high assessment criteria, but some limitations also exist, such as a lack of robustness across weathers and crack species, insufficient training/validation data, and the difficulty of setting assessment standards for cracks over a large-scale area. This research also showed the possibility of using the YOLOv7 and YOLOv7-seg models to support UAV-based riparian asphalt-paved crack monitoring. In future work, if the UAV-based images can be taken by a camera with a higher-resolution zoom-in function, the current accuracy can be further improved.
This study presented methods for crack detection and segmentation in riparian road asphalt pavement using a drone equipped with a digital camera for data collection, the YOLOv7 model for data analysis, and public datasets for data supplementation. The combination of drone technology and the YOLOv7 model showed potential in enhancing the efficiency and accuracy of crack detection and segmentation. However, there were limitations in detecting and segmenting certain cracks due to complex target shapes under random contrast and brightness.
Various crack datasets, including uniformed bounding box size-, species-, and instance-based ones, were used for training and validation. The YOLOv7 (YOLOv7-seg) model, with its anchor-based approach and advanced techniques, achieved reasonable crack detection and segmentation, with F1 values around 0.9 in both cases under the mesh-based assessment of the detected crack areas. Noteworthy, the YOLOv7-seg model, which produces segmentation results, required tighter crack boundaries for accurate results. Overall, this study highlighted the potential of UAVs and computer vision algorithms for efficient and accurate crack detection in riparian road asphalt pavement.
The application of the YOLOv7-seg model in asphalt-paved crack segmentation has demonstrated that a close-distance dataset can be effectively used in remote sensing tasks. Typically, remote sensing tasks involve analyzing and interpreting data collected from a distance, such as satellite or aerial imagery. The close-distance dataset in this study has inspired the authors to consider it a valuable supplement for crack features in future work, mainly due to its relatively straightforward obtainability. However, generating the annotations still requires a large amount of human labeling. Recently, as shown in Fig.14, a new AI model called the Segment Anything Model (SAM) has empowered the generation of instance segmentation annotations, which can reduce the burden of time-consuming human labeling for researchers. SAM10) is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images, without the need for additional training.
The limitations of the trained YOLOv7 model arising from random brightness and contrast also need to be considered. To some degree, technologies like data augmentation can improve the accuracy of detecting and segmenting the cracks under these random situations. But it is then difficult to adjust the thresholds or the augmentation probability parameters without considering the weather conditions. If a reasonable parameter setting is necessary, adopting a standard evaluation approach for the brightness and contrast values is an acceptable option.
Theoretically, the method of this study can be applied to any asphalt pavement. In future work, the model trained in this study needs to be proved on other applications (e.g. monitoring the cracks on bridge-foot concrete). Drone-based riparian asphalt-paved road crack monitoring is one of several riparian environmental monitoring tasks (e.g., waste pollution detection and segmentation, land cover classification). After the analysis of the mentioned monitoring tasks, the results can also be applied in digital twin applications (e.g. yearly crack change visualization by comparing multiple crack-based segmentation maps), which can help road administrators improve maintenance and management.
This research was supported in part by the Electric Technology Research Foundation of Chugoku.