2023 Volume 4 Issue 3 Pages 1-11
The importance of ultrasonic nondestructive testing has been increasing in recent years, and there are high expectations for the potential of laser ultrasonic visualization testing, which combines laser ultrasonic testing with scattered wave visualization technology. Even if scattered waves are visualized, inspectors still need to carefully inspect the images. To automate this, this paper proposes a deep neural network for automatic defect detection and localization in LUVT images. To explore the structure of a neural network suitable to this task, we compared the LUVT image analysis problem with the generic object detection problem. Numerical experiments using real-world data from a SUS304 flat plate showed that the proposed method is more effective than the general object detection model in terms of prediction performance. We also show that the computational time required for prediction is faster than that of the general object detection model.
Nondestructive testing has become increasingly important in recent years. Ultrasonic nondestructive testing is the most widely used nondestructive testing method in engineering applications. Ultrasonic nondestructive testing uses an ultrasonic probe to inject ultrasonic waves into an inspection target and detect internal defects from the scattered waves obtained. In general, it is difficult to determine details such as the location of defects only by viewing the measured scattered waveforms. Therefore, the location of defects is identified by inverse analysis methods such as aperture synthesis method 1), inverse scattering analysis 2), and time reversal method 3). Meanwhile, the laser ultrasonic method, which enables non-contact inspection of structures and materials, has recently been developed and applied in the field of nondestructive testing. In particular, the laser ultrasonic visualization test (LUVT) 4),5), which is an attempt to visualize ultrasonic propagation on the surface of a specimen, has the advantage that defects inside structures and materials can be detected relatively easily without the need for a skilled inspector. So far, applications to metallic materials such as steel and aluminum, materials that exhibit acoustic anisotropy properties such as CFRP 6), and the nuclear and aerospace industries 7),8),9) have been reported. Despite the impressive developments in ultrasound technology, it is difficult to ignore the shortage of engineers. Even with the evolution of structural and material inspection equipment and ultrasonic visualization technology, human inspectors must still make the judgment on the presence or absence of defects for a huge amount of specimens, and the shortage of engineers, which has already crept up on us in the near future, must be urgently addressed.
On the other hand, there are growing attempts to automate ultrasonic nondestructive testing through machine learning. Prior to the advent of deep neural nets, the predictive models used to automate ultrasonic nondestructive testing were low-capacity models involving shallow neural nets 10),11) or SVMs 12),13). In the 2010s, models with huge capacity, such as deep neural networks, have successively revolutionized a variety of fields, and deep convolutional neural networks have risen to the position of de facto standard in the field of image recognition 14),15),16). In response to this fact, a number of attempts have been made to use deep models for ultrasonic non-destructive testing 17),18),19),20),8),21). Unlike other machine learning applications, ultrasonic non-destructive testing using huge-capacity machine learning models has a serious problem, which is that the positive examples for training predictive models tend to be sparse, and the training datasets cannot be made large. To achieve better prediction performance for unseen data, it is known that careful selection of a model with a capacity commensurate with the size of the available data set is important22),23),24). The task of defect detection from LUVT images addressed in this study is highly related to the task of object detection in generic image recognition. However, for the reasons mentioned above, there is concern that the direct use of models that have been successfully used for generic object detection in the field of image recognition may be at risk of overfitting.
In this study, we re-designed the deep model structure to fit the task of defect detection from LUVT images. First of all, we examined the differences between the LUVT image analysis task and the generic object detection task. We then identified which modules of a typical state-of-the-art object detection model are necessary for the LUVT image analysis task and which modules are unnecessary. We finally obtained a slimmer framework by eliminating unnecessary modules for LUVT image analysis. This paper reports the experimental results of applying the resultant slimmer framework to real LUVT image data and comparing it with the state-of-the-art object detection model.
In this section, a brief summary of LUVT is provided for the sake of explanation in the subsequent sections. In non-destructive inspection using LUVT, a laser is irradiated onto the target structure or material as shown in Figure 1. At this time, ultrasonic waves are generated at the laser irradiation point. This ultrasonic wave is then sent to a predetermined receiving point, where the ultrasonic waves are received at a predetermined receiving point which is an ultrasonic probe. The ultrasonic probe receives the ultrasonic waves by interchanging the transmitter and receiver points using the conflict theorem. The receiver point is then the ultrasonic wave source. Interchanging the receiving point and the transmitting point produces the pseudo ultrasonic waveform transmitted from the ultrasonic probe at the receiving point. Therein, the transmitting point is the laser irradiation point. In other words, when this operation is performed using the conflict theorem, the apparent sending point is the ultrasonic probe and the receiving point is the laser point, which is interchanged with the original setup. Therefore, if the ultrasonic probe is fixed and the laser is densely scanned over a suitable region of the specimen, and then the conflict theorem is applied to all combinations of the ultrasonic probe and all laser irradiation points, a pseudo-time history image of ultrasonic propagation in the laser scan region emitted from the ultrasonic probe can be obtained. Figure 2 shows two examples of LUVT image obtained for a SUS304 flat plate, where Figure 2(a) is an LUVT image without defects, and Figure 2(b) is an LUVT image with a defect. As shown in Figure 2(a), when there is no defect, the incident ultrasonic waves from the ultrasonic probe located above propagate directly without being scattered. When there is a defect, scattered waves from the defect are generated as shown in Figure 2(b). An advantage of LUVT over other ultrasonic tests is that even a non-expert engineer can easily see the scattered waves from defects and judge the presence or absence of defects in the specimen. We developed a deep neural network model that estimate the presence and location of defects based on the generation of such scattered waves.


It has been more than a decade since deep learning was introduced to the field of computer science as a technique for detecting objects in images, and its performance has evolved at a dizzying pace 25),14),26).
Rather than blindly trying the generic object detection techniques, a prior thorough understanding of the differences between generic object detection and the task in this study leads to development of a better machine learning algorithm. This section provides a discussion on the comparison of LUVT image analysis and generic object detection.
Dataset: When developing machine learning models, it is important to consider not only the power of the input to predict the output, but also the size of the data set available to train the model. If the task is simpler and the dataset is larger, it is possible to train larger and more complex models and expect higher generalization performance. Meanwhile, if the dataset is small, large models will cause overfitting and reduce the generalization ability. In many generic object detection methods, data for training can be collected simply by taking pictures with a camera, making it relatively easy to acquire so-called big data as long as the annotation cost is paid. In contrast to the generic object detection, collecting LUVT data requires a time-consuming and costly procedure, such as manufacturing an inspection target with internal defects, irradiating it with laser ultrasonic using a laser ultrasonic visualization system (Figure 1), and performing the imaging process. The size of the dataset used for training, therefore, cannot be scaled easily due to the high cost paid for this data collection process.
Number of detections: When benchmarking generic object detection techniques, all objects in each input image are required to be enumerated. For example, in the self-driving task, it is necessary to detect every pedestrian because missing a pedestrian can lead directly to a serious accident. In contrast, the LUVT task does not claim enumeration of all defects in a specimen, because the importance for the nondestructive testing lies in whether there are no defects or whether there are one or more defects.
Number of categories: Deep learning evolved with multi-category classification as its most fundamental task. Most generic object detection models incorporate this trend and are equipped with a module for multi-category classification. In fact, many generic object detection tasks require categorization of detected objects. For example, in the case of object detection from an in-vehicle camera, the object is classified into many categories including a pedestrian, a bicycle, a traffuc sign, and a bus. In the LUVT task, it is possible to classify a defect into one of types such as a crack and a cavity, although it is not necessary. Therefore, in this study, we decided not to add a module for categorization to the model framework.
Background: The backgrounds for generic object detection are often diverse. For example, in the case of vehicle-mounted camera image analysis, the background may be an urban area, a rural landscape, daytime, or dusk. The background of LUVT images is approximately uniform, although there is a lot of signal noise that is too strong to distinguish from the true scattered waves.
Bounding box: Generic object detection is required to accurately estimate the rectangle surrounding the object, referred to as the bounding box. For example, after detecting the bounding box of a pedestrian, pose estimation can be performed subsequently to predict the next action of the pedestrian. For LUVT, once the approximate center of the defect is known, it is not necessary to go as far as the bounding box, since the location can be examined in some other way if necessary.
Preferable prediction performance: Depending on the application, generic object detection is usually required to reduce both false positives and false negatives at the same time. For example, in the case of automatic driving, if a driver inadvertently thinks there is a pedestrian crossing the road and brakes suddenly, even though there is no pedestrian crossing, the car may be rear-ended. Meanwhile, overlooking a pedestrian crossing a road can cause a catastrophe. For nondestructive testing of structures and materials, the reduction of false negatives is particularly important. Even if a defect in a structure or material is detected incorrectly by LUVT, it can be corrected by the inspector’s visual confirmation, so it does not cause serious losses. In contrast, missing defects can cause significant losses in the future.
What structure is appropriate for a nerual network to detect defects in LUVT images is discussed in this section.
Transformers: Recently, the Vision Transformer (ViT) 25) has attracted attention as an alternative image classifier to convolution-based networks, and the Swin Transformer 27),28) and DETR 29),30) have emerged as ViT-based object detectors. However, Transformer-based models require huge datasets for training and are considered unsuitable for the task of LUVT image analysis, where available datasets are limited 31).
Convolutional networks: Many convolutional network-based object detection algorithms have been developed before the emergence of ViT. There are two types of object detectors: two-stage and one-stage approaches. The two-stage approach first estimates candidate regions and then removes non-target objects from the candidate regions 32),33). Two-stage approaches are highly accurate but computationally expensive. In contrast, the one-stage approach solves the issue of heavy computational complexity by combining the pipeline into a single network 34),35),36). Furthermore, the accuracy of the one-stage approach has already been comparable to the two-stage approach.
A majority of the high-performance one-step approaches employ the model structure shown in Figure 3(a) 14). The model structure consists of three modules: the backbone, the neck, and the head. Given an image input, the classification head outputs predictions of object categories, and the regression head outputs predictions of object locations. The backbone extracts a feature map from the image. The neck contains a multi-scaling module to avoid missing small objects.
Proposed models: Based on the characteristics of LUVT images described so far, and after the considerations described below, we re-considered simple models having emerged at the dawn of deep learning to develop an automatic defect detection method for LUVT 37). The requirements for LUVT are simpler than those for general object detection tasks, although the set of available data for training is small. To gain the usefulness of the prediction model, the model structures and the tasks are in need to be selected carefully. Consequently, the goal of this study was designed not to categorize, but only to predict the presence or absence of defects. The proposed model detects only one defect and predicts only the center location of that defect, not the bounding box. In such a task setting, the state-of-the-art object detection model has many unnecessary modules. Therefore, in this study, we decided to eliminate the unnecessary modules, leaving the backbone that performs feature extraction. Usually, non-destructive ultrasonic testing can detect defects of the same size as the wavelength of the ultrasonic waves being handled. Although LUVT has a wide range of applications, this study is a basic investigation for detecting defects in metallic materials such as steel and aluminum, and aims to detect defects with a length of several millimeters. Defects that are sufficiently small compared to the wavelength of ultrasonic waves are ignored because ultrasonic waves penetrate the defect and no scattered waves are generated. Therefore, we did not aim to detect such small scratches in this study, which allows us to eliminate the neck module. The above considerations led to a simple model structure as shown in Figure 3(b).

This section describes how a dataset for machine learning and performance assessment was prepared for the experiments reported in Section 7.. The specimens were SUS304 flat plates with a thickness of 2 mm as shown in Figure 4. The laser scan area was a square area of 8cm × 8cm, as shown by the red dotted line in Figure 4. The laser irradiation distance was approximately 59 cm. The laser irradiation pitch and the number of irradiated points in the area, respectively, were 0.501mm × 0.501mm and 159×161=25,599 points for the horizontal and vertical directions. The ultrasonic probe was placed above the dotted line. The LUVT principle follows the conflict theorem, which leads to the fact that the incident ultrasound in the LUVT image is incident from above the image. The laser scan is performed over the entire area of this dotted box, allowing the defect detection within this area. The defects on the SUS304 flat plate were artificial defects, the shape of which is a slit of 2 cm in length, 1 mm in opening width, and 1 mm in depth. Hereinafter, the slits are simply referred to as defects. Nine positions were selected and a split was made in each position as shown in Table 1. A surface wave probe with a center frequency of 2 MHz was used as the ultrasonic probe. Under these LUVT conditions, several SUS304 plates without and with defects were irradiated with lasers, and the ultrasonic propagation at each time was imaged by LUVT.
When LUVT is applied to a sample with defects, the incident ultrasonic waves are scattered by the defects at a certain time, generating scattered waves, as shown in Figure 4. These images consist of approximately 520 steps, which are the time history of the ultrasonic waves generated in the SUS304 plate from the ultrasonic probe at the top of the image until the incident wave finally reaches the bottom of the laser scan area.
In order to learn prediction models and to evaluate the prediction performance, it is necessary to assign the true binary class label to each image used for training and performance evaluation. In this study, we labeled the images between the time the ultrasonic wave hits the defect and the time when the wavefront of the scattered wave generated by the defect is no longer visible as the annotated label for the presence of a defect. The other images were assigned the annotated label of no defect. In addition to the presence/absence of defects, the location information of each defect was also annotated. Thus, an LUVT image dataset was organized as shown in Table 1. To evaluate the performance of the prediction model, the dataset was divided into three exclusive subsets called training dataset, validation dataset, and test dataset. The training data were used for training the prediction model. The weight parameters were recorded for each epoch, and the weights with the best prediction performance on the validation dataset were selected as the learning results. The selected weights were used to evaluate the performance on the testing data. The dataset consists of ten image series. Six of the ten series were used for training, two were for validation, and the remaining two were for testing.


In this study, a predictor was developed to predict the presence or absence of a defect in a SUS304 flat plate, as well as to estimate the location of the defect. The network shown in Figure 3(b) contains weight parameters to be adjusted for both the backbone and the head. The values of the weight parameters are automatically determined from the training dataset. The training of a neural network is usually done by minimizing an error function also known as an empirical risk function. The error function employed in this study consists of the sum of the error term for the classification task and the error term for the regression task. The error term for the classification task was the average of the cross-entropy loss for each training data. The cross-entropy loss is large when the output of the discriminative head is inconsistent with the actual presence or absence of defects. The error term for the regression task was the mean squared error of the predicted positions for the LUVT images containing a defect.
Learning is done by standard mini-batch gradient descent. Denote by Ψ(I; θ) := [Ψ1(I; θ), . . . , Ψd−1(I; θ), 1]⊤ ∈
d the feature map extracted by the backbone module from an input image I, where θ is the weight parameter included in the backbone. The classification head and the regression head use W c := [w1,c, w2,c] and W r := [w ,x,r, w y,r], respectively, to output two-dimensional signals (W c)⊤Ψ(I ; θ) and (W r)⊤Ψ(I ; θ).
Let N be the number of training images. Let I (1), . . . , I (N) be the training images. Each image I (n) is given a class label c (n) ∈ {1, 2}. Let N+ be the number of training images with defects and N− := N − N+ be the number of training images without defects. The training image order is arranged as c(1) = · · · = c (N+) = 1, c (N++1) = · · · = c (N) = 2. The weight parameters of the developed deep model are (θ, W c, W r), and these values are adjusted to minimize the error function. The error function R(θ, W c, W r) is a linear combination of the error term for classification and the error term for regression:
However, ℓ(n) := x (n), y (n) ⊤ is (x, y) coordinates of the defect location given by the annotation. CE(·; c) :
2 →
denotes the cross-entropy loss function:
By minimizing the error function R, the parameters of the classification and regression heads can be optimized, and at the same time the backbone for feature extraction can be tuned to be compatible with defect detection and localization.
The size of the LUVT input image was set to 224 × 224. The data were augmented by Resize, Crop, Shift, Scale, Rotate, and transforming Brightness, Contrast, and Hue-Saturation-Value. Normal noise was added for a countermeasure against noise. A saturation transformation was also added to draw attention to the shape of the waveform.
Minimizing the error function R is a convex problem if the backbone weights θ are fixed to pre-trained values such as ImageNet and only W c and W r are adjusted, so optimization is easy 38),39). Instead, if the backbone is to be optimized simultaneously, the optimization scheme must be carefully chosen. Preliminary experiments showed that learning algorithm for the proposed model runs stably in the following settings. Adam is used as the optimization method. The batch size is set to 256. The initial learning rate is set to 10−3 and the model is trained for 30 epochs. In the first ten epochs, only the classification and regression heads are trained. From the 11th epoch to the 20th epoch, the learning rate is changed down to 10−4 and the tail blocks of the backbone are also updated. In the last ten epochs, the entire model is trained.
To investigate how well the automatic inspection algorithm developed in this study can predict the presence or absence of defects and how accurately locate them, we conducted an evaluation experiment using the dataset described in Section 5..
Backbone: The type of the backbone in the proposed framework can be selected. We examined four types of backbones: EfficientNet 40), ResNet18 41), ResNet34, and ResNet50. The reason for choosing EfficientNet is to compare it with EfficientDet. In Section 4. we mentioned that it is preferable to simplify the model structure for the LUVT image analysis. EfficientDet is EfficientNet as its backbone and many additional modules for generic object detection. EfficientNet has several options for the model size. In this experiment, B0 was chosen. The corresponding option for EfficientDet is D0. Efficient-Net and EfficientDet shall be abbreviated to EffNet and EffDet, respectively, at the lack of space. Combining the proposed framework with EfficientNet and comparing it with EfficientDet allows us to verify our idea discussed in Section 4.. In addition, we added typical convolutional networks, ResNet18, ResNet34, and ResNet50, to test the performance when the backbone is changed to other convolutional networks.
Computational resources: Two types of computers were used. One was equipped with a GPU, and the other was not, as summarized in Table 2 and Table 3. Both yield the same prediction results, but with different computation times. The GPU machine makes faster prediction, but consumes more power. Laptops with GPUs exist, but they are too heavy currently, increasing the burden on the human inspector. The other computer does not have a GPU, and thus provides a more realistic environment for implementing the proposed method.
Learning curves: The learning curves of the deep learning models, plotting the relationship between loss and epochs, are shown in Figure 5. The green curve represents the training curve, while the red curve represents the validation curve. Figure 5(a) and (b) show the learning curves of EfficientDet and the proposed model with ResNet18 as the backbone, respectively. As the training progresses, there is a decreasing trend in both the training loss and validation loss. After the 83 epoch, the loss in EfficientDet remained consistently below 0.6, and similarly, the proposed method kept the loss value below 0.3 starting from the 26 epoch. This indicates that the learning process is successful for each model.
Performance assessment: A positive example in the testing dataset was treated as a true positive if the classification head predicted that the defect was present and the distance between the predicted position and the correct defect position in the regression head was within r pixels, where r is a pre-specified error margin. A positive example was a false negative if no defect was predicted or the distance between predicted position and the true position was over r pixels.
Comparison of different backbones: Figure 6 plots the precision and the recall for different error margins r. For a threshold value of r ≥ 80, both the differences in the two performance measures of the four proposed methods were small (At r = 80, the difference in the precision was 0.005 and the difference in the recall was 0.023 in case of comparing EfficientNet with ResNet18).
When the threshold r was too small, neither method predicted correctly. As the threshold was gradually increased, the precision and the recall of the proposed method with ResNet18 went up earlier than those with EfficientNet. For example, at r = 40, both Precision and Recall for ResNet18 were 0.92, while they were 0.66 and 0.67 for EfficientNet. Around r = 30, the precision and the recall were lower than those of ResNet18 when the larger modules ResNet34 was used as the backbone. This is due to overfitting caused by the excessive capacity of the backbone. ResNet34 is a module with a larger capacity than ResNet18. ResNet34 is 1.90 times larger than ResNet18. Curiously, however, ResNet50, which has a larger capacity, achieved higher precision and recall than ResNet34. Balestriero et al. reported that large capacities can lead to overfitting, although further increasing the capacity can once again lower the generalization error 42), which might explain this phenomenon.
Comparison with EfficientDet: We compared the performance of the proposed model and Efficient-Det where EfficientDet was a conventional state-of-the-art model. The precision and the recall of EfficientDet are plotted against the error margin r in Figure 6. The detection performance of the proposed method exceeded that of EfficientDet when r ≥ 40, which means that the proposed model reduced false detections and missed defects of the conventional method. There are some qualitative differences between the defect detection results of EfficientDet and the proposed method. Many defectfree images in the datast were contaminated with considerable measurement noise and other factors that caused waves to be disturbed regardless of defects. The prediction results for such defect-free images are shown in Figure 7(a),(b). EfficientDet often overdetected such wave disturbances, bringing many false positives. Another typical case was that the scattered waves emitted from the defect were momentarily missing after the waves reached the defect, and the defect was not clearly visible. The prediction results for images with such defects are shown in Figure 7(c),(d). In many cases, the proposed method correctly detected the defects, whereas EfficientDet often missed such defects.
It can be seen from Figure 6 that the curves of the precision and the recall of EfficientDet rise faster than those of the proposed model. This suggests that EfficientDet has a smaller error in estimating the location of detected defects. The prediction results for a certain defect image are shown in Figure 7(e),(f). It is observed that EfficientDet often estimated the location of defects more accurately when they were detected correctly. However, the inspection is not entirely left to the machine in automated ultrasonic nondestructive testing using LUVT. Defects detected by the machine may be reconfirmed by the human inspector. Hence, a certain amount of error in the prediction of defect positions is tolerated. The curves of r ≥ 80 in Figure 6 suggests that, without demanding a too small localization error, the proposed methods can achieve much higher precision and recall than EfficientDet.
Comparison of computation time: We investigated whether the increased number of parameters compared to EfficientDet, a conventional object detection model, would result in a practical problem in terms of computation time required for prediction. The results are shown in Table 4. EfficientDet took 3.89 msec per image using a GPU and 33.52 msec using a CPU. The proposed method using ResNet18 required 1.90 msec of GPU time and 7.70 msec of CPU time. The runtime was reduced despite the increase in capacity by a factor of more than three. This could be attributed to the elimination of unnecessary modules from the conventional object detection model.






In this paper, we proposed a deep framework for automatic defect detection and location estimation in LUVT. In order to explore the structure of a neural network suitable for this task, the differences between the LUVT analysis problem and the generic object detection problem were elucidated. Then, the network was reconfigured by examining the necessity of each module in a state-of-the-art model for generic object detection. Computer experiments using actual measurement data of SUS304 flat plates showed that the prediction performance of the proposed methods were superior to that of the conventional object detection method. The experimental results also showed that the computational time required for prediction is faster than that of the state-of-the-art model. In the future, we plan to investigate the effectiveness of the proposed method for other types of materials such as CFRP generating anisotropic scatter waves.
This work was supported by SECOM Science and Technology Foundation and JSPS KAKENHI (C)(21K0423100). This work used computational resources provided by Kyoto University and Hokkaido University through Joint Usage/Research Center for Interdisciplinary Largescale Information Infrastructures and High Performance Computing Infrastructure in Japan (Project ID: jh220033).