Automatic structural evaluation is crucial, particularly in post-disaster scenarios. While significant progress has been made in image captioning, its potential as a tool for structural damage assessment has not been thoroughly explored. Image captioning offers the ability to generate descriptive captions that aid further analysis and decision-making. This study develops an image captioning model designed for structural damage images and compares four popular convolutional neural networks (CNNs), namely VGG16, ResNet50, InceptionV3, and EfficientNet. All evaluated models performed very well in generating captions for structural damage images; however, InceptionV3 showed a slight edge over the other models, highlighting its strong caption generation ability for structural damage evaluation. Furthermore, while variations in training times were observed among the CNN models, the differences in processing times for caption generation in practical application were negligible. The findings of this study underscore the effectiveness of different CNN models for image captioning in the context of structural damage evaluation and emphasize the potential of image captioning as a valuable tool for automated structural evaluation. The study also calls for further research to enhance the accuracy, efficiency, and interpretability of automatic structural evaluation using image captioning approaches.
The devastating 7.8 magnitude earthquake that struck Turkey and Syria on February 6, 2023, stands as the fifth-deadliest earthquake of the 21st century, resulting in an official death toll exceeding 52,000 across the two countries 1).
The impact of this disaster becomes even more apparent when considering the statistics: estimates indicate that more than 150,000 buildings have collapsed or become uninhabitable in Turkey alone, leaving over a million people homeless 2). This highlights a pressing concern for developing nations: the need to manage natural disasters effectively during both the rescue and evaluation phases.
In the aftermath of a powerful earthquake, the scarcity of experts available to conduct structural evaluations for buildings poses a critical challenge. Simultaneously, a significant number of individuals may either seek refuge within potentially hazardous structures or vacate safe buildings due to safety concerns. Traditional methods of expert visual inspection, while essential, are marred by their time-intensive nature, proneness to human error, and unavailability during immediate safety evaluations necessitated by natural calamities. However, the convergence of computer vision (CV) and image processing techniques has yielded transformative outcomes in various fields, including structural health monitoring 3) and medical diagnostics 4).
A pivotal area of exploration within this realm is damage detection and identification, which is crucial for assessing the condition of structural elements, infrastructure, and systems. Traditionally reliant on human inspection, this process has gradually evolved with the incorporation of CV and deep learning, giving rise to automated solutions. Although these techniques are widely employed for classifying structural damage images into distinct categories 5,6), such classification approaches frequently fall short of offering detailed contextual understanding, especially when images display multiple forms of damage or include a diverse array of structural components.
The emergence of Image Captioning (IC), an integration of CV and Natural Language Processing (NLP), has paved the way for generating comprehensive textual descriptions that encapsulate vital information from images. These captions can convey a broad array of details, ranging from the image's scene and the extent of damage to specific damage types, locations, and quantities, as shown in Table 1.
Unlike mere damage detection, IC models excel in recognizing damage severity, types, and spatial distributions, yielding outputs with multi-dimensional information.
IC has found utility in diverse domains and has drawn substantial academic interest 7). In the field of civil engineering, some researchers have used IC to aid construction site activity monitoring 8,9), while others have produced explanatory texts for diverse forms of bridge damage 10). To our knowledge, however, the use of image captioning as a tool for evaluating structural health post-disaster remains unexplored.
In this study, we aim to address this gap by constructing IC models using multiple CNNs, leveraging the capabilities of deep learning to generate insightful captions and thereby contribute to the evaluation of structural well-being following catastrophic incidents.
IC has witnessed rapid development in recent years 11,12). We use the general model shown in Fig. 1, which consists of two parts working in parallel: the image understanding part and the text generation part.
The image understanding part is responsible for extracting the features of the image and capturing the information it contains.
The text understanding (NLP) part is responsible for understanding and generating caption sentences, which should have correct grammar, sound syntactic structure, coherent wording, and informative content.
These features, along with the caption, are passed to the model, which learns the connection between the caption and the image features and can then produce a new caption when similar image features are presented.
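In practice, this means each training sample pairs an image feature vector with a partial caption and the next word to predict. The sketch below, in Keras-style Python, shows one way such pairs could be constructed; the paper does not detail its preprocessing code, so the tokenizer, maximum caption length, and function name here are illustrative assumptions.

```python
# Illustrative sketch: turning one (image feature, caption) pair into supervised samples
# of the form (image feature, partial caption) -> next word. The tokenizer, max_len, and
# vocab_size are assumed to come from standard Keras-style preprocessing.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def make_training_pairs(image_feature, caption, tokenizer, max_len, vocab_size):
    X_img, X_seq, y = [], [], []
    seq = tokenizer.texts_to_sequences([caption])[0]          # caption -> list of word indices
    for i in range(1, len(seq)):
        in_seq = pad_sequences([seq[:i]], maxlen=max_len)[0]  # words seen so far
        out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]  # next word, one-hot
        X_img.append(image_feature)
        X_seq.append(in_seq)
        y.append(out_word)
    return np.array(X_img), np.array(X_seq), np.array(y)
```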
(1) Image understanding part:
For this study, we used four of the most well-known CNNs developed in recent years, VGG16, ResNet50, InceptionV3, and EfficientNetB0 13-16), to compare their performance and determine the best model for image captioning for damage detection and identification.
Each of these models is suitable for general image classification tasks. However, we use only the extracted features for IC (not the classification output), so we removed the final classification layer from each model, as sketched after the model descriptions below.
a) VGG16
VGG16 is a widely adopted CNN architecture that uses small convolution filters throughout a deep convolutional network. It is known for its simplicity and effectiveness, consisting of 16 weight layers arranged as a chain of convolutional and max-pooling layers followed by fully connected layers.
b) ResNet50:
ResNet50 has a deep residual network architecture, which addresses the issue of vanishing gradients using skip connections. This enables the network to learn deeper representations without significant degradation in performance.
c) InceptionV3:
InceptionV3 is designed to capture rich visual information through its inception modules, which use multi-scale convolutional operations. It achieves a good balance between computational efficiency and performance. InceptionV3’s multiple parallel convolutions help capture both local and global image features, enabling it to extract diverse visual cues that can be beneficial for generating descriptive captions.
d) EfficientNetB0:
EfficientNet has gained attention for its superior performance and computational efficiency. The EfficientNet family uses a compound scaling technique to balance network depth, width, and resolution; EfficientNetB0, its baseline model, achieves high accuracy with fewer parameters than comparable models.
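A minimal sketch of this feature-extraction step is given below, assuming the standard Keras pretrained ImageNet models (an assumption, since the exact implementation is not stated in the paper); the final softmax classification layer is dropped and the penultimate layer's output is used as the image feature vector.

```python
# Sketch (assuming Keras pretrained ImageNet models): each backbone is loaded with its
# classifier head, and only the final classification layer is removed, so the output of
# the penultimate layer serves as the image feature vector.
from tensorflow.keras.applications import VGG16, ResNet50, InceptionV3, EfficientNetB0
from tensorflow.keras.models import Model

BACKBONES = {"VGG16": VGG16, "ResNet50": ResNet50,
             "InceptionV3": InceptionV3, "EfficientNetB0": EfficientNetB0}

def build_feature_extractor(name):
    base = BACKBONES[name](weights="imagenet")    # full model, including the classifier head
    # Keep everything up to the penultimate layer; the softmax classification layer is dropped.
    return Model(inputs=base.input, outputs=base.layers[-2].output)

# Example use: features = build_feature_extractor("InceptionV3").predict(preprocessed_images)
# Note: each backbone expects its own preprocess_input and input size (e.g. 224x224 for VGG16,
# 299x299 for InceptionV3); that preprocessing is omitted here for brevity.
```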
(2) Caption understanding part:
For NLP, the main component is the Long Short-Term Memory (LSTM) network 17), a specific type of recurrent neural network designed to capture long-term dependencies in sequential data. It processes the input sequence step by step, incorporating contextual information from previous words to generate an output.
(3) Caption generation part:
The preceding parts form the encoder of the model, and the subsequent part (ADD - Dense ReLU - Dense Softmax) is the caption generation part (the decoder), which takes the visual features from the image and the processed textual information as inputs to generate the final new caption.
(4) Dropout:
A dropout layer with a rate of 40% is applied to both the image and text branches to avoid overfitting. Dropout randomly sets a portion of the connections to zero at each update during training, which prevents the model from over-relying on a specific link (feature) and helps it extract more information from the previous layer.
(5) Full model:
Fig. 2 shows the full model, in which the data is fed into two branches for image and text understanding. After the fusion of features, a dense layer provides a non-linear transformation of the outputs of both branches. Finally, a softmax layer maps the output of the preceding dense layer to a probability distribution over the vocabulary, where each value represents the probability of a specific word being the next word in the generated caption.
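A minimal Keras-style sketch of this merge architecture is shown below. The paper specifies only the layer types (Dropout, LSTM, ADD, Dense ReLU, Dense Softmax), so the unit counts and the embedding configuration are illustrative assumptions.

```python
# Sketch of the merge architecture described above; layer sizes are illustrative assumptions.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def build_caption_model(feature_dim, vocab_size, max_len, units=256):
    # Image branch: CNN feature vector -> Dropout(0.4) -> Dense ReLU
    img_in = Input(shape=(feature_dim,))
    img_branch = Dense(units, activation="relu")(Dropout(0.4)(img_in))

    # Text branch: word-index sequence -> Embedding -> Dropout(0.4) -> LSTM
    txt_in = Input(shape=(max_len,))
    emb = Embedding(vocab_size, units, mask_zero=True)(txt_in)
    txt_branch = LSTM(units)(Dropout(0.4)(emb))

    # Decoder: ADD -> Dense ReLU -> Dense Softmax over the vocabulary
    merged = add([img_branch, txt_branch])
    hidden = Dense(units, activation="relu")(merged)
    out = Dense(vocab_size, activation="softmax")(hidden)
    return Model(inputs=[img_in, txt_in], outputs=out)
```

Because the two branches are fused by element-wise addition, they must have the same dimensionality, which is why both are projected to the same number of units before the ADD layer.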
A separate model was built for each of the four CNNs.
(1) Dataset
Using deep learning to produce well-performing models requires a large amount of data that is both diverse and representative of the specific task. Such a dataset is not available for structural damage, so we used 2000 unique images from the Image-Hub Structural database 3).
The Image-Hub Structural database is intended for classification tasks, which means captions are not provided, so our dataset was manually labeled with a limited vocabulary chosen to produce ground-truth-like captions rather than natural, human-like captions, which tend to exhibit more linguistic creativity, including variations in writing style, word choice, and sentence structure. Ground-truth captions, on the other hand, prioritize accuracy, clarity, and faithfulness to the visual scene, which often results in more concise and standardized captions for models.
This is possible because our purpose is to develop an IC model whose output can be further processed to extract as much information as possible from an image (or set of images).
Table 2 shows the vocabulary used in the model and the occurrence of each word across all 2000 captions, separated into training, validation, and test sets using a randomized 60-20-20 split.
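The sketch below illustrates how such a randomized 60-20-20 split and the per-split word counts of Table 2 could be produced; the data structure (a list of image-caption pairs) and the fixed random seed are illustrative assumptions.

```python
# Illustrative sketch of a randomized 60-20-20 split with per-split word counts.
import random
from collections import Counter

def split_and_count(samples, seed=0):
    """samples: list of (image_id, caption) pairs; returns the three splits plus word counts."""
    random.Random(seed).shuffle(samples)          # randomized split
    n = len(samples)
    train = samples[:int(0.6 * n)]                # 60% training (1200 of 2000 images)
    val = samples[int(0.6 * n):int(0.8 * n)]      # 20% validation
    test = samples[int(0.8 * n):]                 # 20% test
    counts = {name: Counter(word for _, cap in split for word in cap.lower().split())
              for name, split in (("train", train), ("val", val), ("test", test))}
    return train, val, test, counts
```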
(2) Training
The dataset was divided into 1200 images for training and 400 images for validation. We used adaptive moment estimation (Adam) 18), a gradient-based optimization method for stochastic objective functions. Adam is aimed at machine learning problems with large datasets and/or high-dimensional parameter spaces.
The loss function is categorical cross-entropy, and the number of epochs is 20, after which the training loss flattens. Fig. 3 shows the loss and accuracy over the training period. Clear overfitting is observed because of the relatively small size of the database for IC.
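Building on the model sketch above, the training setup described here might look as follows; the batch size and the feature dimension are illustrative assumptions not reported in the paper.

```python
# Sketch of the training setup: Adam optimizer, categorical cross-entropy, 20 epochs.
# feature_dim=2048 (e.g. InceptionV3/ResNet50 features) and batch_size=64 are assumptions;
# X_img_*, X_seq_*, and y_* are training pairs built as sketched earlier.
model = build_caption_model(feature_dim=2048, vocab_size=vocab_size, max_len=max_len)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    [X_img_train, X_seq_train], y_train,
    validation_data=([X_img_val, X_seq_val], y_val),
    epochs=20, batch_size=64,
)
```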
(3) Training time comparison:
Fig. 4 provides insights into the number of features handled by each model across its layers. Fig. 5 illustrates the time consumed at various stages of the model development process: "Preparing features" quantifies the time needed for image-to-feature conversion via the CNN, "Training" denotes the time spent training the model over 20 epochs, "Test Prediction" indicates the time taken for the model to predict new captions for the test set in order to evaluate the models, and "Total" is the overall time including all previous stages.
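At the test-prediction stage, each caption is generated one word at a time from the trained model. A minimal greedy-decoding sketch is shown below; the "startseq"/"endseq" delimiter tokens are an assumption not detailed in the paper, and alternatives such as beam search would also be possible.

```python
# Greedy decoding sketch: the caption grows one word at a time until the (assumed)
# "endseq" token is produced or the maximum caption length is reached.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, image_feature, max_len):
    words = ["startseq"]                                    # assumed start-of-caption token
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([" ".join(words)])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([np.array([image_feature]), seq], verbose=0)[0]
        next_word = tokenizer.index_word.get(int(np.argmax(probs)))  # most probable next word
        if next_word is None or next_word == "endseq":      # assumed end-of-caption token
            break
        words.append(next_word)
    return " ".join(words[1:])
```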
Analyzing these figures, a clear pattern emerges: VGG16 has the highest feature count across its layers, which results in relatively slower training and feature preparation times. After training, however, the time spent by all models is in a similar range. This shows that a CNN with more features inherently demands more time for model training, yet in practical scenarios this aspect is less decisive; instead, the choice of CNN should be grounded primarily in performance considerations.
The specifications of the machine used for training are as follows: Windows 10, Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz, 16.0 GB RAM, and Intel(R) UHD Graphics 630.
To evaluate model performance, we employed BLEU, ROUGE, CIDEr, METEOR, and SPICE scores 19-23). These metrics provide valuable insight into the quality of the generated captions and allow us to assess each model's effectiveness in producing accurate and meaningful captions.
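As an illustration, BLEU-1 through BLEU-4 can be computed with NLTK as sketched below; the choice of library and the two toy captions are our own assumptions rather than details from the study, and CIDEr, METEOR, and SPICE require dedicated tooling such as pycocoevalcap.

```python
# Toy BLEU computation with NLTK (tooling and captions are illustrative assumptions).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [["wall has severe diagonal crack".split()]]   # reference caption(s) per image
candidates = ["wall has severe crack".split()]              # generated caption per image
smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)                          # uniform n-gram weights -> BLEU-n
    score = corpus_bleu(references, candidates, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```

Even for this toy pair, higher-order n-gram overlap falls off quickly, which mirrors the lower BLEU3 and BLEU4 scores discussed below.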
The accuracy of the models is shown in Table 3. All models performed well, with good to high scores on all metrics. InceptionV3 performed slightly better than the rest of the models, while VGG16 had the lowest scores despite having the highest number of features.
N-gram precision scores (BLEU scores) evaluate the precision of n-grams (sequences of n words). BLEU3 and BLEU4 were lower because long n-grams rarely match exactly in our relatively short reference captions. This can also occur when models struggle to capture longer dependencies or to generate more diverse and contextually accurate phrases.
Recall-based metrics (ROUGE2 and SPICE) are also lower than the other metrics, as they focus on the recall of n-grams or semantic components between the predicted and reference captions. This indicates that the models miss some important n-grams or semantic components from the reference captions.
High scores in BLEU1 and BLEU2 demonstrate that the models are proficient in generating accurate single-word and two-word sequences, respectively. This indicates their ability to produce captions that closely match the individual words and short phrases in the reference captions.
Similarly, the higher ROUGE1 and ROUGEL scores indicate that the models excel in capturing the overlap of unigrams and longer sequences of words between the predicted and reference captions. This highlights their capability to preserve important content and maintain fluency in the generated captions.
The CIDEr and METEOR scores, which evaluate the quality of the generated captions against the reference captions, reflect the effectiveness of the models in capturing semantic similarities and producing captions that align well with the meaning of the reference captions.
Table 4 shows examples from the evaluation data together with the captions produced by all four models. When the damage is clearly visible and dominant in the image, all models produce very similar, correct captions. On the other hand, when the damage is partially hidden, or more than one element appears in the image, each model captures only part of the image. This is because the training data contains few such occurrences; performance should therefore be re-evaluated when such models are trained on larger datasets.
In conclusion, this study highlights the significance of image captioning as a promising approach for automated structural evaluation, particularly in the aftermath of disasters. By developing an image captioning model and evaluating it with popular CNN architectures, namely VGG16, ResNet50, InceptionV3, and EfficientNet, we have explored the potential of generating descriptive captions for structural damage images. Among the evaluated models, InceptionV3 demonstrated slightly better performance than the others, as evidenced by its higher scores in most of the calculated evaluation metrics (BLEU1, BLEU2, BLEU3, BLEU4, ROUGE1, ROUGE2, ROUGEL, CIDEr, METEOR, and SPICE). However, it is essential to note that all the evaluated models performed well, indicating their suitability for image captioning in structural evaluation. The findings of this study signify the potential of leveraging image captioning techniques in the field of structural assessment: generating descriptive captions provides valuable insights and context for analyzing structural damage, aiding decision-making processes after disasters. Future research should focus on further improving the accuracy, efficiency, and interpretability of automated structural evaluation using image captioning. Exploring advanced deep learning architectures, incorporating more complex image captioning models, and refining the training methodologies are some potential areas of investigation.