Road attachment facilities, such as road signs and lighting, are ubiquitous across vast road networks, making efficient inspection crucial. AI models have previously been proposed to classify the damage type of road attachment facilities. However, practical implementation requires an interpretable framework and the ability to estimate the damage level. This paper proposes a comprehensive framework that combines damage type classification with a Vision Transformer (ViT) and damage level estimation through the in-context learning of large vision-language models (VLMs). The ViT-based damage type classification provides an interpretable framework, while the VLM's in-context learning enables damage level estimation, a task that is challenging for ViTs alone. Finally, we evaluate our method on real-world images of road attachment facilities.
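To make the two-stage design concrete, the sketch below outlines one possible realization: a fine-tuned ViT predicts the damage type, and the prediction plus a few labeled exemplar images are then passed to a VLM as in-context examples for damage level estimation. This is only a minimal sketch under our own assumptions; the checkpoint name, label set, few-shot exemplars, and the `query_vlm` helper are illustrative placeholders and do not describe the authors' implementation.

```python
"""Two-stage pipeline sketch: ViT damage type classification followed by
VLM in-context damage level estimation. All names below are placeholders."""
import torch
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# --- Stage 1: damage type classification with a fine-tuned ViT ---
CKPT = "your-org/vit-road-facility-damage"      # hypothetical checkpoint
processor = ViTImageProcessor.from_pretrained(CKPT)
classifier = ViTForImageClassification.from_pretrained(CKPT)

def classify_damage_type(image: Image.Image) -> str:
    """Return the predicted damage type label for one facility image."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = classifier(**inputs).logits
    return classifier.config.id2label[int(logits.argmax(-1))]

# --- Stage 2: damage level estimation via in-context learning with a VLM ---
# A few (image, level) exemplars are placed in the prompt so the VLM can grade
# a new image by analogy. query_vlm() stands in for any multimodal chat API;
# its message schema here is generic, not that of a specific vendor.
FEW_SHOT = [
    ("examples/rust_minor.jpg", "level 1 (minor)"),    # placeholder exemplars
    ("examples/rust_severe.jpg", "level 3 (severe)"),
]

def estimate_damage_level(image_path: str, damage_type: str, query_vlm) -> str:
    """Build an in-context prompt with few-shot exemplars and query the VLM."""
    messages = [{
        "role": "system",
        "content": "Grade the damage level (1-3) of road attachment facilities.",
    }]
    for path, level in FEW_SHOT:  # in-context examples: image + known level
        messages.append({"role": "user",
                         "content": [{"type": "text", "text": f"Damage type: {damage_type}"},
                                     {"type": "image", "path": path}]})
        messages.append({"role": "assistant", "content": level})
    # Query image: the VLM infers the level by analogy with the exemplars.
    messages.append({"role": "user",
                     "content": [{"type": "text",
                                  "text": f"Damage type: {damage_type}. Estimate the damage level."},
                                 {"type": "image", "path": image_path}]})
    return query_vlm(messages)

# Example usage (paths and the VLM backend are assumptions):
# damage_type = classify_damage_type(Image.open("sign_001.jpg"))
# level = estimate_damage_level("sign_001.jpg", damage_type, query_vlm=my_vlm_client)
```

In this sketch, interpretability comes from the discrete damage type predicted in stage 1, while stage 2 delegates the harder, more graded judgment of severity to the VLM conditioned on a handful of labeled examples.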