2025 Volume 6 Issue 3 Pages 77-90
This study explores the development of a tool that detects various events of interest to river and road administrators by comparing current and past CCTV images. The approach leverages Visual Question Answering (VQA) tasks using Large Vision Language Models (LVLMs). The research began by identifying the tool’s operational requirements and constraints, including commercial usability and exclusion from restricted entity lists. Based on these criteria, three LVLMs were selected for testing: ChatGPT-4o, Gemini 1.5 Flash, and llava-llava-calm2-siglip. To evaluate practical applicability, real-world footage depicting events such as flooding and landslides was paired with prompts from the perspective of infrastructure managers. The outputs from each LVLM were assessed using precision, recall, and F1-score metrics. Among the models tested, ChatGPT-4o demonstrated the highest utility for practical deployment. The study also identified detectable event types in road and river contexts, clarified visual and system input conditions, and examined common causes of false positives and missed detections. Based on these insights, strategies for improving detection accuracy were proposed.
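For readers unfamiliar with the evaluation metrics named above, the following is a minimal sketch of how precision, recall, and F1-score can be computed from counts of correct detections, false positives, and missed detections. It uses the standard definitions and is not taken from the study's own evaluation code; the function name `detection_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Compute precision, recall, and F1 from detection counts.

    tp: events correctly detected (true positives)
    fp: events reported but not present (false positives)
    fn: events present but not reported (missed detections)
    """
    # Guard against division by zero when a count pair is empty.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative counts only; these are NOT results from the study.
    print(detection_metrics(tp=8, fp=2, fn=4))
```

High precision means few false alarms, while high recall means few missed events. F1 balances the two, which is useful when comparing models that trade one kind of error for the other.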