2024 Volume 5 Issue 1 Pages 89-97
A uniform nationwide survey of riverine space utilization has been conducted approximately every five years as part of the "Census of Rivers and Waterfront Areas" in Japan, to support the proper promotion of river projects and river management. Because the survey requires significant manual effort, it is commonly carried out on only seven days per year, and the year-round situation of a river is then roughly estimated from these limited results. It is therefore challenging to grasp the actual conditions on weekdays, on holidays, and at different times of the day, and difficult to quantitatively examine the effect of individual river maintenance works over the years. In this study, the authors attempted to automatically recognize human activities on the river bank from 4K camera images taken near the Asahi River diversion weir in Okayama Prefecture, using the object detection model YOLO (i.e., You Only Look Once) together with the large-scale multimodal model LLaVA (i.e., Large Language-and-Vision Assistant). The results showed that the combination of these models has the potential to collect information not only on the number and location of people but also on various human activities, such as walking, running, and skateboarding.