A uniform nationwide survey on riverine space utilization has been conducted approximately every five years as part of the "Census of Rivers and Waterfront Areas" in Japan, to properly promote river projects and river management. Because of the significant effort required for manual observation, the survey is commonly carried out on only seven days per year, and the present river situation throughout the year is then estimated roughly from these limited results. It is therefore challenging to grasp the actual conditions on weekdays, on holidays, and at different times of the day, and consequently difficult to quantitatively examine the effect of individual river maintenance works over the years. In this study, the authors attempted to recognize human activities on the river bank automatically from 4K camera images taken near the Asahi River diversion weir in Okayama Prefecture, using the object detection model YOLO (You Only Look Once) together with the large-scale multimodal model LLaVA (Large Language-and-Vision Assistant). The results showed that the combination of these models has the potential to collect information not only on the number and location of people but also on various human activities, such as walking, running, and skateboarding.
As part of the Census of Rivers and Waterfront Areas, two essential surveys, the Survey on the Number of River Space Users and the River Report Card, are conducted to comprehensively understand river space utilization1), 2). The user count survey, guided by the Census Manual for Rivers and Waterfront Areas, is carried out seven times per year, on crowded days selected by river management administrators, to capture the distinct characteristics of river use. Although this provides a holistic view of general river usage, the current methodology struggles to capture weekday patterns because of the trade-off among survey frequency, time, and cost, which limits understanding of the long-term, water-friendly environment of the riparian area.
The existing broad categorization of use patterns requires a more detailed breakdown to precisely understand river space utilization for future riparian environmental improvement, maintenance, and citizen convenience and protection (e.g., monitoring of dog walking and children's safety). This poses the challenge of moving from conventional human-eye-based monitoring to automatic computer-based analysis.
In recent years, many studies have been conducted to detect people using AI technology. For example, a study3) using the object detection model YOLO4) to understand the usage of walking and waiting spaces near bus stops showed the possibility of quantitatively determining items such as user attributes (gender and age), the number of bus riders, and the number of people using waiting benches. On the other hand, human activities such as "walking" and "sports (including cycling and running)," which are items in the survey on river space use, were not detected. In addition, no research has been conducted to obtain information on people by combining object detection models with large-scale multimodal models.
In response, this study proposes a detection method leveraging YOLOv84) and LLaVA5) to address these limitations. Focused on the riparian area, the approach aims to unravel how citizens use rivers. By merging the strengths of YOLOv8 (person counting) and LLaVA (analysis of person-based activities), the study seeks to provide an efficient detection approach that transcends the limitations of a single model, which cannot simultaneously handle person counting, selection of the activity analysis area, and interpretation of river space usage behaviors. Considering the practical burden of long-term personnel-based monitoring, an automatic, unattended analysis combining these models is a viable alternative.
Fig.1(a) highlights the position of the study site in Okayama Prefecture. Fig.1(b) depicts the location and viewing direction of the camera, and Fig.1(c) shows the 4K camera from a drone view. As shown in Fig.1(b), Ichinoarate lies within the bifurcation area of the Asahi River, where it acts as a crucial overflow weir responsible for channeling flood flow from the Asahi River into the distributary Hyakken River. This pivotal function underscores the importance of Ichinoarate in managing water flow in the lower reach of the Asahi River. Fig.1(b) also indicates the viewing angle of the 4K camera depicted in Fig.1(c); this positioning is instrumental in capturing river behavior, especially during flood events. A sample image from the 4K camera at normal water level is shown in Fig.2.
(1) Specifications of the video recording device
In this research, data were collected from a 4K camera (Hikvision, DS-7600 Series), shown in Fig.1, with a resolution of 3840 × 2160 px at 20 fps (frames per second). These river monitoring cameras were previously installed to provide real-time information on river conditions during floods; the recording period used here was from 3rd Aug. to 13th Dec. 2021. During non-flood seasons, the 4K cameras also continue operating, allowing data collection.
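For illustration, still frames for the analysis can be sampled from such recordings with a short Python script. This is a minimal sketch; the file name and the one-frame-per-minute sampling interval are assumptions rather than the authors' stated procedure.

```python
# Minimal sketch: sample still frames from the 4K river-monitoring video.
# The file name and sampling interval below are illustrative assumptions.
import cv2

cap = cv2.VideoCapture("asahi_river_4k.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)      # nominally 20 fps for this camera
step = int(fps * 60)                 # keep one frame per minute
idx = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:
        cv2.imwrite(f"frame_{saved:05d}.jpg", frame)
        saved += 1
    idx += 1
cap.release()
```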
(2) Outline of artificial intelligence models
In this research, two models were applied to analyze person-related numbers and activities from the computer-vision and LMM (large multimodal model) perspectives, respectively. The computer vision model, YOLOv8 (You Only Look Once, version 8), is a state-of-the-art deep learning framework for object detection, segmentation, pose estimation, and classification tasks; however, it cannot identify specific activities without additional training. The multimodal model, LLaVA (Large Language and Vision Assistant), combines vision and language for general-purpose visual and language understanding. The parameter settings are listed in Table 1.
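As a concrete illustration of how the two models are invoked, the following minimal Python sketch counts persons with YOLOv8 and queries LLaVA about the same frame. The checkpoint names (yolov8x.pt, llava-hf/llava-1.5-7b-hf), the file name, and the prompt wording are assumptions for illustration; the settings actually used are those in Table 1.

```python
# Minimal sketch: invoking YOLOv8 and LLaVA independently on one frame.
# Checkpoint names, file name, and prompt are illustrative assumptions.
from ultralytics import YOLO
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image

image = Image.open("frame_00000.jpg")        # a frame sampled as above

# --- YOLOv8: count persons (COCO class 0) without extra training ---
detector = YOLO("yolov8x.pt")
persons = [b for b in detector(image)[0].boxes if int(b.cls) == 0]
print(f"YOLOv8 detected {len(persons)} person(s)")

# --- LLaVA: free-form description of what the people are doing ---
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
llava = LlavaForConditionalGeneration.from_pretrained(model_id)
prompt = "USER: <image>\nHow many people are in this image and what are they doing? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
out = llava.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```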
(3) Workflow for investigating human activities
In Fig.3(a), the data collection phase begins by deploying a 4K camera positioned to capture the targeted area of interest. In the analysis stage, depicted in Fig.3(b) and (c), the YOLOv8 framework is used for object detection, locating the persons within the recorded 4K footage. The integration of LLaVA adds analytical depth, providing an understanding of the detected activities. By synergizing the capabilities of YOLOv8 and LLaVA, the method captures intricate details of human interactions with heightened accuracy. In the final step, Fig.3(d), the results are compiled and processed into comprehensive reports and visual representations that highlight the identified human activities.
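A hedged sketch of this workflow, building on the models loaded above, might look as follows; the crop margin and the activity prompt are illustrative assumptions rather than the authors' exact settings.

```python
# Sketch of the Fig.3 workflow: detect persons with YOLOv8, crop each
# detection from the 4K frame, and ask LLaVA to describe the activity.
def analyze_frame(image, detector, llava, processor, margin=0.2):
    report = []
    for box in detector(image)[0].boxes:
        if int(box.cls) != 0:                 # keep only the person class
            continue
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        # enlarge the crop slightly so surrounding context (e.g., a dog
        # or a bicycle) is preserved for LLaVA's activity analysis
        dx, dy = margin * (x2 - x1), margin * (y2 - y1)
        crop = image.crop((int(max(x1 - dx, 0)), int(max(y1 - dy, 0)),
                           int(min(x2 + dx, image.width)),
                           int(min(y2 + dy, image.height))))
        prompt = "USER: <image>\nWhat activity is the person doing? ASSISTANT:"
        inputs = processor(text=prompt, images=crop, return_tensors="pt")
        out = llava.generate(**inputs, max_new_tokens=50)
        report.append({"bbox": (x1, y1, x2, y2),
                       "activity": processor.decode(out[0], skip_special_tokens=True)})
    return report
```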
To optimize computational costs and refine the analysis area, the authors structured two distinct groups for testing. Group-A assessed the detection capabilities on the full-screen image and on individually cropped images, focusing on locations at different distances from the 4K camera, namely closer-distance and longer-distance locations. This evaluation aimed to discern the effectiveness of detection in different spatial contexts, providing insight into accurate person counting.
Group-B, in turn, investigated the impact of the analysis area's size on confirming activities involving interacting persons. This group examined how adjusting the size of the analysis area influenced the outcomes, particularly in scenarios involving interactions among individuals. By categorizing the tests into these two groups, the authors addressed location-related considerations in Group-A and size-related considerations in Group-B, collectively contributing to a more efficient and cost-effective analysis process.
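Group-B's size comparison can be emulated, for example, by re-querying LLaVA on the same detections with progressively larger analysis areas; the margin values below are arbitrary illustrative choices.

```python
# Illustrative Group-B test: vary the size of the analysis area and
# observe how LLaVA's description of the same detection changes.
for margin in (0.0, 0.5, 1.0):   # tight box up to a doubled box (assumed values)
    for r in analyze_frame(image, detector, llava, processor, margin=margin):
        print(f"margin={margin:.1f}  bbox={r['bbox']}  activity={r['activity']}")
```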
Images were selected by classifying cases in terms of weekday/holiday, morning/afternoon/night, and water-effluence/normal-water, covering most river conditions during the shooting period. As shown in Table 2, 36 cases were selected as Group-C. For each case, a single target image in which the presence of people on the river bank was confirmed was chosen. Water-effluence situations occurred only in August.
(1) Group-A: Analysis of persons in full-screen images
In Fig.4, a comparative analysis of the results obtained from LLaVA and YOLOv8 reveals that both methods extract person-related information successfully in the closer-distance area. However, LLaVA fails to extract any information from the full-screen image of the longer-distance area, whereas YOLOv8 detects three persons in this more distant region. As illustrated in Fig.5, a straightforward prompt was employed to verify LLaVA's ability to count persons on individually cropped images (i.e., the upper and lower halves of the original image). Even then, LLaVA encounters limitations in the longer-distance area, miscounting three persons as two.
The findings from Fig.4 thus reveal that person-related data were effectively extracted from the closer-distance output by both YOLOv8 and LLaVA. Conversely, the longer-distance output exhibited a divergence in performance, with YOLOv8 yielding partially correct results and LLaVA miscounting the number of persons entirely.
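The half-split counting check of Fig.5 can be reproduced along the following lines, building on the sketch above. Treating the upper half as the longer-distance area and the lower half as the closer-distance area is an assumption based on the camera geometry, and the prompt wording is likewise illustrative.

```python
# Group-A style check: ask LLaVA to count persons in the upper and
# lower halves of the full-screen image separately (cf. Fig.5).
w, h = image.size
halves = {"upper half (longer distance)": image.crop((0, 0, w, h // 2)),
          "lower half (closer distance)": image.crop((0, h // 2, w, h))}
for name, half in halves.items():
    prompt = "USER: <image>\nHow many people are in this image? ASSISTANT:"
    inputs = processor(text=prompt, images=half, return_tensors="pt")
    out = llava.generate(**inputs, max_new_tokens=20)
    print(name, "->", processor.decode(out[0], skip_special_tokens=True))
```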
(2) Group-B: Analysis of activities involving interacting persons
The upper panel of Fig.6 shows a woman running, and the middle panel depicts two individuals, a woman and a child, walking or running along the river bank. Notably, when the cropping area is expanded, as demonstrated in the lower panel of Fig.6, an additional element is introduced: a woman with a leashed dog. This enlargement of the cropping area induces a discernible shift in the outcomes generated by LLaVA.
Drawing insights from the LLaVA-derived results, it becomes apparent that when analyzing a group of persons, particularly when interactions among individuals come into play, careful consideration must be given to the selection of the cropped area. The change in behavior introduced by the inclusion of a leashed dog underscores the importance of judiciously defining the spatial boundaries of the analysis. The implications of alterations in the cropping area on the interpretability of the results must therefore be taken into account, ensuring a nuanced and comprehensive understanding of person-related activities within a given scene.
(3) Group-C: Application to practical cases
To avert misclassification of activities resulting from incomplete person coverage, the cropping process depicted in Fig.7 was deliberately adopted. Considering the diverse periods encompassed at the study site, the authors selected 36 cases, detailed in Table 2, derived from various situations. In line with the findings from Group-A, the focus was placed on closer-distance areas.
This methodology ensures that the images encapsulate all interacting individuals, thereby enhancing the precision of LLaVA's analysis. The requisite information was then extracted from the images, as illustrated in Fig.8 by the delineated yellow areas (the bounding-box- and mask-based results on the individual objects are shown in Fig.9). In Fig.10, the authors applied the prompt used for detecting person counts and activities. The person-related activities and numbers discerned through LLaVA across the different cases are compiled in Table 3. Compared with the ground truth (GT), which was counted manually by the authors while avoiding double counting, the person-related activities and numbers were accurately detected in half of the 36 cases. The output in Table 3 shows that the prompt is not fully accurate for person-based activities and numbers, and other prompts will be needed in the future for accurate detection.
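A Group-C style query might be phrased as below; the prompt wording and file name are assumptions, as the authors' actual prompt is the one shown in Fig.10.

```python
# Illustrative Group-C query: ask LLaVA for the person count and the
# activities in one cropped case image, then compare with the ground truth.
case_image = Image.open("case_01_crop.jpg")   # hypothetical case image
prompt = ("USER: <image>\nHow many people are in this image, "
          "and what is each person doing? ASSISTANT:")
inputs = processor(text=prompt, images=case_image, return_tensors="pt")
out = llava.generate(**inputs, max_new_tokens=80)
print(processor.decode(out[0], skip_special_tokens=True))
# Each answer is compared against the manually counted GT per case;
# the paper reports agreement in 18 of the 36 cases.
```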
Considering that CCTV cameras are more widely used than 4K cameras for river monitoring in Japan, the authors performed similar validations by resizing the video captured with the 4K camera from its original resolution (3840 × 2160 px) to a standard CCTV resolution (1920 × 1080 px), as shown in Table 4. As a result, no significant differences were found between the original and resized datasets in the YOLO and LLaVA analyses. This implies that the earlier results of this study may be applicable to the CCTV cameras installed along most Japanese rivers.
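The CCTV-resolution check can be sketched as follows; the file names are illustrative, and INTER_AREA is simply a common choice for downsampling rather than the authors' stated method.

```python
# Sketch of the resolution check: downsample the 4K frame to 1920 x 1080
# and re-run the same YOLOv8 + LLaVA analysis on the resized image.
import cv2
frame_4k = cv2.imread("frame_00000.jpg")               # 3840 x 2160
frame_hd = cv2.resize(frame_4k, (1920, 1080), interpolation=cv2.INTER_AREA)
cv2.imwrite("frame_00000_hd.jpg", frame_hd)
print(analyze_frame(Image.open("frame_00000_hd.jpg"), detector, llava, processor))
```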
The current study introduces a methodology that integrates YOLOv8 and LLaVA for the daily observation of person-based activity in riverine space. Leveraging carefully designed prompts, this approach successfully detected person-related activities and numbers in 18 of the 36 cases examined. This fusion of YOLOv8 and LLaVA, guided by prompts, marks a significant stride towards understanding the long-term, water-friendly environment of the riparian area.
The authors are grateful to the Chugoku Regional Development Bureau, Ministry of Land, Infrastructure, Transport and Tourism, for providing materials related to river management tasks and 4K video images along the Asahi River and Hyakken River.