Intelligence, Informatics and Infrastructure
Online ISSN : 2758-5816
A Road Narrowing Condition Estimation from In-vehicle Camera Videos via Late-Fusion based on Confidence Level Integration of Multiple Classifiers
Hiroki KINOSHITA, Masahiro YAGI, Sho TAKAHASHI, Toru HAGIWARA

2024 Volume 5 Issue 1 Pages 80-88

Abstract

In snowy and cold regions, snow piled on road shoulders may cause traffic congestion and even traffic accidents, making it a serious challenge to maintain the functionality of urban transportation. To maintain the effective width of roads during the winter, piled snow on the road shoulders is removed and cleared. However, gathering road information requires a great deal of time and labor. In this paper, we propose a novel method for classifying the effective road width, narrowed by piled snow on road shoulders, from videos captured by in-vehicle cameras using features focused on piled snow. Estimating road narrowing conditions from in-vehicle cameras will enable ordinary vehicles to collect road information, creating an environment in which road information can be gathered without much time or labor.

1. INTRODUCTION

In snowy and cold regions such as Hokkaido, Japan, snow removal operations are carried out in winter to ensure smooth road traffic. Every time snow is removed from the road surface, piled snow forms on the road shoulder, narrowing the effective width of the road. When a road is narrowed by piled snow, it obstructs urban functions and road traffic services, and this obstacle can cause serious traffic accidents and traffic congestion. Currently, visual inspections by road administrators are performed to ensure smooth road traffic. As road networks become denser, more roads must be inspected, requiring much time and labor. However, the number of snow removal workers is decreasing year by year. Therefore, snow removal operations in snowy and cold regions must become more efficient.

In recent years, cameras have been mounted in many vehicles thanks to advances in miniaturization and cost reduction. Previous studies in the field of road management include a road condition monitoring system using low-cost MEMS accelerometers to detect abnormalities in road surfaces1) and a method using portable sensors to monitor road surface conditions2). In addition, studies3,4) that analyze images obtained from inexpensive vehicle-mounted cameras can be cited as methods for estimating road surface deterioration and winter road surface conditions.

In the previous study5), the authors proposed a method to estimate road narrowing conditions from in-vehicle camera videos at the three levels shown in Fig.1. Level 3 is defined based on the criteria for dispatching snow removal operations in Sapporo city. Level 1 is defined because it is necessary to know whether there is piled snow on the road. In reference5), five features, namely color, power spectrum, traffic lights, surrounding vehicles, and piled snow, are utilized to estimate the road narrowing conditions. However, that study assumes the features are extracted ideally; when features are not accurately extracted, the estimation accuracy may decrease significantly. For instance, the features based on traffic lights and surrounding vehicles are obtained from an object detection model, and there may be cases where these features are not appropriately extracted due to detection failures. Furthermore, in the previous study, when one feature is excluded during classification of the road surface condition, the estimation accuracy decreases significantly.

Some features cannot be extracted in actual situations due to sudden errors, such as snowfall covering the sensors. In some cases, high-cost sensors cannot be mounted. When a part of a method becomes unusable, severe performance degradation is undesirable, and such a decrease in accuracy would be a significant issue for practical use. Therefore, a method whose performance does not drop significantly even if some features are not correctly extracted is preferable.

Late fusion methods are known to be robust to partial errors: even if one modality has an error, it does not necessarily propagate to the others6). Also, when modalities have non-correlated errors, late fusion can be exceptionally robust, since the chances of all modalities making the same mistake are low7). The late-fusion approach allows each feature to be optimized independently; in other words, even if one feature is not extracted correctly, it is unlikely to affect the overall result. Therefore, we propose a late-fusion method that classifies the road narrowing condition by extracting confidence levels from the feature vectors.

This paper is organized as follows. First, the details of the proposed method are explained in 2. In 3, an experiment is conducted to verify the effectiveness of the proposed method. Finally, conclusions are presented in 4.

2. CLASSIFICATION OF THE ROAD NARROWING CONDITION

This section explains our proposed method for road narrowing condition estimation. An overview of the proposed method is shown in Fig.2. Five features are extracted from the in-vehicle camera videos in 2.(1). From 2.(2) to 2.(4), the proposed late-fusion-based classification is explained. Specifically, each feature is classified into three levels, and 15 confidence levels are extracted in 2.(2). In 2.(3), the confidence levels are integrated. Finally, the videos are classified into three levels in 2.(4).

(1) Calculation of feature vectors

Here, the five features extracted from the video are explained. First, features based on color and power spectrum are explained in 2.(1)a). Then, features based on traffic lights are explained in 2.(1)b). Next, features based on surrounding vehicles are explained in 2.(1)c). Finally, features based on piled snow are explained in 2.(1)d).

a) Features based on color and power spectrum

When the road is narrowed, the surface is covered with snowfall and piled snow on the shoulders. On the other hand, when the road is not narrowed, the asphalt of the road surface tends to be visible. Therefore, the color of the road surface changes according to the road narrowing level. In addition, when the road is narrowed, the luminance gradient tends to decrease with snowfall and piled snow, meaning that the spatial frequency changes as the road narrows. Therefore, extracting features based on color and the power spectrum is considered effective.

In the proposed method, each W × H pixel frame of the video is split into k slices, and a region focused on the road surface is extracted, as shown in Fig.3. The extracted region is divided vertically into l slices and horizontally into m slices to obtain l × m patches, and features based on the color and power spectrum are extracted for each patch. Specifically, each acquired patch is first converted to HSV space, a histogram of hue and saturation is calculated for the patch, and a feature vector based on color is obtained. Here, i (i = 1, 2, ..., I) represents each input video.
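As an illustrative sketch of this step, the code below computes per-patch hue/saturation histograms. It assumes the frame is already in HSV and that the lower half of the frame approximates the road-surface region; both the bin count and the region choice are our assumptions, not values from the paper.

```python
import numpy as np

def color_features(hsv_frame, l=2, m=3, bins=16):
    """Per-patch hue/saturation histograms. The frame is assumed to be
    already in HSV (hue in [0, 180), saturation in [0, 256)); in practice
    the BGR-to-HSV conversion would be done with e.g. cv2.cvtColor.
    Taking the lower half as the road-surface region is an assumption."""
    region = hsv_frame[hsv_frame.shape[0] // 2:, :]
    ph, pw = region.shape[0] // l, region.shape[1] // m
    feats = []
    for r in range(l):
        for c in range(m):
            patch = region[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            hue, _ = np.histogram(patch[..., 0], bins=bins, range=(0, 180), density=True)
            sat, _ = np.histogram(patch[..., 1], bins=bins, range=(0, 256), density=True)
            feats.extend(hue)
            feats.extend(sat)
    return np.asarray(feats)
```

With l = 2 and m = 3 and 16 bins each for hue and saturation, this yields a 192-dimensional color feature per frame.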

Next, to obtain features based on the power spectrum of the images, a Gabor filter and the fast Fourier transform are applied to each patch to compute features related to spatial frequency, yielding a feature vector based on the power spectrum.
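The FFT part can be sketched as a radially binned power spectrum per grayscale patch; the paper additionally applies Gabor filters (omitted here), and the number of radial bands is our choice.

```python
import numpy as np

def spectrum_features(gray_patch, n_bands=8):
    """Radially binned FFT power spectrum of a grayscale patch.
    The paper also applies Gabor filters (omitted), and the number of
    radial bands is an assumption of this sketch."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(gray_patch))) ** 2
    h, w = power.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    r_max = r.max() + 1e-9  # epsilon so the farthest pixel falls in the last band
    feats = []
    for b in range(n_bands):
        mask = (r >= b * r_max / n_bands) & (r < (b + 1) * r_max / n_bands)
        feats.append(power[mask].mean())  # average power in each frequency band
    return np.log1p(np.asarray(feats))
```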

b) Features based on traffic lights

When the road is narrowed by piled snow on the road shoulder, the positions the vehicle can take are limited, while the position of the traffic lights does not change. When the vehicle can travel in the outer lane, the road is not narrowed. Since the position of the traffic lights does not change as the road narrows, the vehicle's position relative to the road can be determined by extracting the position of the traffic lights in the video taken from the vehicle. Therefore, the vehicle's position is expected to indicate the narrowing of the road.

Therefore, the proposed method extracts features based on traffic lights. Object detection using YOLOv38) calculates the type and location of objects in the video, enabling the detection of traffic lights and the estimation of road narrowing from the characteristics of the vehicle's driving position. A feature vector based on traffic lights is extracted from each video.

c) Features based on surrounding vehicles

When a road is narrowed, its effective width is reduced, limiting the running positions of the surrounding vehicles. When the road narrows, the preceding vehicle tends to appear near the center of the image, while oncoming vehicles are more likely to pass close to the camera vehicle. Therefore, it is possible to estimate the narrowing of the road from the positional relationship between the preceding vehicle, the oncoming vehicle, and our vehicle.

Therefore, the proposed method extracts features based on surrounding vehicles. Specifically, YOLOv3 is applied to the video from the in-vehicle camera, and the coordinates of the surrounding vehicles captured in the video are extracted. In the proposed method, the center coordinates Xn, Yn of the objects corresponding to the label "car" are obtained from each input video frame, giving the coordinates Xnm and Ynm of the surrounding vehicles, as shown in Fig.4. From the coordinates Xnm and Ynm of the preceding and oncoming vehicles, a feature vector consisting of the mean and variance of the coordinates and the change in the coordinates between adjacent frames is obtained.
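As a sketch of this step, assuming the detector output has already been reduced to one (x, y) vehicle center per frame (detector integration and handling of frames with no detection are omitted), the statistics above can be computed as:

```python
import numpy as np

def vehicle_position_features(centers_per_frame):
    """Feature vector from per-frame vehicle center coordinates.
    `centers_per_frame` is a list of (x, y) tuples, one per frame,
    e.g. the center of a 'car' detection. The reduction to one
    center per frame is an assumption of this sketch."""
    xy = np.asarray(centers_per_frame, dtype=float)  # shape (frames, 2)
    deltas = np.diff(xy, axis=0)                     # change between adjacent frames
    return np.concatenate([
        xy.mean(axis=0),             # mean x, mean y
        xy.var(axis=0),              # variance of x, y
        np.abs(deltas).mean(axis=0)  # mean per-frame coordinate change
    ])
```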

d) Features based on piled snow

As piled snow on the road shoulder increases, the road tends to narrow. Therefore, it is possible to estimate road narrowing from the size of the piled snow on the road shoulders. Depth estimation, a method to measure the distance from a camera to an object by analyzing a two-dimensional video or image, is used to calculate features based on piled snow. There are two types of depth estimation: monocular and binocular. This paper assumes images from an in-vehicle camera installed in an ordinary vehicle, and many vehicles are equipped with monocular cameras. Therefore, the proposed method uses monocular depth estimation9). Two convolutional neural networks, Depth CNN and Pose CNN, are utilized to estimate the depth of objects in three-dimensional space captured in a video image.

The depth map calculated by monocular depth estimation is used to estimate the size of piled snow; the proposed method uses the depth map to extract features based on piled snow. A depth map is calculated for each frame of the input video. An example of the obtained depth map is shown in Fig.5, where the distance from the camera to an object is represented by color.


Next, binarization is performed on the obtained depth map (Fig.6(a)), where classes 0 and 1 are defined by a threshold value t. The threshold t for the input image is calculated as follows. Let Pall be the total number of pixels, P0 the number of pixels in class 0, and P1 the number of pixels in class 1. Then R0, the percentage of pixels in class 0, and R1, the percentage of pixels in class 1, are given by the following equation.

R0 = P0/Pall,  R1 = P1/Pall  (1)

When the average luminance of all pixels is Mall, the average within class 0 is M0, and the average within class 1 is M1, the between-class variance Sb² is given by the following equation.

Sb² = R0(M0 − Mall)² + R1(M1 − Mall)²  (2)

If the variance within class 0 is S0² and that within class 1 is S1², the within-class variance Sw², which evaluates the overall variance of the classes, is defined by the following equation.

Sw² = R0S0² + R1S1²  (3)

In this case, the threshold is determined so that equation (2) is maximized and equation (3) is minimized. Specifically, the optimal threshold t should maximize the ratio of the between-class variance Sb² to the within-class variance Sw².

X = Sb²/Sw²  (4)

Here, consider the overall variance Sall = Sw² + Sb². The overall variance can be regarded as a constant that does not affect the threshold value t. Therefore, equation (4) can be transformed into the following equation.

X = Sb²/(Sall − Sb²)  (5)

Since the overall variance Sall is a constant, to maximize equation (4), the between-class variance Sb² should be maximized. Therefore, the optimal threshold t is determined by maximizing the following equation.

Sb² = R0R1(M0 − M1)²  (6)
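As a concrete sketch, this Otsu-style threshold selection can be implemented as follows, using the equivalent between-class-variance form R0·R1·(M0 − M1)²; quantizing the depth map to 8-bit values is an assumption of this sketch.

```python
import numpy as np

def otsu_threshold(depth_map, levels=256):
    """Select the threshold t that maximizes the between-class variance.
    Uses the equivalent form R0 * R1 * (M0 - M1)^2; the depth map is
    assumed to be quantized to 8-bit values (an assumption of this sketch)."""
    hist, _ = np.histogram(depth_map, bins=levels, range=(0, levels))
    total = hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, levels):
        p0, p1 = hist[:t].sum(), hist[t:].sum()   # pixel counts P0, P1
        if p0 == 0 or p1 == 0:
            continue
        m0 = (np.arange(t) * hist[:t]).sum() / p0          # class-0 mean M0
        m1 = (np.arange(t, levels) * hist[t:]).sum() / p1  # class-1 mean M1
        between = (p0 / total) * (p1 / total) * (m0 - m1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t
```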

Fig.6(b) shows an example of a binarized image using the threshold value t calculated based on the obtained depth map. Next, the least-squares method calculates an approximate curve for the binarized image to reduce noise. An example of the obtained approximation curve is shown in Fig.6(c).

Next, consider the approximate curve f(x) in the image space, i.e., in the X-Y plane. If the number of pixels in the image is w × u, the range of x can be expressed by equation (7).

Next, find the difference between f(v) at x = v and f(v + 1) at x = v + 1. Letting g denote the difference obtained, g can be expressed as in equation (8).

g(v) = f(v + 1) − f(v)  (8)

The range of v is then as in equation (9).

From this, u − 1 values of g are extracted, and the (u − 1)-dimensional feature vector is made up of these elements. Referring to the boundary line between the white and black areas in the images in Fig.7, the boundary line tends to change from a straight line to a curve as the road narrows (Fig.7(a) → Fig.7(b) → Fig.7(c)).
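The curve-fitting and differencing steps can be sketched as follows; extracting the boundary position from the binarized image is omitted, and the input format is an assumption.

```python
import numpy as np

def boundary_curve_features(boundary_y, degree=6):
    """Fit a sixth-order approximate curve to the white/black boundary of
    the binarized depth map and use adjacent-point differences as features.
    `boundary_y[x]` is the boundary position in column x; extracting the
    boundary from the binarized image is omitted here."""
    x = np.arange(len(boundary_y))
    coeffs = np.polyfit(x, boundary_y, degree)  # least-squares fit
    f = np.polyval(coeffs, x)                   # smoothed curve f(x)
    return np.diff(f)                           # g(v) = f(v + 1) - f(v)
```

A nearly straight boundary yields nearly constant differences, while a curved boundary (a narrowed road) yields varying differences, which is the signal this feature captures.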

(2) Calculation of Confidence Levels

This section explains the calculation of confidence levels for each feature. The proposed method uses an ELM classifier to estimate the road narrowing conditions.

The ELM10) is a type of single-hidden-layer feedforward neural network (SLFN), consisting of three layers. It enables fast learning and universal approximation, and is known as a classifier that can ensure discriminative accuracy even with a small amount of training data.

First, as training data, we consider a set of feature vectors fi and narrowing levels zi, where zi is a one-hot vector (e.g., zi = (1, 0, 0) if the narrowing level of video i is Level 1). The specific calculation of the ELM is as follows. A feature transformation using the sigmoid function g is applied to fi according to the following equation.

where uk (k = 1, 2, ..., K) and vk (k = 1, 2, ..., K) are parameters of the sigmoid function g, and K is the number of nodes in the hidden layer. Next, the weights of the final layer, β, are calculated by the following equation.

where Z = [z1, z2, ..., zM]T, D = [d(f1), d(f2), ..., d(fM)]T, and M is the number of training data.

Finally, for test data, when a feature vector f is input to the ELM, the output value is g = d(f)Tβ, and the class label is that corresponding to the node that outputs the largest value among g.

In the proposed method, each of the five features is input to the ELM classifier to obtain three confidence levels, giving a total of 15 confidence levels, Kc.
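A minimal ELM matching the equations above can be sketched as follows: random hidden parameters uk and vk, and output weights β obtained in closed form via the pseudo-inverse. Sizes and the random seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elm_train(F, Z, K):
    """F: (M, d) training feature vectors; Z: (M, 3) one-hot level labels;
    K: number of hidden nodes. Hidden parameters are random, and the
    output weights beta are the least-squares solution beta = D^+ Z."""
    U = rng.normal(size=(F.shape[1], K))  # random input weights u_k
    v = rng.normal(size=K)                # random biases v_k
    D = sigmoid(F @ U + v)                # hidden-layer outputs d(f_i)
    beta = np.linalg.pinv(D) @ Z
    return U, v, beta

def elm_confidences(f, U, v, beta):
    """Output g = d(f)^T beta, used as the three confidence levels."""
    return sigmoid(f @ U + v) @ beta
```

The three output values per feature are used directly as that feature's confidence levels; concatenating them over the five features gives the 15-dimensional vector Kc.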

(3) Integration of confidence levels based on canonical correlation analysis

Canonical correlation analysis (CCA)11) is a multivariate statistical method for finding correlations between two sets of data. CCA simultaneously evaluates the linear relationships between multiple variables of two data sets.

Two sets of variables, x and y, are linearly transformed to create a space that maximizes their correlation.

Assuming the created space is one-dimensional, x and y are transformed by the equations below.

Next, we find a and b such that the correlation coefficient in the following equation is maximized.

a and b are obtained as the eigenvectors corresponding to the largest eigenvalue of a generalized eigenvalue problem.

In the proposed method, the 15-dimensional confidence-level vector Kc and its ideal form Kc′ are set as the two variables. For instance, when the road narrowing condition is assumed to be Level 1, the ideal form of Kc′ is shown below.

Canonical correlation analysis is performed on these two variables to obtain a and b. With the coefficient a, Kc is feature-transformed by the equation below.

The Kc obtained in 2.(2) is a confidence level calculated in the ELM feature space. Therefore, when constructing a classifier with Kc as input, appropriate classification may not be performed if the feature spaces differ, so feature transformation using canonical correlation analysis is performed.
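A sketch of first-pair CCA under the standard formulation is shown below. The small ridge term and the eigenproblem solver are our choices; the paper does not specify its solver.

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-6):
    """First canonical pair (a, b) maximizing corr(Xa, Yb).
    A small ridge `reg` keeps the covariance matrices invertible;
    the solver choice is ours, the paper does not specify one."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = len(X)
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # Solve Sxx^{-1} Sxy Syy^{-1} Syx a = rho^2 a for the top eigenvector.
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    vals, vecs = np.linalg.eig(M)
    a = np.real(vecs[:, np.argmax(np.real(vals))])
    b = np.linalg.solve(Syy, Sxy.T @ a)
    return a, b / np.linalg.norm(b)
```

In the proposed method, X would hold the confidence vectors Kc and Y their ideal forms Kc′, and the resulting coefficient a is used to transform Kc.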

(4) Feature vectors obtained by the integration of confidence levels

SVMs12), which are widely used in image recognition and other applications, perform classification by mapping feature vectors to a higher-dimensional feature space and constructing a discriminative hyperplane. In the proposed method, the discriminant pi,z,SVM is determined by inputting the transformed feature vector obtained in the previous sections into the discriminant function shown in the following equation.

Note that w and b are parameters obtained by training. In SVM, these parameters are obtained by maximizing the distance between the discriminative hyperplane and the feature vector closest to it. Specifically, the proposed method obtains the parameters using n training data consisting of feature vectors and their labels, under the margin constraint.

In the proposed method, the feature vector Kc′′ is input to the SVM classifier, and the road narrowing condition is estimated.

In the proposed method, a confidence level is calculated from each feature, and these are input to the SVM, which performs an integrated analysis to classify them. This ensures that the confidence of each feature is optimized independently and that an error in any one feature does not affect the others, which is expected to maintain stable classification accuracy.
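The final fusion step can be sketched with a standard SVM implementation. Using scikit-learn's SVC is our assumption; the kernel and hyperparameters are defaults, not values taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def late_fusion_classify(conf_train, y_train, conf_test):
    """Classify road narrowing levels from 15-dim confidence vectors
    (5 features x 3 levels, already transformed as described above).
    Kernel and hyperparameters are library defaults, not the paper's."""
    clf = SVC(kernel="rbf")
    clf.fit(conf_train, y_train)
    return clf.predict(conf_test)
```

Because each 3-dimensional block of the input comes from an independently trained classifier, noise in one block perturbs only 3 of the 15 dimensions, which is the robustness property late fusion is expected to provide.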

3. EXPERIMENTS

In this section, the effectiveness of the proposed method is verified using actual in-vehicle camera images. In 3.(1), an outline of the experiment is explained. The comparative methods are described in 3.(2), and experimental results are shown in 3.(3).

(1) Experimental conditions

An experiment is conducted to verify the proposed method's effectiveness. Video images from an in-vehicle camera mounted on a vehicle driven on a four-lane road in Sapporo, Hokkaido, Japan, are used. The data was obtained from the six sections shown in Fig.8. The dataset includes roads narrowed by piled snow and roads widened after snow removal in the same sections. The dataset used in the experiment is shown in Table 1. The number of pixels in each frame of the input video is 640 × 480, and each video is 3 seconds long at 30 fps. The experiment is conducted using five-fold cross-validation.

The number of segments in each frame, as described in 2.(1)a), is l = 2 and m = 3. Generally, the road tends to narrow as the snow on the shoulder increases, as shown in Fig.9(a) and Fig.9(b). These are examples of splitting the lower part of the image with l = 2 and m = 3, where red lines indicate the boundaries of the segmented areas. The upper-left patch is dominated by snow in Fig.9(a), while in Fig.9(b) it contains objects such as utility poles, since the piled snow there is not large. The same applies to the upper-right patch, depending on the image. By setting m = 3, it is possible to compute the piled snow on the left shoulder, the road surface, and the piled snow on the right shoulder separately. Since features are calculated for each patch, setting l or m to a large value increases the amount of computation accordingly. For these reasons, l = 2 and m = 3 are used.

The approximate curve described in 2.(1)d) is a sixth-order curve. In calculating the approximation curve, we derived approximation curves from first to tenth order and experimentally verified the effectiveness of the sixth-order curve.

The YOLOv3 model is trained on the open COCO13) dataset for object detection and region segmentation. The number of nodes in the hidden layer of the ELM is twice the dimension of the feature vectors.

SVM tends to achieve higher classification accuracy than ELM when the same dataset is utilized. However, SVM does not classify accurately when the number of samples does not exceed the dimensionality of the input feature vectors. Since the dimensionality of the extracted features exceeds the number of samples, ELM is utilized in 2.(2). On the other hand, SVM is utilized in 2.(4) because the dimensionality of its input features is 15, which is lower than the number of samples.

In the experiment, quantitative evaluation is conducted using the following equations.
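The evaluation equations are not reproduced in this version; the sketch below uses the standard per-level precision and recall definitions, which are assumed to match the paper's criteria.

```python
import numpy as np

def precision_recall(y_true, y_pred, level):
    """Per-level precision and recall (standard definitions; whether
    these match the paper's exact equations is an assumption)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == level) & (y_true == level))
    precision = tp / max(np.sum(y_pred == level), 1)
    recall = tp / max(np.sum(y_true == level), 1)
    return precision, recall
```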

(2) Comparative methods

This section describes the comparative methods used to confirm the effectiveness of the proposed method. The experiment involves four comparative methods, each described in 3.(2)a)~3.(2)d).

a) Comparative method 1

First, the proposed method is compared with the method proposed in reference5). In this method, the features are extracted as explained in 2.(1), and the feature vector fi is formed by the following equation.

fi is then input to the ELM classifier, and the road narrowing condition is classified with the following equation.

b) Comparative method 2

In comparative method 2, the ELM classifier is utilized in 2.(4) instead of the SVM. By comparing the proposed method with this comparative method, the effectiveness of utilizing the SVM is verified.

c) Comparative method 3

In comparative method 3, the road narrowing condition is classified without canonical correlation analysis: the confidence vector Kc obtained in 2.(2) is input directly to the SVM classifier. By comparing the proposed method with comparative method 3, the effectiveness of feature transformation by canonical correlation analysis is verified.

d) Comparative method 4

In comparative method 4, the estimation uses neither canonical correlation analysis nor the SVM. Instead, the road narrowing condition is estimated by the ELM.

(3) Experimental results

In this subsection, experimental results are shown. To simulate a feature that was not extracted properly, each feature is in turn replaced with a matrix of random numbers. In actual situations, some features may not be extracted appropriately due to errors such as snow covering the sensors. The robustness of the methods can be compared through the resulting decrease in accuracy of the proposed method and the method of reference5).

Experimental results are shown in Table 2 and Table 3. Table 3 shows the decline rate of classification accuracy when one of the five extracted feature vectors is replaced with noise before classification. The decline rate is calculated by the following equation.
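The decline-rate equation itself is not reproduced in this version; a natural form, assumed in this sketch, is the relative drop from the original accuracy.

```python
def decline_rate(acc_original, acc_with_noise):
    """Decline rate of classification accuracy when a feature is replaced
    by noise. The relative-drop form is an assumption; the paper's exact
    equation is not reproduced in this version."""
    return (acc_original - acc_with_noise) / acc_original
```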

Table 2 shows that the PM (proposed method) and CM1 (comparative method 1) achieve 75% and 82% accuracy, respectively, so the proposed method is about 8% less accurate than CM1. However, Table 3 shows that the average degradation rate of the PM is 9%, while that of CM1 is 13%: when a feature is converted to noise, the PM degrades about 31% less than CM1. Therefore, comparing the proposed method with CM1, the proposed method achieves a more robust classification.

Next, Table 2 shows that the PM and CM2 achieve higher classification accuracy than CM3 and CM4. This confirms the effectiveness of feature transformation based on canonical correlation analysis in terms of classification accuracy. Comparing CM2 and CM4, CM4 has a lower decline rate than CM2. We consider this to be because CM2 and CM4 use ELMs as the classifier instead of the SVM. When features are transformed using CCA in CM2, their value range is changed so that the features concentrate around 0 and 1, and the ELM therefore specializes in classifying features near these values. Feature vectors converted to noise contain random values that are not concentrated around 0 and 1, so the accuracy of CM2 declines more sharply than that of CM4. The SVM, on the other hand, learns the feature space so as to maximize its margin. Therefore, the proposed method achieved a lower decline rate than CM2.

Table 2 shows that CM1 has higher classification accuracy than the PM on all evaluation criteria except the Level 2 precision and the Level 3 recall. This means that improving the classification accuracy of the PM is future work. Overall, the experimental results indicate that the proposed method effectively reduces the degradation of classification accuracy when features are not extracted appropriately.

4. Conclusion

This paper proposed a highly robust method for classifying road narrowing conditions utilizing late fusion and feature transformation based on canonical correlation analysis. Experimental results confirmed the effectiveness of the proposed method. However, the method of the previous study achieves higher estimation accuracy, so improving the accuracy of the proposed method remains as future work.

In addition, from the viewpoint of improving the efficiency of snow removal operations, smoother road management would be possible if roads likely to become narrowed could be predicted, not only classified. If the images acquired from in-vehicle cameras can be analyzed based on time-series relationships, the efficiency of snow removal planning is expected to improve further. Moreover, mapping the acquired information onto maps and other media will reduce traffic congestion and traffic accidents and provide helpful information for snow removal planning and for all road users.

ACKNOWLEDGMENT

This work was partly supported by JSPS KAKENHI Grant Number JP22H01607 and by the Committee on Advanced Road Technology under the authority of the Ministry of Land, Infrastructure, Transport and Tourism in Japan (Project name: "Technological development of digital twin-oriented transportation management system in winter road"; Principal Investigator: Assoc. Prof. Sho Takahashi, Hokkaido University).

References
 
© 2024 Japan Society of Civil Engineers