2024, Volume 5, Issue 1, pp. 15-24
The upsurge in road crashes is a global challenge, and it poses a particularly serious threat in low- and middle-income countries, where road crashes continue to claim many lives. Road safety management focuses on mitigating crashes by predicting their frequency and severity and by building safer roads. Predicting crash severity is essential because avoiding severe crashes is the priority. Factors such as the time of the crash, the type of collision, the type and number of vehicles involved, and road features such as geometric properties, pavement condition, and the surrounding environment of the crash location can govern the severity of crashes. Because numerous factors are involved in crash occurrence, predicting crash severity with high accuracy is a difficult task. Decision trees and random forests are two tree-based machine learning algorithms widely used to predict crash severity because they are considered to produce accurate predictions. The objective of this study is to develop road crash severity prediction models for a mountainous highway using these two tree-based algorithms and to assess the factors that affect crash severity under two different data treatment approaches: a) considering only the type of vehicle involved in a crash, and b) considering both the type and number of vehicles involved in a crash. The performance of the two tree-based models (decision tree and random forest) under the two data treatments is assessed and compared. The models can be used to predict crash severity, and the results may help road agencies identify and select safety countermeasures that contribute to lower crash severity.
(1) Background
Every year, more than one million people die because of road crashes. In recent years, developed countries have been able to reduce crash numbers, but low- and middle-income countries are struggling to control the upsurge. Approximately 92% of total road crash fatalities occur in low- and middle-income countries, even though their share of registered vehicles is quite low. Road crashes also affect a nation’s economy: statistics show that losses due to road crashes, such as fatalities, injuries, and property damage, amount to around 3% of gross domestic product1). These losses, when combined with intangible losses like emotional pain and suffering, make road crashes one of the greatest challenges facing humankind. Because of the urgent need to mitigate the occurrence and severe outcomes of road crashes, the Sustainable Development Goals (SDGs) Target 3.6 was established, with the goal of halving road crashes by the end of 2030.
Road crashes occur due to the influence of the interaction between three factors: humans, vehicles, and the environment. In a study conducted in Indiana, USA, it was found that 93% of crashes involve human factors, 13% of crashes are related to vehicle factors, and 34% of crashes are related to roadway conditions2). The interaction between these factors is very complex because human behavior is difficult to predict as it differs from person to person, and the road environment has diverse aspects to consider. This wide variation in the characteristics of these factors governing the crash occurrence and its severity makes prediction difficult and complex. Prediction of crash occurrence and severity comes under one of the important principles of the Safe System approach, which provides a guiding framework for making places safer for people’s mobility. It focuses on the proactive approach of identifying and mitigating the risks in the transportation system, rather than reacting after the occurrence of a crash3). Prediction of the crash severities can provide road management authorities information about probable crashes that may occur in the future so they can plan accordingly to mitigate the severe outcomes by identifying and selecting appropriate interventions.
Although it is always desirable to reduce road crash frequency, the reduction of severe crashes is even more significant for road safety practitioners. This is because the ultimate goal is to ensure that no person dies or is seriously injured while using the road. Therefore, predicting crash severity has long been a field of interest for researchers and safety practitioners. Plenty of research has been conducted using probabilistic and deterministic methods for crash severity prediction. Artificial intelligence and machine learning techniques have become popular because of their ability to achieve more accurate results, and many researchers have considered these techniques for crash severity prediction. A study performed by Malik et al. showed that, among the machine learning algorithms used for predicting road crash severity, Random Forest, Decision Tree, and Bagging techniques provided significantly better accuracy than the others4). Similarly, Zhang et al. predicted crash severity using statistical and machine learning methods and found that machine learning had better prediction accuracy and that, among the four machine learning techniques used (Random Forest, Decision Tree, Support Vector Machine, and K-Nearest Neighbor), the Random Forest model showed the best accuracy5).
To build a robust model with high accuracy, not only the analytical tools but also the features (independent variables) must be carefully selected. A review of the existing literature found that features related to humans, the road environment, and the crash itself were the most used in prediction models. For example, a crash severity analysis performed in Bangladesh showed that the Random Forest model using a label encoder with a chi-square and two-way ANOVA feature selection process provided the best prediction accuracy, and the significant variables related to crash severity included driver characteristics (gender, license type, seat belts), vehicle characteristics (vehicle type), road characteristics (road surface type, road classification), environmental conditions (day and time the crash occurred), and injury localization6). Chen and Jovanis7) studied the factors contributing to driver-injury severity in road crashes and found that crashes involving buses, large trucks, or tractor-trailers increased the risk of severe injury, and that crashes occurring late at night or in the early morning led to severe injury when drivers were involved in rear-end collisions. Furthermore, a literature review by Santos et al.8) of 56 studies on road crash severity prediction published between 2001 and 2021, covering more than 20 different statistical and machine learning techniques, found that the random forest algorithm was the most widely used technique (29% of all studies) and achieved the best performance among the examined algorithms in 70% of the situations in which it was used. Similarly, decision trees were used in 14% of all studies and showed the best performance 31% of the times they were used. The results of this comprehensive review suggest that tree-based machine learning algorithms are among the most popular and best-performing techniques for predicting crash severity.
However, no literature was found that considered the number of vehicles involved as one of the independent variables to determine the severity of a road crash, as most of the studies only considered the type of vehicles involved. Also, the preceding studies used statistical approaches to examine the interrelationship with crash severities. Thus, there is a gap in the road crash knowledge regarding the effect of the number of vehicles involved in the crash, especially when modeled using more advanced analytical techniques.
(2) Research objectives
The primary objective of this study is to develop crash severity prediction models for a semi-urban mountainous road using tree-based machine learning techniques. Two data treatment approaches were adopted for developing the models: a) considering only the type of vehicle involved in a crash, and b) considering both the type and number of vehicles involved in a crash. The two datasets were analyzed using the decision tree and random forest algorithms to predict crash severity. The four modeling approaches (two data treatments and two machine learning techniques) were compared in terms of prediction accuracy, and the variable importance of each model was explored to identify the significant variables that determine crash severity. The results indicate how including the number of vehicles involved in a crash affects the prediction of crash severity. In addition, the results provide evidence about which tree-based machine learning model shows better predictive performance for a problem of this type.
(1) Data collection
The Dhulikhel-Sindhuli-Bardibas highway (H13), a 160 km long mountainous, semi-urban highway in Nepal, was selected for this study. This highway has very high socio-economic importance as it connects the capital, Kathmandu, with the plain region of the country. The highway passes through mountainous, rolling, and steep terrain and follows hilly and river routes. The road sections in hilly areas are narrow and have steep gradients with numerous curves, whereas the road sections following river routes are relatively straight with flatter gradients. Considering the geographical terrain and traffic volume, a design speed of between 20 and 40 km/h was adopted while designing various elements of the highway. In a spot speed study performed at 60 different spots on the highway, the average speed was calculated to be 39.35 km/h and the 85th percentile speed was found to be 50 km/h9).
Road crash data from the years 2015 to 2022 were obtained from the Nepal traffic police. In total, 722 road crashes were recorded, with 193 people losing their lives and 895 people seriously injured while traveling on this highway. In Nepal, the traffic police define a crash as fatal if a person dies within 30 days of the incident, and as a serious injury crash if a person either has a major bodily injury or is unconscious after the crash10). The crashes were classified into four severity categories: fatal, major injury, minor injury, and property damage only. It was found that 78% (566) of the total crashes led to either fatal or serious injury. This shows that the probability of a severe outcome is very high if a road crash occurs. The crash data also included information about the time of the crash, the number and type of vehicles involved, and the type of collision.
As previously discussed, the road environment also plays a vital role in crash occurrences and hence it was deemed necessary to consider information about the road conditions where the crashes occurred. In this study, the road characteristics of road sections 250 meters in length were used instead of the characteristics of the exact crash spot. This is because, in some cases, it is difficult to locate the exact crash location during the investigation by the traffic police, and therefore approximate chainage of the roadway is used. Furthermore, for any crash to occur, usually the characteristics of the entire roadway environment come into effect rather than only the point of collision. Thus, the entire highway was divided into sections of 250 meters and the crashes were assigned to each section based on their location, and henceforth the characteristics of these sections were used for the predictive modeling.
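The assignment of crashes to 250 m sections can be sketched as follows. This is a minimal illustration in Python (the study itself used R), assuming that each crash is recorded with an approximate chainage in metres measured from the start of the highway. The function name and the sample chainages are hypothetical and are not taken from the study data.

```python
SECTION_LENGTH_M = 250  # section length used in the study


def section_index(chainage_m: float, section_length_m: int = SECTION_LENGTH_M) -> int:
    """Return the zero-based index of the section that contains a chainage."""
    if chainage_m < 0:
        raise ValueError("chainage must be non-negative")
    return int(chainage_m // section_length_m)


# Hypothetical crash chainages in metres (illustrative values only).
crash_chainages = [120.0, 250.0, 1310.5]
sections = [section_index(c) for c in crash_chainages]
print(sections)
```

Under this convention a chainage that falls exactly on a boundary (for example 250 m) is assigned to the section that starts there; the study does not state how boundary cases were handled.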
The data on the road geometry and features were extracted from the design and as-built drawings of the highway and included information about carriageway width, shoulder type and width, length and radius of curves, and longitudinal gradient. The pavement condition of the highway segments was examined using the International Roughness Index (IRI). The data on IRI was obtained from the annual survey conducted by the Department of Roads, Nepal, and the project office for the construction, operation, and maintenance of the study highway. In addition, data on the existing road features were collected by field survey in July 2022 and include data on the presence and condition of various safety measures like crash barriers, signposts, road markings, delineation measures, pedestrian crossings, sidewalks, and street lightings. In addition, information about the surrounding environment, like the number of accesses in each segment and the level of ribbon development along the highway segments, was also gathered during the field survey.
(2) Data treatment
All the data collected were cleaned and processed to produce either numeric or categorical data. The data for the existing safety measures consisted of subjective scores provided by safety inspectors, which were judged with reference to prevailing departmental and international guidelines during the field survey. These scores were provided in three levels (good, fair, and poor) based on the condition at the time of inspection. "Good" was assigned if a particular feature in a segment was in good or desirable condition and adhered to the safety requirements. Similarly, "fair" and "poor" were assigned if the road feature was found to be in decent or bad condition, respectively. For example, if the edge markings were available throughout the section, were visible, and were provided as per national standards, then the condition was rated "good". If the edge markings were available on only one side of the carriageway, or were partially erased, then the condition was rated "fair," and if the edge markings were completely absent, it was rated "poor."
Although the crash data were available for eight years, only the data for 2021 and 2022 were used because the condition of most of the features was unknown before 2021. For example, pavement condition and the condition of existing safety features change over time. The literature suggests that such features are important for predicting crash severity, and crash severity prediction without these features may make the results less reliable. The two-year crash database comprised 220 crashes in total, of which 25 were fatal crashes, 114 major or serious injury crashes, 36 minor injury crashes, and 2 property damage-only crashes. Because of the extremely low number of property damage-only crashes, they were removed from the database and the crash severity prediction was performed using only the other three severity levels.
The crash database showed the involvement of seventeen different types of vehicles, and using all these vehicle types in the analysis would add more randomness to the data, which may result in a model with poor accuracy. Therefore, to address this issue, the vehicles involved in the crashes were categorized into five categories based on the size and use of the vehicles. Table 1 summarizes the types of vehicles grouped into five categories.
One of the objectives of this study was to assess and compare the results of crash severity prediction with and without considering the number of vehicles involved in the crashes. Hence, two separate datasets were prepared for these two approaches. The dataset considering only the vehicle category was prepared by coding the vehicles using the letters of the vehicle group category names. For instance, if a single car is involved in a crash, the vehicle type is coded as "C," and if a motorbike and a car are involved, it is coded as "AC." If three vehicles were involved, all three letters were put together to code the crash. On the other hand, the dataset considering both vehicle category and number was prepared by treating each vehicle category as a separate variable, whose value was the number of vehicles of that category involved in each crash.
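The two data treatments can be sketched as follows. This is an illustrative Python sketch rather than the authors' R code. It assumes that the five vehicle groups are labelled "A" to "E" and that, for treatment (a), one letter is kept per vehicle and the letters are sorted alphabetically; both the function names and the sorting rule are assumptions made for illustration.

```python
from collections import Counter

# Assumed labels for the five vehicle groups.
CATEGORIES = ["A", "B", "C", "D", "E"]


def encode_type_only(vehicles):
    """Treatment (a): join the category letters into one code, e.g. 'AC'."""
    return "".join(sorted(vehicles))


def encode_type_and_number(vehicles):
    """Treatment (b): one count variable per vehicle category."""
    counts = Counter(vehicles)
    return {cat: counts.get(cat, 0) for cat in CATEGORIES}


# A motorbike (A) and a car (C), as in the example given in the text.
crash = ["C", "A"]
print(encode_type_only(crash))        # AC
print(encode_type_and_number(crash))  # {'A': 1, 'B': 0, 'C': 1, 'D': 0, 'E': 0}
```

Treatment (a) yields a single categorical variable with many possible levels, whereas treatment (b) yields five numeric variables, which lets a tree split directly on, for example, the number of Category E vehicles.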
Crash time was transformed into four categories: a) midnight to early morning (0:00 to 6:00), b) early morning to noon (6:00 to 12:00), c) noon to evening (12:00 to 18:00), and d) evening to midnight (18:00 to 00:00). The preliminary analysis of the crash data identified nine different types of crashes, and this variable was treated as a categorical variable by using codes. Descriptive statistics for the continuous and categorical variables are shown in Table 2 and Table 3, respectively.
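The crash-time categorization can be sketched as a simple binning function. This is a hypothetical Python illustration. The text does not say to which category a boundary time such as 6:00 belongs, so the sketch assumes half-open intervals in which each category includes its starting hour.

```python
def time_category(hour: int) -> str:
    """Map an hour of day (0-23) to one of the four crash-time categories."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if hour < 6:
        return "midnight to early morning"
    if hour < 12:
        return "early morning to noon"
    if hour < 18:
        return "noon to evening"
    return "evening to midnight"


# Illustrative hours only (not taken from the crash records).
for h in (2, 6, 17, 23):
    print(h, time_category(h))
```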
(3) Analytical approach
In this analysis, 80% of the data was used for training and the remaining 20% was used for testing the accuracy of the model. The training data is used to train the model and the testing data to check the accuracy of the built model. In other words, the algorithms learn the patterns in the data and build a predictive model from the training dataset, whose accuracy is then assessed using the test data.
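An 80/20 split of this kind can be sketched as follows. The sketch is in Python and uses only the standard library; the study does not describe the exact splitting procedure (for example, whether it was stratified by severity), so the simple random shuffle, the seed, and the rounding rule here are assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly split rows into a training list and a test list."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


# Placeholder rows standing in for the 218 crash observations.
rows = list(range(218))
train, test = train_test_split(rows)
print(len(train), len(test))
```

With a dataset this small, the exact test-set size depends on the rounding rule, and a different seed can change which of the few fatal crashes land in the test set.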
The two prepared datasets (considering and not considering the number of involved vehicles) were analyzed using decision trees (DT) and random forest (RF) models to predict the crash severity and to explore the significant variables contributing to the prediction. Decision tree analysis is a non-parametric supervised machine learning technique used for classification and regression. It has a hierarchical tree structure which consists of a root node, branches, decision (internal) nodes, and leaf nodes. Decision trees use the concept of recursive partitioning where a node is split into two or more sub-nodes based on splitting criteria like information gain and Gini impurity, and the splitting continues until the stopping criteria are met11). The results from decision tree analysis are easy to interpret and, since it can process different types of data together, data preparation is much simpler than that for other machine learning algorithms. However, the major drawback of decision tree analysis is that it is prone to overfitting, and even a small change in the data can provide different results.
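The splitting criteria mentioned above can be illustrated with a short Python sketch. It computes the Gini impurity and the entropy of a set of class labels and the impurity reduction achieved by a binary split (the information gain when entropy is used). The labels in the example are made up for illustration, and the sketch assumes that every node passed in is non-empty.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(parent, left, right, impurity=gini):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Made-up node with ten crashes and one candidate split.
parent = ["fatal"] * 2 + ["major"] * 6 + ["minor"] * 2
left = ["fatal"] * 2 + ["major"] * 1
right = ["major"] * 5 + ["minor"] * 2
print(split_gain(parent, left, right))           # Gini decrease
print(split_gain(parent, left, right, entropy))  # information gain
```

A tree-growing algorithm evaluates such a gain for every candidate split and keeps the one with the largest value, then repeats the procedure on each child node.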
Random forest is another type of tree-based machine learning technique that draws conclusions based on the results of many decision trees built from subsets of the given dataset. It is an ensemble model that takes the prediction from each tree and makes a final prediction based on the majority vote of those predictions (or their average, in the case of regression). The major strengths of random forest include improved prediction accuracy, a reduced chance of overfitting, and the ability to process large, high-dimensional datasets12). However, the results are difficult to interpret, which is the major weakness of this analytical technique. To optimize the performance of the random forest model, it is necessary to set certain parameters of the model before training in a process called "hyperparameter tuning." Among the parameters that can be set before training, the most important ones include the number of decision trees in the forest and the number of features considered when splitting a node. Grid search with cross-validation was used to find the optimum number of trees and features.
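The voting step of a random forest classifier can be sketched in a few lines of Python. The per-tree predictions below are invented for illustration; a real forest would obtain them from trees fitted on bootstrap samples of the training data. The tie-breaking rule (the class seen first wins) is an assumption of this sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    # most_common orders equal counts by first appearance, so ties go to
    # the class that was encountered first.
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of five trees for one crash.
votes = ["major", "minor", "major", "fatal", "major"]
print(majority_vote(votes))  # major
```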
As the parent dataset in this study is relatively small (218 observations), assessing the models’ predictive strength using only the accuracy from the test (validation) set can give unreliable results. This is because, when the dataset is split into training and validation sets, informative observations may end up in only one of the two sets. Thus, for the random forest models, the out-of-bag (OOB) score was also considered for model evaluation. The OOB score is computed as the proportion of out-of-bag samples (observations not used to train a given tree) that are correctly predicted. In the random forest algorithm, two conventional measures are used to analyze variable importance: a) mean decrease in accuracy, and b) mean decrease in node impurity. The former measure uses the permuted OOB data. For each tree, the classification error rate on the out-of-bag portion of the data is recorded, and the same is done after permuting each predictor variable. The difference between the two is then averaged over all the trees and normalized by the standard deviation of the differences. The latter measure is based on the total decrease in node impurities (Gini index) from splitting on the variable, averaged over all trees13). The open-source statistical analysis tool R (https://cran.r-project.org/) was used for analyzing the data. The R packages "rpart"14) and "randomForest"15) were used for the decision tree and the random forest analyses, respectively.
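The idea behind the OOB score can be sketched as follows. This Python sketch is not the implementation used by the "randomForest" package; it assumes that, for every observation, the predictions of the trees that did not see that observation during training have already been collected. The labels and votes are invented for illustration.

```python
from collections import Counter


def oob_score(true_labels, oob_predictions):
    """Share of observations whose majority out-of-bag vote matches the truth."""
    correct = 0
    scored = 0
    for truth, votes in zip(true_labels, oob_predictions):
        if not votes:  # in-bag for every tree, so it cannot be scored
            continue
        predicted = Counter(votes).most_common(1)[0][0]
        scored += 1
        correct += predicted == truth
    return correct / scored if scored else float("nan")


# Hypothetical data: each inner list holds the votes of the trees for which
# the observation was out-of-bag.
truth = ["major", "fatal", "minor", "major"]
oob_votes = [["major", "major", "minor"], ["major", "major"], ["minor"], []]
print(oob_score(truth, oob_votes))
```

Because every observation is scored only by trees that never saw it, the OOB score acts as an internal validation estimate that does not require holding out a separate test set.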
(1) Decision tree models
a) Dataset considering only vehicle type
Fig.1 shows the decision tree for the dataset that considers only the vehicle category. It can be seen that the first feature that splits the root node is the type of vehicle category involved in the road crash. The partitioned nodes are further split based on features like crash type and carriageway width. This implies that these features which are involved in the early splitting of nodes are more significant in predicting the severity of crashes. The top five important variables involved in predicting (splitting the nodes) are listed in Table 4.
The trained model classifies 8% of the observations as fatal crashes, 73% as major injuries, and the remaining 19% as minor injuries. In Fig.1, only the leaf on the extreme right contains observations of a single class (minor injury) and is an example of a pure node. All other leaves contain all three classes and are said to be impure nodes. The leaf showing fatal as an outcome contains 8% of the total observations. The combination of rules that leads to this leaf predicts fatality as the outcome, as most observations that fall into this category (set of rules) are fatalities. However, the same leaf also contains observations from the other classes (major and minor injuries). This shows that each impure leaf has errors, and such errors in each leaf contribute to the overall prediction error of the model.
The accuracy of the overall model can be clearly interpreted using a confusion matrix, which is developed using the observations from the test set. The matrix shows the frequency of predicted classes to observed classes and is used to determine the performance of classification models. From the confusion matrix in Table 4, out of six observed fatal crashes, the model predicts it as fatal only one time and as a major injury five times. This indicates that the model underpredicts fatalities, which is a serious issue from the practical viewpoint of road safety management. Similarly, out of 28 observed crashes with major injuries, the model correctly predicts it 17 times and underpredicts it as a minor injury 10 times, which again is a concern from the perspective of road safety. It also predicts major injury as a fatal injury once and predicts minor injury as a major injury eight times. However, these overpredicted results are less of a concern because they may lead to extra precautions and interventions, which can ultimately make roads safer and more forgiving.
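The underprediction described above can be quantified as the per-class recall, that is, the share of observed crashes of a class that the model assigns to that class. The Python sketch below uses only the counts stated in the text for observed fatal and major injury crashes; the row for observed minor injuries is omitted because the text does not report it in full. The variable and function names are illustrative.

```python
# Columns follow the order of the predicted classes.
CLASSES = ["fatal", "major", "minor"]

# Rows are observed classes; values are the predicted counts reported in the
# text for the decision tree that considers only the vehicle type.
observed_rows = {
    "fatal": [1, 5, 0],    # 6 observed fatal crashes
    "major": [1, 17, 10],  # 28 observed major injury crashes
}


def recall(observed_class):
    """Correct predictions for a class divided by its observed count."""
    row = observed_rows[observed_class]
    return row[CLASSES.index(observed_class)] / sum(row)


for cls in observed_rows:
    print(cls, round(recall(cls), 3))
```

On these counts the recall is about 0.17 for fatal crashes and about 0.61 for major injury crashes, which makes the weakness on the fatal class explicit even where the overall accuracy looks acceptable.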
b) Dataset considering vehicle type and number
The decision tree considering both the vehicle category and number of vehicles involved is shown in Fig.2. In this model, ribbon development is the first feature that splits the root node and, furthermore, the branch with no ribbon development leads to a leaf that contains 59% of the entire observations. In addition, the decision tree shows that if there is ribbon development with wider roads, and if crash types such as vehicle hitting a pedestrian or any obstacles, head-on, and rear-end collision occur, and if there is involvement of one heavy (category E) vehicle, then the severity is likely fatal.
The most significant variables, prediction accuracy, and confusion matrix for this model are summarized in Table 5. As before, fatal crashes are underpredicted, and minor injuries are overpredicted. However, the major injury was correctly predicted 23 times out of 28 observations (82% accuracy). Using this approach of data treatment (including the number of vehicles involved in a crash) improved the overall accuracy of the model to 60.5%, but major concerns regarding the underprediction of fatal crashes remain.
(2) Random forest models
a) Dataset considering only vehicle category
For the dataset considering only vehicle category, the results from the grid search with cross-validation for hyperparameter tuning showed that increasing the number of trees beyond 95 did not further improve accuracy. However, 1000 trees were used in the final analysis because a larger number of trees does not cause the model to overfit. It was also found that using 13 features for splitting a decision tree node provided better model accuracy.
The most significant features in the model, the OOB score, the test accuracy with a 95% confidence interval, and the confusion matrix, are shown in Table 6. Ribbon development, IRI, edge marking, and carriageway width were among the top important road features along with vehicle category. The OOB score shows that the trained model has an accuracy of 63.7%, which is even better than the test set accuracy of 61.9%. This accuracy has significantly improved compared to the decision tree model with an accuracy of 44.2%. However, the issue of underpredicting fatal crashes remains, as not a single fatal crash out of the six observations was predicted - instead, they were predicted as either major or minor injuries. The accuracy for predicting major injuries is very good (89%) but, again, most minor crashes have been overpredicted as major crashes (7 out of 9). As discussed before, overprediction can be acceptable from a safety perspective, but underprediction of fatal crashes is a serious challenge to be solved.
b) Dataset considering vehicle type and number
The hyperparameter tuning for the random forest model considering both the vehicle category and the number of vehicles showed that the best prediction accuracy was obtained with 107 trees and 10 features available for splitting the nodes. However, as in the previous random forest model, 1000 trees were used to build the model. From Table 7, the most significant variables are almost identical to those of the previous random forest model. Ribbon development, carriageway width, and edge markings are the common features in both models that govern the prediction of crash severity. In addition, the number of heavy vehicles ("Category E") was also found to be an important predictor.
The OOB score shows that the trained model has an accuracy of 65.5%, which is better than that of the previous random forest model. However, the model has an overall validation set accuracy of 59.5%, which is slightly lower than that of the decision tree model (60.5%) that considers both the number and category of vehicles. The OOB score and the test set accuracy are two different metrics because the procedures and data involved in their calculation are different. Although the OOB score may not be a reliable estimate of the generalization error of the model, it can provide useful information about model performance. The confusion matrix for this model is also very similar to that of the previous random forest model, and the most concerning result is again the underprediction of fatal crashes.
(3) Overall findings
All four models show that ribbon development is a very significant factor for predicting crash severity. A lower level of ribbon development can lead to decreased conflict between vehicles and humans on the roads, and drivers may then tend to overspeed, which can result in serious injury and fatal crashes. Other important road features include carriageway width and edge marking. Among the crash-related features, crash type was found to be important in both decision tree models, but not in the random forest models. Similarly, the number of large vehicles (Category E) was found to be important in the models that considered vehicle numbers, which suggests that the number of vehicles involved in a crash is an important feature for determining crash severity and hence should be included in predictive models. The number of vehicles involved in a crash is beyond the control of the road authorities and thus, from a practical viewpoint, its inclusion may not be relevant for crash severity prediction. However, considering it in the predictive model has shown that the number of vehicles does affect the severity of crashes, and this information may be used in the future for further analysis and managerial decisions. Interestingly, crash time was not found to be a significant predictor in any of the models.
Comparing the overall prediction accuracy of the models when not considering and considering the number of vehicles, the latter type of data treatment showed better prediction accuracy in the case of the decision tree models but showed similar accuracy for the random forest models. When examining the results of the four prediction models, it can be said that the random forest models are more accurate as they are based on results from multiple decision trees. However, even though the decision tree models provide lower overall prediction accuracy, they were able to predict certain crash severity with higher accuracy. Notably, both decision tree models include leaf nodes that lead to fatal crashes, and they both predicted fatal severity at least once, whereas both random forest models were unable to predict fatal severity even a single time. In addition, decision tree models are easy to interpret and visually show how certain features govern the prediction process. However, the choice of technique between these two may depend on the type of data to be analyzed and the objective of classification. For instance, if transparency and interpretation are more essential, then a decision tree would be a proper choice. Similarly, if data is complex and contains a lot of noise, then random forest may be an appropriate choice between the two.
The underprediction of fatal crashes as major or minor injury crashes is mainly due to the imbalanced data, in which most of the observations were major injury crashes (142 out of 218, or 65%). This is a well-known drawback of machine learning classifiers, which tend to have greater predictive power for the majority class and less predictive power for the minority class16). Such imbalanced data can be treated by resampling using various techniques to produce more accurate predictive models.
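One of the simplest resampling techniques is random oversampling, in which observations of the minority classes are duplicated at random until every class is as frequent as the majority class. The Python sketch below illustrates the idea on invented data; it is not a method applied in this study, and the function name, seed, and class counts are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels


# Invented training data: 6 major, 2 fatal and 1 minor injury crash.
labels = ["major"] * 6 + ["fatal"] * 2 + ["minor"]
rows = list(range(len(labels)))
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Resampling of this kind should be applied to the training data only, so that the test set continues to reflect the class distribution observed in practice.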
The goal of this study was to develop crash severity prediction models using two tree-based machine learning algorithms (decision tree and random forest) with two different data treatments (considering only the type of vehicle involved in a crash, and considering both the type and number of vehicles involved in a crash), and then to compare the predictive strength of the four models. Two years of road crash data from a semi-urban mountainous road in Nepal were used for the analysis. Several features related to road geometry, the environment, existing safety features, and crash characteristics such as the time of the crash, the type of collision, and the type and number of vehicles involved were considered in building the predictive models.
Ribbon development, followed by carriageway width and edge markings, were the road-related features found to be significant in most of the models. The type of vehicle was also found to be significant in predicting crash severity, along with the number of large-sized "Category E" vehicles. Crash time was not found to be important in any of the models. For the decision tree algorithm, the data treatment considering both the category and the number of vehicles involved in a crash showed substantially better accuracy than the treatment that did not consider the number of vehicles. For the random forest model, the accuracy was almost the same for both data treatments. Overall, the random forest models showed better predictive accuracy than the decision tree models. The confusion matrices for all four models showed that fatal crashes are underpredicted, whereas major injury crashes are overpredicted. This can be a serious issue because such misleading results can lead the concerned authorities to overlook the criticality of fatal crashes, and as a result road safety conditions can worsen further. The prediction error for each crash severity is associated with the imbalanced data, in which most of the observations were major injury crashes.
The information about road features identified from this study that are found to have a significant role in crash prediction can be an important asset to road safety practitioners because those features can be improved or highly considered while formulating any safety improvement projects. The crash severity models developed using the random forest algorithm showed an accuracy of around 60% which is on par with the accuracy of the models developed by other researchers and therefore can be used by road agencies for predicting crashes in semi-urban mountainous roadways. However, caution should be taken concerning the underprediction of fatal crashes when applying these models to real-world cases.
The use of crash data from a limited period (2 years) is one of the limitations of this study. In addition, information about human characteristics, for example, the age, sex, and other characteristics of drivers, was not available and hence was not included in the models. Including information about the drivers involved in the crashes could help identify other significant features that govern crash severity, and a larger dataset may increase prediction accuracy because the accuracy of machine learning models generally tends to improve as the dataset grows. The prediction models could be further improved by treating and resampling the imbalanced dataset using the statistical and machine learning techniques available. Moreover, it is anticipated that the results from this study can assist other researchers in building crash severity prediction models with higher accuracy and with more significant features.
This research was supported by a scholarship for road asset management from the Japan International Cooperation Agency (JICA). The authors are grateful to the Department of Roads, Nepal, for providing the necessary design and as-built drawings of the study highway.