Characterization of missingness and data-driven imputation for incomplete pavement condition data

Angela ODERA; Michael HENRY; Azam AMIR

doi:10.11532/jsceiiai.5.1_57

Abstract

As with datasets in many fields, pavement management systems suffer from missing data, but machine learning techniques such as random forest analysis make imputation a viable solution. This study applied missForest, one implementation of random forest, to impute missing international roughness index (IRI), structural number (SN), and pavement condition index (PCI) data in the Kenya paved road inventory and condition survey database. The database also contains complete region, road class, carriageway surface type, road usage and visual surface condition rating data. With imputation methods influenced mainly by data distributions, missing mechanisms and correlation between variables and less by other data features such as missing rates, the study examined the distributions of the IRI, SN, and PCI data variables and investigated the missing data mechanism in the subject dataset towards confirming the applicability of missForest for imputation. It was found that the three variables follow highly skewed complex distributions and that the missing data is missing not at random (MNAR). Applying missForest to 19 combinations of impute and predictor variables, it was found that the combination of IRI, SN, and PCI impute variables with visual surface condition rating as the predictor variable gave the most accurate imputation in terms of normalized root mean squared error (NRMSE). A reliability check of variablewise missForest imputation in terms of mean squared error (MSE) revealed that the imputation was accurate for SN and PCI but not for IRI due to an extreme missing data rate of almost 90%. The study highlights that low-cost visual pavement condition survey on an entire road network with measurement of superior condition parameters on a sample of it followed by data-driven imputation sufficiently supports management decisions.

1. INTRODUCTION

(1) Background

Transportation agencies are increasingly embracing data-driven pavement management approaches, which are recognized as more effective than the traditional reliance on expert opinions and user concerns for assessing pavement conditions and making preservation decisions. Of importance is the utilization of pavement management systems (PMS) that incorporate diverse types of road characteristics and condition data as inputs in performance and predictive modelling to determine cost effective maintenance interventions and strategies. However, as is common in practice and in obtaining datasets across many fields, PMS suffer from the problem of missing data. Missing values in one or more variables occur due to a variety of reasons, including human error, machine error, inadequate or incomplete data collection, and differences in recording data by different sources.

Missing data complicates analysis. Simply excluding incomplete observations is not sufficient, as a significant amount of data may be lost, reducing the statistical power of the analysis and introducing bias that leads to inaccurate results and conclusions^1-3). While effort could be put into managing the quality of data at the collection stage, advances in computational technologies, particularly those involving machine learning algorithms that can analyze big data, are making imputation a viable solution to the missing data problem. Implementation of these algorithms, however, requires careful thought to ensure that missing data are appropriately imputed and suitable for subsequent analysis.

(2) Previous studies

Outside the pavement management field, previous research has extensively delved into comparing the performance and determining the accuracy of various imputation methods, and random forest-based methods have stood out as providing some of the best imputational results. In meteorology, Diouf et al⁴⁾ found that missForest, a random forest-based method, outperformed four other methods - k-nearest neighbours (k-NN), time series missing value imputation, multiple imputation by chained equations (MICE) and probabilistic principal components analysis - in terms of root mean square error, correlation coefficient and standard deviation for imputing missing temperature series data. In animal science, You et al¹⁾ tested eight machine learning algorithms, including random forest and missForest, in multivariate imputation of values of four variables based on six other variables of a commercial dairy cow herd performance dataset. They found that random forest outperformed all the other methods in terms of relative root mean square prediction error and concordance correlation coefficient. In hydrology, Umar and Gray²⁾ tested single and multiple imputation methods with univariate and multivariate river water level data. They found that missForest, a single imputation method, gave some of the best results in terms of root mean squared error and the mean absolute percentage error for multivariate imputation. In genomics, Petrazzini et al³⁾ also tested single and multiple imputation methods on univariate and multivariate genome-wide features data. They found that random forest outperformed four other methods, including k-NN and MICE, in terms of root mean squared error for both univariate and multivariate imputation. 
In the field of pavement management, Marcelino⁵⁾ tested three methods - k-NN, MICE and missForest - on a multivariate pavement dataset with climatic, traffic, international roughness index (IRI), pavement thickness and structural number (SN) measures with missing values in all except the traffic and SN variables. It was found that missForest outperformed all the other methods in terms of normalized root mean squared error (NRMSE).

With such promising results, it is sensible that random forest-based methods would find increasing application in the pavement management field. Yet only a few studies have emphasized examining the applicability of imputation methods in relation to different features of datasets, including the all-important missing data mechanisms. Missing data is said to follow one of three mechanisms that are inherently built into the assumptions of imputation methods and influence their performance⁶⁾: missing completely at random (MCAR), i.e. missing values are independent of both observed and unobserved data; missing at random (MAR), i.e. missing values are related to observed but not to unobserved data; and missing not at random (MNAR), i.e. missing values are related to both observed and unobserved data^4,⁷⁾. Of the studies already introduced, Umar and Gray²⁾ found that missForest performs best with MCAR and MAR data but not with MNAR data. However, Misztal⁸⁾ stated that it provides the lowest imputational errors with both MAR and MNAR data. Hong and Lynn⁹⁾ pointed out that missForest does not perform well with highly skewed variables in non-linear models, but this differs from Stekhoven and Buhlmann’s¹⁰⁾ finding that it can handle complex interactions and non-linear relations. Stekhoven and Buhlmann also highlighted its nonparametric nature, which makes it independent of data belonging to a particular finite distribution. While showing different performances in different data contexts, missForest can overall be deemed an efficient imputational method, as it has been shown to give the lowest imputational errors in many studies. Ge et al¹¹⁾ also determined that missing mechanisms, value distributions and the correlation between variables were the main factors affecting imputation methods, more so than sample sizes, missing rates and the number of missing variables.
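The three mechanisms defined above can be illustrated with a small simulation. This is only a sketch: the function name, the probabilities, the threshold of 6.0 and the use of a "poor" rating as the trigger are all hypothetical and are not taken from the study dataset.

```python
import random

def simulate_missingness(values, rating, p_low=0.1, p_high=0.6,
                         threshold=6.0, seed=0):
    """Return MCAR, MAR and MNAR masks (True = missing) for one variable.

    All probabilities, the threshold and the "poor" trigger are hypothetical.
    """
    rng = random.Random(seed)
    # MCAR: the chance of being missing is the same for every row.
    mcar = [rng.random() < p_low for _ in values]
    # MAR: the chance depends only on a fully observed variable (the rating).
    mar = [rng.random() < (p_high if r == "poor" else p_low) for r in rating]
    # MNAR: the chance depends on the observed rating and on the value itself,
    # which is unobserved once it goes missing.
    mnar = [
        rng.random() < (p_high if (r == "poor" and v > threshold) else p_low)
        for v, r in zip(values, rating)
    ]
    return mcar, mar, mnar

# Hypothetical roughness values and ratings; extreme probabilities make the
# masks deterministic so the dependence structure is easy to see.
iri = [2.1, 7.4, 3.0, 4.8, 6.5]
rating = ["good", "poor", "fair", "poor", "fair"]
mcar, mar, mnar = simulate_missingness(iri, rating, p_low=0.0, p_high=1.0)
```

With these extreme settings the MAR mask follows the rating alone, while the MNAR mask additionally requires the value itself to exceed the threshold.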

(3) Research gap and study objectives

While previous studies have been insightful for understanding the theories of data missingness and the considerations in choosing imputation methods, there is still a need to provide guidance on an implementation framework that accounts for all relevant aspects of imputation towards high accuracy. This study seeks to present such a framework, which would be useful particularly for practitioners such as road network managers, whose concern is always the practicality of research findings. The framework and its application to actual road data in this study should also contribute to clarifying the proper use of machine learning algorithms in data imputation.

Hence in this study, the distributions of variables with missing data are examined and the missing data mechanism is investigated for a multivariate dataset containing pavement inventory and condition data. While it has been opined previously that it is impossible to distinguish MAR and MNAR data^12-14), a means of identifying if missing values are significantly associated with unobserved multivariate data is proposed. Based on the findings of the distributions and missing data mechanism analyses, the validity of the missForest technique is confirmed before applying it to imputing missing values in the dataset, taking into account that different variable combination scenarios would impact imputation performance in the case of MAR and MNAR data. Particularly for developing countries that are still struggling with implementing accurate methodologies for collecting and storing PMS data, it is also hoped that this approach will provide respite for realizing comprehensive data, much needed for performance evaluation and modelling pavement management strategies.

2. METHODOLOGY

(1) Study area and data acquisition

The data used in this study was obtained from the Kenya Roads Board road inventory and condition survey (RICS) repository and represents the road characteristics and condition parameters of the entire Kenya road network as of 2018 (Fig.1). The country road network is classified into Classes S, A and B Roads (rural highways and urban arterials), Classes C and D Roads (rural and urban collectors) and Classes E, F and G Roads (local roads). The total road network has a length of 246,757 km, of which 17,652 km (about 7%) were paved and 229,105 km (about 93%) were unpaved as of 2018¹⁵⁾. Classes S, A, B and C roads are managed at the national level by three road authorities, the Kenya National Highways Authority (KeNHA), the Kenya Rural Roads Authority (KeRRA) and the Kenya Urban Roads Authority (KURA). The Kenya Wildlife Service also manages roads located within national parks. Classes D, E, F and G roads are on the other hand managed by 47 county governments, which are semi-autonomous decentralized units of governance. The standards for construction and maintenance are however set by the Ministry of Roads and Transport while the Kenya Roads Board manages the road maintenance fund and is the custodian of the country RICS database.

The paved road network dataset was extracted for the analysis considering that it was targeted for complete measurement of condition parameters in the RICS 2018 compared to the unpaved road network, meaning that missing values in any of its variables is due to reasons other than the decision not to measure at all. After screening to account for apparent outliers and duplicate values in various variables, the resulting dataset consisted of 52,192 rows of road sections of various maintenance lengths which represent actual homogenous maintenance units defined by the management authorities in the course of their maintenance regimes. The total length of the cleaned data is 17,300 km. The variables included in the analysis dataset are road class, region, carriageway surface type, road usage (representing traffic level on a 4-point scale of "rare," "used," "busy," and "very busy"), visual surface condition rating (on a 3-point scale of "good," "fair," and "poor"), IRI, SN and pavement condition index (PCI), with the IRI, SN and PCI columns containing missing values. These missing values were the target of imputation in this study. A sample of the dataset is shown in Table 1 while Table 2 shows the percentages of missing IRI, SN and PCI data.

The missing data could be attributed to a variety of reasons. One is that PCI is based on visual pavement distress data collection, which is relatively easy for trained inspectors to understand and accurately record. This may explain the lower percentage of missing PCI data compared to missing IRI and SN data. The missing IRI and SN data could be due to shortage of personnel who were able to adequately operate the specialized equipment used including road laser profiler and falling weight deflectometer and accurately record required values. It may further have been due to equipment malfunction or breakdown during some days of measurement. Overall, logistical challenges including poor roads, insecurity in some regions of the country, and challenges in data storage and management leading to data loss or omissions, could also explain the missing data.

The study recognizes that while the condition of the Kenya paved road network was completely assessed in terms of the visual surface condition rating in RICS 2018, a richer evaluation would be realized from additionally using the IRI, SN and PCI parameters. The visual surface condition rating methodology is used in Kenya as a quick low-cost means of prioritizing pavement condition and provides a qualitative insight into the functional value of the pavement. It relies on trained inspectors to identify the percentage of potholes and cracks along a road section and the corresponding required maintenance interventions, and to subsequently rate the pavement condition as detailed in Table 3¹⁶⁾. Nonetheless, PCI is calculated from a visual survey of a more diverse set of pavement distresses and therefore provides a detailed assessment of both functional and structural aspects of the pavement. IRI also provides insight into functional pavement value in terms of the ride quality and comfort experienced by road users. SN on the other hand is an abstract value that measures the ability of the pavement to withstand anticipated axle loads. For each layer of the pavement, a characteristic value is determined by multiplying a layer coefficient representing the relative strength of its material, the layer thickness and a drainage coefficient representing the layer's loss of strength due to drainage characteristics and exposure to moisture saturation. The resulting layer characteristic values are then summed to obtain the SN¹⁷⁾. SN therefore evaluates the load carrying capacity and strength of pavement layers, giving a robust understanding of the structural integrity and stability of the pavement structure.
Hence imputing the missing IRI, SN and PCI values to realize a complete dataset would facilitate more comprehensive downstream analysis to well understand performance of the paved road network in Kenya and aid more informed decisions on maintenance and rehabilitation, this further motivating the study.
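The SN calculation described above (layer coefficient times thickness times drainage coefficient, summed over layers) can be sketched as follows. The three-layer pavement, its coefficients and the use of inches for thickness are hypothetical illustration values, not data from the RICS database.

```python
def structural_number(layers):
    """Sum of layer coefficient x thickness x drainage coefficient over layers."""
    return sum(coefficient * thickness * drainage
               for coefficient, thickness, drainage in layers)

# Hypothetical pavement: (layer coefficient, thickness in inches, drainage coefficient)
example_layers = [
    (0.44, 4.0, 1.0),    # surface course
    (0.14, 8.0, 1.0),    # base course
    (0.11, 10.0, 0.9),   # subbase, with reduced strength due to poorer drainage
]
sn = structural_number(example_layers)  # 1.76 + 1.12 + 0.99
```

For this assumed pavement the three layer values are 1.76, 1.12 and 0.99, giving an SN of about 3.87.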

(2) Examination of data distributions

The distributions of the observed IRI, SN and PCI data across road characteristics and visual surface condition ratings are presented in Figs. 2-4. From Fig. 2, it can be seen that observed IRIs exhibit the most variability across different road classes and are least variable across the visual surface condition ratings. IRIs also exhibit less spread from medians across asphalt and surface dressing surface types as well as busy and used traffic levels. Skewness of the data can be seen across several categories of the road characteristics with outliers also evident across some features. From Fig. 3, observed SNs exhibit significant variability across different road classes and regions. They are however less spread from the medians across asphalt, concrete and surface dressing surface types as well as busy, rare and used traffic levels. SNs are least variable across the visual surface condition ratings. Skewness of the data can be seen across several categories of the road characteristics with outliers also evident across some features. From Fig. 4, observed PCIs exhibit significant variability across all the different categories of road characteristics and visual surface condition ratings. Skewness of the data can be seen across several categories of the road characteristics and visual surface condition ratings, though there are few outliers. Based on the above observations of spatial variability, proportionate stratified sampling of the data was carried out in order to improve the accuracy of estimation of the population distributions of IRI, SN and PCI.
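Proportionate stratified sampling, as mentioned above, draws the same fraction of rows from every stratum so that the sample keeps the population's composition. The sketch below is an assumption about how this could be done; the stratum key, the sampling fraction and the toy rows are hypothetical, and the study's actual strata are not specified here.

```python
import random
from collections import defaultdict

def proportionate_stratified_sample(rows, stratum_key, fraction, seed=0):
    """Draw the same fraction of rows from each stratum (at least one row each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[stratum_key]].append(row)
    sample = []
    for members in strata.values():
        size = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, size))
    return sample

# Hypothetical road sections: 60 in class "A" and 40 in class "C".
rows = [{"road_class": "A", "id": i} for i in range(60)]
rows += [{"road_class": "C", "id": i} for i in range(60, 100)]
sample = proportionate_stratified_sample(rows, "road_class", fraction=0.25)
class_a = sum(1 for row in sample if row["road_class"] == "A")
class_c = sum(1 for row in sample if row["road_class"] == "C")
```

The 60:40 split of the population is preserved in the sample (15 and 10 rows respectively).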

The IRI, SN, and PCI data were examined for conformity to five common probability distributions that suit engineering problems: normal, exponential, gamma, log-normal, and Weibull. This was done to confirm the alignment of the data distributions with the choice of missForest for imputation. Each of the five distributions was fitted to the observed part of each of the IRI, SN and PCI columns, and kernel density estimation plots were generated. The goodness of fit was assessed using the Kolmogorov-Smirnov (KS) test and the Akaike Information Criterion (AIC). A resulting p-value ≤ 0.05 from the KS test led to a rejection of the null hypothesis and the conclusion that the observed data do not follow the specified theoretical distribution. Where the KS test did not determine the best fitting distribution, AIC was used to compare the probability distributions for the observed data. The distribution with the lowest AIC value was considered to be the most favourable, indicating the best balance between goodness of fit and complexity of the fitted distribution.
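The two goodness-of-fit measures can be sketched in a few lines. The example below is a simplified illustration that fits only the normal and exponential distributions by maximum likelihood and computes the KS statistic without its p-value; the sample values are hypothetical and the function names are not from the study.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def normal_loglik(x):
    """Log-likelihood of a normal distribution at its maximum likelihood fit."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def exponential_loglik(x):
    """Log-likelihood of an exponential distribution at its maximum likelihood fit."""
    n = len(x)
    rate = n / sum(x)
    return n * math.log(rate) - rate * sum(x)

def ks_statistic(x, cdf):
    """Largest gap between the empirical CDF and a theoretical CDF."""
    xs = sorted(x)
    n = len(xs)
    return max(max((i + 1) / n - cdf(v), cdf(v) - i / n)
               for i, v in enumerate(xs))

sample = [1.8, 2.1, 2.4, 2.9, 3.3, 4.0, 5.2, 7.5, 11.0]  # hypothetical right-skewed values
aic_normal = aic(normal_loglik(sample), n_params=2)
aic_exponential = aic(exponential_loglik(sample), n_params=1)
rate = len(sample) / sum(sample)
d_exponential = ks_statistic(sample, lambda v: 1.0 - math.exp(-rate * v))
```

The candidate with the lower AIC would be preferred, while a large KS statistic (small p-value) would indicate that even the preferred candidate does not describe the data well.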

The kernel density estimation plots together with the parameters of the fitted distributions and the results of the goodness of fit tests for the variables IRI, SN and PCI are presented in Figs.5-7 and Tables 4-9. Fig.5 indicates that the observed IRI data are skewed to the right. The positions of the estimated parameters of the fitted distributions, as in Table 4, are represented by the dashed vertical lines overlaying the plot (red: normal distribution; green: exponential distribution; blue: gamma distribution; orange: log-normal distribution and black: Weibull distribution). From Table 5, the Weibull is the most favourable probability distribution for the observed IRI data having the lowest AIC value of the compared distributions, though none of the distributions gives a good fit from the results of the KS test since all p-values are ≤ 0.05.

Fig.6 indicates that the observed SN data are also skewed to the right. The estimated parameters of the fitted distributions are given in Table 6 and represented in Fig.6 similarly to the IRI case. From Table 7, the log-normal is the most favourable probability distribution for the observed SN data having the lowest AIC value of the compared distributions, though none of the distributions gives a good fit from the results of the KS test since all p-values are ≤ 0.05.

Fig.7 indicates that the observed PCI data are skewed to the left. The estimated parameters of the fitted distributions are given in Table 8 and represented in Fig.7 similarly to the IRI and SN cases. From Table 9, the Weibull is the most favourable probability distribution for the observed PCI data having the lowest AIC value of the compared distributions, though none of the distributions gives a good fit from the results of the KS test since all p-values are ≤ 0.05.

Overall it is clear that the distributions of the observed IRI, SN and PCI values are generally skewed and more complex than can be definitively represented by the fitted theoretical distributions. To impute missing data for these variables therefore, an imputation method suited to highly skewed data or a non-parametric imputation method such as missForest is suitable for obtaining accurate results.

(3) Investigation of missing data mechanisms

a) Little’s MCAR test

The commonly used Little’s test was applied in this study to diagnose the probability that data is MCAR. As explained by Li¹⁴⁾, the test checks whether there is a significant difference between the means of different missing value patterns in multivariate normal data. A likelihood ratio test is applied asymptotically based on a chi-squared distribution.

Assume a dataset Y (n × p) containing n observations and p variables has a multivariate normal distribution, i.e. Y ~ N_p(μ, Σ). Let X_i = (X_i1, ..., X_ip) be the vector of values for observation i, X_obs,i the vector of values of the observed variables in case i, and r_i the indicator of missingness for observation i, such that an element of r_i is 1 if the corresponding value of X_i is missing and 0 if it is observed. This gives J distinct missing data patterns, with D_j identifying the variables observed in pattern j for each j = 1, ..., J. Also, μ_obs,j and Σ_obs,j are the vector of means and the covariance matrix respectively of the observed variables in pattern j. The test hypotheses are as follows:

$$H_0: X_{obs,i} \mid r_i \sim N(\mu_{obs,j}, \Sigma_{obs,j}) \quad \text{for all } i \text{ in pattern } j,\; 1 \le j \le J$$

$$H_1: X_{obs,i} \mid r_i \sim N(\gamma_{obs,j}, \Sigma_{obs,j}) \quad \text{for all } i \text{ in pattern } j,\; 1 \le j \le J$$

where γ_obs,j is the vector of means of the observed variables in pattern j but, unlike μ_obs,j, is distinct for each pattern j. Further, if μ* is the maximum likelihood estimate of μ and y̅_obs,j is the sample mean of the observed values for pattern j, the following statistic asymptotically follows a χ² distribution with ∑ p_j − p degrees of freedom, where p_j is the number of observed variables in pattern j:

$$d^2 = \sum_{j=1}^{J} m_j \left(\bar{y}_{obs,j} - \mu^*_{obs,j}\right) \Sigma_{obs,j}^{-1} \left(\bar{y}_{obs,j} - \mu^*_{obs,j}\right)^T$$

where m_j is the number of cases in pattern j such that ∑m_j = n, and μ*_obs,j = μ*D_j.
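The bookkeeping behind the test (the distinct missing data patterns, the number of cases m_j in each, and the degrees of freedom ∑ p_j − p) can be sketched as below. This covers only the pattern counting, not the full test statistic, and the small dataset is hypothetical.

```python
from collections import Counter

def missing_patterns(rows):
    """Count rows per missingness pattern; True marks a missing entry."""
    return Counter(tuple(value is None for value in row) for row in rows)

def little_degrees_of_freedom(rows):
    """Degrees of freedom sum(p_j) - p, with p_j the observed count in pattern j."""
    p = len(rows[0])
    patterns = missing_patterns(rows)
    observed_counts = [p - sum(pattern) for pattern in patterns]
    return sum(observed_counts) - p

# Hypothetical rows of (IRI, SN, PCI) with None for missing values.
rows = [
    (3.2, 2.5, 70.0),
    (None, 2.8, 65.0),
    (None, None, 80.0),
    (None, 3.1, 55.0),
]
patterns = missing_patterns(rows)
dof = little_degrees_of_freedom(rows)
```

Here there are three patterns with 3, 2 and 1 observed variables, so the degrees of freedom are 6 − 3 = 3.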

The test was implemented for each of the variables IRI, SN and PCI in R¹⁸⁾ using the "naniar" package and the "mcar_test" command¹⁹⁾. A sufficiently large likelihood statistic with a p-value ≤ 0.05 led to a rejection of the null hypothesis and the conclusion that the data is not MCAR.

It is cautioned, however, that Little’s test should be treated as only one piece of evidence, mainly because it assumes that the data follow a normal distribution and is not robust to deviations from it^6,¹²⁾.

b) MAR test by logistic regression

Given the inconclusiveness of Little’s test, logistic regression was further applied to diagnose the probability that data is MAR, in which case it would not be MCAR.

Consider a dataset Y (n × p) containing n observations and p variables. Suppose the variable X₁ has missing data and the other variables X₂, X₃, ..., X_p are fully observed. Let r₁ be the indicator of missingness in variable X₁ such that r₁ = 1 for the missing case and r₁ = 0 for the observed case. A logistic regression model can be developed with r₁ as the dependent variable and X₂, X₃, ..., X_p as the predictor variables as follows:

$$\mathrm{Logit}\big(P(r_1 = 1 \mid X_2, X_3, \ldots, X_p)\big) = \beta_0 + \beta_1 X_2 + \beta_2 X_3 + \cdots + \beta_{p-1} X_p$$

where P(r₁ = 1 | X₂, X₃, ..., X_p) denotes the probability of missingness in X₁ given the predictor variables X₂, X₃, ..., X_p, Logit(·) is the log-odds function, β₀ is the intercept and β₁, β₂, ..., β_p−1 are the variable coefficients derived from the model.

Logistic regression models were developed in R using the "dplyr" package²⁰⁾ for missingness of each of the variables IRI, SN and PCI with road class, region, carriageway surface type, road usage and visual surface condition rating as predictor variables. The models also incorporated the observed parts of the other two variables with missing values in predicting each of the dependent variables. This was carried out to take into account all observations in the dataset that could be influencing missingness of data. The coefficients of the predictor variables obtained by each model were then subjected to statistical testing (Wald test inbuilt in the "dplyr" package). If at least one of the p-values associated with a variable coefficient was not significant (> 0.05), this led to an acceptance of the null hypothesis and conclusion that the data is MAR.
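The idea of regressing a missingness indicator on an observed predictor and testing its coefficient can be sketched as below. This is a minimal single-predictor illustration fitted by plain gradient ascent, not the R workflow used in the study; the function names, learning rate, iteration count and the toy data (a binary predictor such as "rated poor or not") are assumptions.

```python
import math

def fit_missingness_logit(x, r, lr=0.5, n_iter=5000):
    """Fit Logit(P(r = 1 | x)) = b0 + b1 * x by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, ri in zip(x, r):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += ri - p
            g1 += (ri - p) * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def wald_p_value(x, b0, b1):
    """Two-sided Wald test p-value for b1, using the inverse Fisher information."""
    i00 = i01 = i11 = 0.0
    for xi in x:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        i00 += w
        i01 += w * xi
        i11 += w * xi * xi
    var_b1 = i00 / (i00 * i11 - i01 ** 2)
    z = b1 / math.sqrt(var_b1)
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical data: 2 of 10 values missing when x = 0, 8 of 10 when x = 1.
x = [0] * 10 + [1] * 10
r = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]
b0, b1 = fit_missingness_logit(x, r)
p_value = wald_p_value(x, b0, b1)
```

In this toy case the fitted slope is about 2.77 with a p-value of about 0.013, i.e. missingness is significantly associated with the observed predictor.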

(4) Imputation using missForest

Based on the findings of the tests in sub-sections (2) and (3) above, the missForest technique was chosen and used for imputing missing values in the IRI, SN and PCI columns.

According to Bianchi’s²¹⁾ description, the initial step of the missForest algorithm involves replacing missing values with the average for continuous variables and the most frequent value for categorical variables. Then, by iteration, a random forest is trained on available observations and used to make predictions of missing values starting with the variable with fewest missing values. A series of random forests are generated until a stopping criterion is met.

As defined by Stekhoven and Buhlmann¹⁰⁾, consider X = (X₁, X₂, ..., X_p) to be an n × p-dimensional data matrix. For an arbitrary variable X_j including missing values at entries i_mis^(j) ⊆ {1, ..., n}, the dataset can be separated into 4 parts:

1. the observed values of X_j, denoted by y_obs^(j);

2. the missing values of X_j, denoted by y_mis^(j);

3. the variables other than X_j with observations i_obs^(j) = {1, ..., n} \ i_mis^(j), denoted by x_obs^(j); and

4. the variables other than X_j with observations i_mis^(j), denoted by x_mis^(j),

where x_obs^(j) is not completely observed since the index i_obs^(j) corresponds to the observed values of X_j, and likewise x_mis^(j) is not completely missing.

The algorithm, incorporating components of the data matrix X and a stopping criterion γ, is represented below:

1. Initially replace missing values with the average for continuous variables and the most frequent value for categorical variables

2. k ← vector of sorted indices of columns in X w.r.t. increasing amount of missing values

3. while not γ do:

4. X_old^imp ← store previously imputed matrix

5. for j in k do:

6. fit a random forest: y_obs^(j) ~ x_obs^(j)

7. predict y_mis^(j) using x_mis^(j)

8. X_new^imp ← update imputed matrix using predicted y_mis^(j)

9. end for

10. update γ

11. end while

12. return the final imputed matrix X^imp
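Steps 1 and 2 of the algorithm above (the initial guess and the imputation order) can be sketched as follows. The random forest fitting of steps 5 to 8 is deliberately left out; the column names and values are hypothetical, and None is assumed to mark a missing entry.

```python
from collections import Counter

def initial_guess(column):
    """Step 1: fill missing entries with the mean (numeric) or the mode (categorical)."""
    observed = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = sum(observed) / len(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

def imputation_order(columns):
    """Step 2: column names sorted by increasing number of missing values."""
    counts = {name: sum(v is None for v in col) for name, col in columns.items()}
    return sorted(counts, key=counts.get)

# Hypothetical columns with None marking missing values.
columns = {
    "IRI": [None, None, 3.0, None],
    "SN": [2.0, None, 4.0, 3.0],
    "PCI": [80.0, 60.0, None, None],
}
order = imputation_order(columns)
sn_start = initial_guess(columns["SN"])
rating_start = initial_guess(["good", None, "good", "poor"])
```

Starting with the least incomplete column means that each forest is trained on predictors that contain as few initial guesses as possible.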

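The stopping criterion γ is met as soon as the difference between the newly imputed data matrix and the previous one increases for the first time with respect to both continuous and categorical variables, if present, or as soon as a user-specified maximum number of iterations is reached. The differences are represented by the following equations:

$$\Delta_N = \frac{\sum_{j \in N} \left(X^{imp}_{new} - X^{imp}_{old}\right)^2}{\sum_{j \in N} \left(X^{imp}_{new}\right)^2}$$

for the set of continuous variables N and

$$\Delta_F = \frac{\sum_{j \in F} \sum_{i=1}^{n} I_{X^{imp}_{new} \neq X^{imp}_{old}}}{\#NA}$$

for the set of categorical variables F,

where #NA is the number of missing values in the categorical variables.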
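The stopping rule can be sketched as two difference measures and a check on their history: a relative squared change for continuous variables, and the share of changed entries among the missing values for categorical variables. The function names and the flattened-list representation of the imputed matrices are assumptions made for illustration.

```python
def delta_continuous(new, old):
    """Squared change between successive imputations, relative to the new values."""
    numerator = sum((a - b) ** 2 for a, b in zip(new, old))
    denominator = sum(a ** 2 for a in new)
    return numerator / denominator

def delta_categorical(new, old, n_missing):
    """Number of entries that changed, divided by the number of missing values."""
    return sum(a != b for a, b in zip(new, old)) / n_missing

def should_stop(delta_history, max_iter):
    """Stop when the latest difference rises above the previous one or the cap is hit."""
    if len(delta_history) >= max_iter:
        return True
    return len(delta_history) >= 2 and delta_history[-1] > delta_history[-2]

# Hypothetical flattened values of the imputed matrix in two successive iterations.
d_n = delta_continuous([1.0, 2.0], [1.0, 1.0])               # 1 / 5
d_f = delta_categorical(["a", "b", "a"], ["a", "a", "a"], 2)  # 1 / 2
```

A history of differences such as 0.5, 0.2, 0.3 would trigger the stop at the third iteration, since the difference has increased for the first time.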

The performance (predictive accuracy) of the imputation is assessed using the NRMSE for continuous variables, represented by the following equation:

$$NRMSE = \sqrt{\frac{\mathrm{mean}\left(\left(X^{true} - X^{imp}\right)^2\right)}{\mathrm{var}\left(X^{true}\right)}}$$

where X^true is the complete data matrix and mean and var are the empirical mean and variance computed over the continuous variables only. For categorical variables, the proportion of falsely classified entries (PFC) is the performance evaluation metric. In both cases, values close to 0 indicate good performance and values close to 1 indicate poor performance. The reliability of the imputation can also be assessed for individual variables, which may be useful in deciding the variables to employ in subsequent analyses²¹⁾. The mean squared error (MSE) is the performance evaluation metric for continuous variables in this case, represented by the following equation:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left(x_i^{true} - x_i^{imp}\right)^2$$

where n is the number of observations with available true values in the arbitrary variable X_j, x_i^true is the true value for observation i and x_i^imp is the imputed value for observation i. A lower value of MSE generally indicates better performance and vice versa for the variables under comparison. PFC remains the metric for categorical variables in variablewise assessment. The imputation was implemented in R using the "missForest" package^22).
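The two accuracy measures for continuous variables can be sketched as below: NRMSE as the root of the mean squared error divided by the variance of the true values, and MSE as the plain mean squared error for a single variable. The use of the sample variance (n − 1 denominator) is an assumption, and the true and imputed values are hypothetical.

```python
import statistics

def nrmse(true_values, imputed_values):
    """Root of the mean squared error normalised by the variance of the true values."""
    squared_errors = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    return (statistics.mean(squared_errors) / statistics.variance(true_values)) ** 0.5

def mse(true_values, imputed_values):
    """Mean squared error for a single variable."""
    squared_errors = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    return statistics.mean(squared_errors)

# Hypothetical true and imputed values for one continuous variable.
true_values = [1.0, 2.0, 3.0, 4.0]
imputed_values = [1.0, 2.0, 3.0, 6.0]
error_nrmse = nrmse(true_values, imputed_values)
error_mse = mse(true_values, imputed_values)
```

Because NRMSE is scaled by the variance, it can be compared across variables with different units, whereas MSE remains in the squared units of the variable itself.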

3. RESULTS AND DISCUSSIONS

(1) Missing data mechanisms

a) Little’s MCAR test results

The results of the Little’s MCAR test for each of the variables IRI, SN and PCI are presented in Table 10. It can be seen from the table that the p-values obtained from the test for each variable are < 0.05. This leads to a rejection of the null hypothesis with the suggestion that the missing data is not MCAR.

b) MAR test results

Figs.8-10 are plots of the p-values associated with the coefficients of the predictor variables in the logistic regression models developed for missingness of each of the variables IRI, SN and PCI. It can be seen from the figures that the p-values for several of the predictor variables are > 0.05. This leads to an acceptance of the null hypothesis with the implication that the missing data is MAR and is related to observed data within the dataset. This result also conclusively indicates that the missing data is not MCAR.

c) MNAR test results

Table 11 presents the results of the chi-square MNAR test for each of the variables IRI, SN and PCI. It can be seen from the table that the p-values obtained from the test for each variable are < 0.05. This leads to a rejection of the null hypothesis with the implication that the MNAR mechanism may not be ignored.

The above results overall give strong evidence that the missing data in the subject dataset is MNAR.

(2) Results of missForest imputation

Having determined that the IRI, SN and PCI data in the paved dataset follow highly skewed complex distributions and that the missing values in these variables are MNAR, the missForest technique was considered suitable for imputing the missing data. As was discussed in Chapter 1, this technique has stood out in many studies as a highly accurate imputation method. missForest’s non-parametric nature and ability to handle complex non-linear interactions, as found by Stekhoven and Buhlmann¹⁰⁾, justify its application to this dataset. Furthermore, missForest was developed as a technique for handling mixed-type data¹⁰⁾ such as the mix of categorical and continuous variables in the study dataset.

The incomplete pavement condition data were imputed by running the missForest algorithm on 19 different combinations of impute variables and predictor variables from the study dataset as shown in Table 12. A graphical representation of the NRMSE values obtained from the missForest analyses is shown in Fig.11. Generally, the graph reveals that imputation of IRI, SN, and PCI missing values together generated smaller NRMSEs (0.08 - 0.29) compared to imputation using pairs of variables (0.20 - 0.67) or individual variables (≥ 0.75). Moreover, imputation of each of the variables separately generated the largest NRMSEs out of all 19 combinations. This could suggest that the predictor variables (road characteristics and visual surface condition rating) have significant correlations or interdependence with IRI, SN, and PCI, which the missForest algorithm takes into account in making accurate predictions of missing values. It is also suggested that IRI, SN, and PCI themselves are strongly correlated, and considering them together realizes a more robust imputation than considering them apart. The smallest NRMSE (0.08) of the 19 combinations is achieved when imputing IRI, SN, and PCI based only on visual surface condition rating. This is compared to NRMSE values of 0.12 and 0.13 generated from imputing IRI, SN, and PCI based on road characteristics alone and a combination of visual surface condition rating and road characteristics, respectively. The visual surface condition rating may possess specific correlations or nuanced relationships with IRI, SN, and PCI that road characteristics lack. This is plausible considering that, like IRI, SN, and PCI, visual surface condition rating intrinsically reflects the state of the pavement, whereas road characteristics have only an indirect influence.
For instance, road class contributes to the pavement state through the design and construction standards attributed to the different classes, different carriageway surface types contribute through their inherent durabilities and environmental factors specific to a region affect the deterioration of the pavement.

Fig. 12 presents the variablewise performance of the missForest imputation in terms of MSE for combinations Nos. 1, 2, 6 and 13 from Table 12, which are the only ones that involve imputing IRI, SN and PCI together. Only these combinations are presented following the previous observation that imputation of the three variables together is more accurate than imputing pairs or each of the variables separately.

As stated in Chapter 2 Subsection (4), checking the variablewise performance of missForest imputation in terms of MSE in addition to the complete matrix imputation performance in terms of NRMSE serves to comprehensively evaluate the reliability of the imputation. It is observed that the variablewise imputation performances of SN and PCI for the four combinations follow similar patterns to those of the complete matrix imputation performances for the same combinations presented in Fig.11. Like in the case of the complete matrix, the smallest MSEs for SN and PCI were realized from imputing IRI, SN and PCI based only on visual surface condition rating. The MSEs for IRI however do not follow this pattern, having almost similar values across the four combinations of impute and predictor variables.

The above results suggest that the imputation was more reliable for the SN and PCI variables than for the IRI variable. This could be attributed to the fact that the missing percentage of IRI in the dataset is very high, at almost 90%. While Ge et al.¹¹⁾ found that missing rate is not a main factor affecting imputation performance, this study suggests that an extreme missing rate in a variable could mean either that it cannot be reliably imputed for further analysis or that significantly more advanced imputation methods would be required to impute it accurately. Hence, from this analysis, only the imputed SN and PCI data can be used for subsequent analysis, while the IRI data would have to be excluded.
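Since the missing rate of each variable drives this conclusion, it is worth computing it before imputation. The sketch below is a minimal illustration in Python, assuming records stored as dictionaries with `None` marking a missing value; the function name `missing_percentage` and the four records are hypothetical and unrelated to the Kenya RICS data.

```python
def missing_percentage(records, variable):
    """Percentage of records in which `variable` is None (treated as missing)."""
    if not records:
        raise ValueError("records must not be empty")
    n_missing = sum(1 for record in records if record.get(variable) is None)
    return 100.0 * n_missing / len(records)


# Hypothetical road sections (not study data); None marks an unmeasured value.
records = [
    {"IRI": None, "SN": 2.8, "PCI": 71.0},
    {"IRI": None, "SN": None, "PCI": 64.0},
    {"IRI": 3.4, "SN": None, "PCI": None},
    {"IRI": None, "SN": 3.1, "PCI": 58.0},
]

rates = {name: missing_percentage(records, name) for name in ("IRI", "SN", "PCI")}
print(rates)
```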

Overall, the imputation points to visual surface condition rating as the most important of all the predictor variables in imputing missing IRI, SN and PCI values. This is significant because the visual survey methodology in the Kenyan context is easy and cheap to implement in terms of technical know-how as well as financial and other resources such as equipment. That it can be used to predict superior pavement condition parameters such as IRI, SN, and PCI through imputation of missing values means that a comprehensive evaluation of the condition of the network is still possible, even without committing the full resources that would be needed to measure these parameters across the entire road network.

4. SIGNIFICANCE AND IMPLICATIONS

This study aimed to provide insight into the important data considerations in the process of imputing missing values in a pavement inventory and condition dataset by applying the missForest imputation technique to the Kenya RICS 2018 paved road network dataset. The dataset contains road characteristics variables - road class, region, carriageway surface type and road usage - as well as pavement condition variables - visual surface condition rating, IRI, SN, and PCI. The IRI, SN and PCI columns contained missing values, which were the focus of imputation. It was recognized that imputing the missing IRI, SN, and PCI data to realize a complete dataset would provide a rich basis for evaluating the health of the Kenya paved road network, especially because the dataset would then have comprehensive functional and structural pavement condition measures, aiding more informed decisions on maintenance and rehabilitation. This further motivated the study.

The pre-analysis examination of the IRI, SN, and PCI data distributions, together with the evaluation of the missing data mechanism to confirm the applicability of the missForest technique to the study dataset, demonstrated how one may approach the selection of a suitable imputation method. The performance of missForest was found to vary across the different data combinations from the study dataset. This demonstrated the value of exploring different data relationships where one has to carry out multivariate imputation with multiple predictors, as in the case of the Kenya RICS dataset, and it also serves to identify the combination(s) that produce the best imputation so that they can be adopted. The missForest approach assesses imputation performance on two fronts: the complete matrix performance and the variablewise performance. The study results demonstrated the importance of checking both performances, lest important aspects of the imputation be masked by checking only the complete matrix performance, leading to wrong conclusions.

The study finding that visual surface condition rating is the most important of all the predictor variables in imputing missing IRI, SN, and PCI values is a useful one, particularly where resources for road network pavement condition assessment are scarce, as in most developing countries. In these cases, one could aim to evaluate the entire road network using the low-cost visual survey while measuring the superior condition parameters such as IRI, SN and PCI on a sample of the network. Relying then on missing data imputation, the visual survey could provide a base for accurately filling in the absent superior condition information to realize a complete and comprehensive database for subsequent pavement and network performance analysis, with confidence that the analysis would yield meaningful results for informed pavement management decisions.

5. CONCLUSIONS

Building on previous studies that have been insightful for understanding the theories of data missingness and the considerations in choosing imputation methods, this study sought to present an implementation framework that practitioners such as road network managers could execute to impute missing pavement condition data with high accuracy towards improving PMS. The study identified a gap with regard to such an implementation framework and endeavoured to contribute to filling it. It was also intended that the framework presented and its application to actual road data in this study would clarify the proper use of machine learning algorithms in data imputation. The study approach would provide respite particularly for developing countries that are still struggling with implementing accurate methodologies for collecting and storing PMS data, so that they can realize comprehensive data for subsequent performance evaluation and modelling of pavement management strategies.

The distributions of IRI, SN, and PCI variables in the Kenya RICS 2018 paved road network dataset obtained from the Kenya Roads Board were examined. The dataset also contained road characteristics variables - road class, region, carriageway surface type and road usage - as well as the pavement condition variable - visual surface condition rating. The IRI, SN, and PCI columns contained missing values and the study aimed to impute them in a bid to obtain a rich dataset for subsequent comprehensive pavement evaluation. Hence, an investigation of the mechanisms of missing data was also conducted. The two analyses provided a basis for confirming the applicability of the missForest technique that was selected for imputing the missing data.

Five probability distributions (normal, exponential, gamma, log-normal, and Weibull) were fitted to the observed parts of each of the variables, and goodness of fit was assessed using the KS test and the AIC. It was found that while the SN data most likely followed the log-normal distribution, the IRI and PCI data most likely followed Weibull distributions. Little’s MCAR test returned p-values < 0.05 for each of the variables IRI, SN, and PCI, indicating that the missing data is not MCAR. The MAR test used logistic regression to relate the missingness of each of IRI, SN, and PCI to road class, region, carriageway surface type, road usage and visual surface condition rating as predictor variables, with each model also incorporating the observed parts of the other two variables with missing values; it showed that p-values for several of the predictor variables were > 0.05. This indicated that the data was MAR. A sensitivity MNAR test by cross-tabulation, proposed in the study to determine whether the combined missingness of IRI, SN, and PCI is associated with missingness in each of the variables, returned p-values < 0.05, revealing that the MNAR mechanism may not be ignored. It was concluded that the IRI, SN, and PCI data distributions are generally skewed and complex. With these data also being MNAR, it was concluded that missForest imputation is applicable in view of its non-parametric nature and ability to handle complex non-linear interactions, in addition to findings of previous studies that it is highly accurate even for MNAR data.
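The AIC part of the distribution comparison can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used in the study: it covers only the three candidate distributions whose maximum likelihood estimates have closed forms (normal, exponential, and log-normal), omits the gamma and Weibull fits, which need numerical optimization, and omits the KS test. The function names and the `observed_sn` values are hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L); smaller is better."""
    return 2 * n_params - 2 * log_likelihood


def gaussian_loglik(values):
    """Maximized normal log-likelihood using the MLE mean and variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def aic_normal(x):
    return aic(gaussian_loglik(x), 2)


def aic_exponential(x):
    rate = len(x) / sum(x)  # MLE of the rate parameter
    return aic(len(x) * math.log(rate) - rate * sum(x), 1)


def aic_lognormal(x):
    logs = [math.log(v) for v in x]
    # The log-normal density adds a Jacobian term of -sum(ln x).
    return aic(gaussian_loglik(logs) - sum(logs), 2)


# Hypothetical right-skewed, strictly positive observations (not study data).
observed_sn = [1.49, 2.01, 2.46, 2.72, 3.00, 3.67, 4.95]

scores = {
    "normal": aic_normal(observed_sn),
    "exponential": aic_exponential(observed_sn),
    "log-normal": aic_lognormal(observed_sn),
}
best_fit = min(scores, key=scores.get)
print({name: round(value, 2) for name, value in scores.items()}, best_fit)
```

The candidate with the smallest AIC is preferred; only the observed (non-missing) part of a variable would be passed to these functions.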

The complete matrix imputation performance in terms of NRMSE showed that the combination of the impute variables IRI, SN and PCI with visual surface condition rating as the predictor variable was the most accurate for imputing missing values in the study dataset compared to the 18 other combinations of impute and predictor variables. It was concluded that a low-cost visual pavement condition survey of an entire road network, combined with measurement of superior condition parameters such as IRI, SN, and PCI on a portion of the network and subsequent imputation, may provide a sufficiently accurate, rich dataset for further pavement performance evaluation, enabling more informed management decisions.

Nonetheless, variablewise imputation performance in terms of MSE with missForest for the study dataset revealed that the imputation was reliable for the SN and PCI variables but not for the IRI variable. This was attributed to a missing percentage of almost 90% in the IRI variable, compared to 71.7% and 47.8% in the SN and PCI variables, respectively. It was concluded that an extreme missing rate in a variable either renders it non-imputable for further analysis or requires significantly more advanced imputation methods to impute it accurately. It was further concluded that only the imputed SN and PCI data in this study can be used for subsequent analysis.

The study has sufficiently demonstrated the important data aspects in the imputation process and the methodology may be used by road network managers as a guide to data-driven imputation of incomplete pavement data to improve PMS. This is particularly useful to developing countries where accurate data collection and storage practices are still not well-developed, with PMS hence highly prone to the problem of missing data.

ACKNOWLEDGMENT

This research was supported by a scholarship for road asset management from the Japan International Cooperation Agency. The authors are also grateful to the Kenya Roads Board for providing the research data.

References

1) You, J., Ellis, J. L., Adams, S., Sahar, M., Jacobs, M. and Tulpan, D. : Comparison of imputation methods for missing production data of dairy cattle, 31 July 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1751731123002185. [Accessed December 2023].
2) Umar, N. and Gray, A. : Comparing single and multiple imputation approaches for missing values in univariate and multivariate water level data, Water, Vol. 15, No. 8, 21 p., 2023.
3) Petrazzinni, B. O., Naya, H., Bello, F. L., Vazquez, G. and Spangenberg, L. : Evaluation of different approaches for missing data imputation on features associated to genomic data, BioData Mining, Vol. 14, No. 1, 13 p., 2021.
4) Diouf, S., Deme, E. H. and Deme, A. : Imputation methods for missing values: the case of Senegalese meteorological data, African Journal of Applied Statistics, Vol. 9, No. 1, pp. 1245-1278, 2022.
5) Marcelino, P. : Improved methods for the imputation of missing data in pavement management systems, 18th Annual International Conference on Pavement Engineering, Asphalt Technology and Infrastructure, Liverpool, 2019.
6) Grace-Martin, K. : How to diagnose the missing data mechanism, 2019. [Online]. Available: https://www.theanalysisfactor.com/missing-data-mechanism/. [Accessed December 2023].
7) Van Buuren, S. : Concepts of MCAR, MAR and MNAR, Flexible imputation of missing data, Boca Raton, Chapman & Hall/CRC, 2018.
8) Misztal, M. : Comparison of selected multiple imputation methods for continuous variables - Preliminary simulation study results, Folia Oeconomica, Vol. 6, No. 339, 26 p., 2018.
9) Hong, S. and Lynn, H. S. : Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction, BMC Medical Research Methodology, Vol. 20, No. 1, 12 p., 2020.
10) Stekhoven, D. J. and Bühlmann, P. : MissForest - non-parametric missing value imputation for mixed-type data, Bioinformatics, Vol. 28, No. 1, pp. 112-118, 2012.
11) Ge, Y., Li, Z. and Zhang, J. : A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods, Scientific Reports, Vol. 13, No. 1, 13 p., 2023.
12) Rouzinov, S. and Berchtold, A. : Regression-based approach to test missing data mechanisms, Data (Internet), Vol. 7, No. 16, 28 p., 2022 (cited 2024 Jan 1).
13) Heymans, M. W. and Twisk, J. W. : Handling missing data in clinical research, Journal of Clinical Epidemiology, Vol. 151, pp. 185-188, 2022.
14) Li, C. : Little’s test of missing completely at random, The Stata Journal, Vol. 13, No. 4, pp. 795-809, 2013.
15) Kenya Roads Board : State of our roads 2018: Summary report on road inventory and condition survey and policy implications, Kenya Roads Board, Nairobi, 2019.
16) Kenya Roads Board : Road inventory and condition survey (RICS) data collection manual, Kenya Roads Board, Nairobi, 2020.
17) Pavement Tools Consortium : Pavement Design - What’s my structural number?, n.d. [Online]. Available: https://pavementinteractive.org/pavement-design-whats-my-structural-number/. [Accessed February 2024].
18) R Core Team : R: A language and environment for statistical computing, Vienna, Austria: R Foundation for Statistical Computing, 2021.
19) Tierney, N. and Cook, D. : Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations, Journal of Statistical Software, Vol. 105, No. 7, pp. 1-31, 2023.
20) Wickham, H., François, R., Henry, L. and Müller, K. : dplyr: A grammar of data manipulation, 2022. [Online]. Available: https://dplyr.tidyverse.org/. [Accessed January 2024].
21) Bianchi, B. : Application of the MissForest algorithm for imputation in the Survey on Income and Living Conditions, 7 October 2022. [Online]. Available: https://unece.org/sites/default/files/2022-10/SDE2022_S4_Switzerland_Bianchi_AD.pdf. [Accessed January 2024].
22) Stekhoven, D. J. : Using the missForest Package, 13 May 2011. [Online]. Available: https://stat.ethz.ch/education/semesters/ss2012/ams/paper/missForest_1.2.pdf. [Accessed January 2024].
