Exploring Spatial Data Mining Techniques: Predicting Zinc Concentration with Kriging Methods and Geographically Weighted Regression

Spatial data mining methods were used to predict zinc concentration in the Meuse River

Durga pujitha Krotha; Fathimabi Shaik; JayaLakshmi Gundabathina; Suneetha Manne

doi:10.14246/irspsd.13.2_145

Abstract

After years of contamination, rivers may accumulate large amounts of heavy metal pollution. The goal of our investigation is to identify hazardous locations along a river. As our study case, we select the floodplains of the Meuse River, which are contaminated with zinc (Zn). Excessive zinc levels may lead to various health issues, including anemia, rashes, vomiting, and stomach cramps. However, few sample data are available on zinc concentration along the Meuse River; as a result, values must be predicted for unsampled regions. This study employs Universal Kriging in spatial data mining to explore and predict unknown zinc concentrations. The semivariogram is a useful tool for representing the spatial variability pattern of zinc. The fitted semivariogram model is then used by the Kriging method to interpolate values in the unsampled regions. Geographically weighted regression makes it possible to see how the relationships between predictors and the response change over space. We use a variety of semivariogram models in our work, namely the Matérn, exponential, and linear models, and we apply Universal Kriging and geographically weighted regression. The experimental findings show that: (i) the Matérn model, as determined by the minimum error sum of squares, is the best theoretical semivariogram model; and (ii) the Universal Kriging predictions can be visually demonstrated by projecting the results onto the real map.

Introduction

"Spatial data mining" is the process of identifying interesting and previously undiscovered patterns in spatial data. Spatial data mining (Franklin, 2005; Setiawan and Rosadi, 2011) is the application of data mining techniques to spatial data (Behrens and Viscarra Rossel, 2020). Because spatial datasets are complex, extracting meaningful and interesting patterns from them is more challenging than extracting corresponding patterns from traditional numeric and categorical data. Research on spatial data mining has advanced significantly as a result of the blending of disciplines. Geostatistics is a multidisciplinary field of study that focuses on the spatial relationships within data. It is applied in numerous disciplines, including geology, forestry, agriculture, and geography (Cressie, 2015; Gunawan, Falah et al., 2016).

Zinc can be added to water to make it resistant to corrosion. River concentrations of dissolved trace metals must be precisely measured to perform sea-level chemical mass balances. The vital mineral zinc has a significant impact on biology and public health (Nakashima and Dyck, 2009). Studies have indicated that a zinc deficiency can exacerbate several diseases, and estimates place the number of people living in developing countries without enough zinc at two billion (Bhutta, Bird et al., 2000). Despite this, biochemical labs hardly ever measure zinc. Ecosystems, in particular, depend on the trace element zinc (Zn). Zinc concentration is assumed to decrease linearly with increasing distance from the river, so heavy metal concentrations are expected to be higher near riverbanks. High concentrations of zinc may have detrimental effects on aquatic and plant life. In biological processes, zinc can outcompete elements such as calcium and copper (Grimm, Behrens et al., 2008).

One of the main tools in geostatistics is kriging (Hussain, 2016), an interpolation technique originally used to forecast mineral reserves (Montero, Fernández-Avilés et al., 2015). Kriging interpolates the available data to fill the gaps at unobserved locations with predicted values. Though its original application was in geostatistics (Ahn, Ryu et al., 2020), kriging is a general statistical interpolation technique that finds use in numerous other areas, including climatology (Choudhury, et al., 2015) and education (Setiawan and Rosadi, 2011). By modelling the spatial variance of the raw data, kriging can estimate attribute values at specific points in a regular grid. The variables must be spatially dependent for kriging to be used to generate maps, and the semivariogram is examined to investigate this spatial dependence (Falah, Annisa Nur, Abdullah et al., 2017; Jiang, 2019). Paramasivam and Venkatramanan (2019) introduced various spatial analysis techniques, including Inverse Distance Weighting (IDW), Nearest Neighbor Inverse Distance Weighting (NNIDW), spline interpolation, and types of Kriging. They adapted these techniques to derive terrain measurements, emphasizing their significance in spatial component analysis (Goovaerts, 2009). Kriging methods, including universal kriging, enable the prediction of values in unknown regions based on nearby observed data points. Universal Kriging, in particular, stands out for its capability to account for both spatial trend and variability in the data, making it suitable for situations where the data show a clear trend or where the mean varies spatially. Because it incorporates additional covariates or trend components, this method offers more flexibility than other kriging methods and can give more robust predictions in such spatial contexts.

There are different types of Kriging (Shekhar, Evans et al., 2011), depending on the stationarity assumption and the stochastic properties of the random variables. Universal kriging (UK) is a spatial interpolation method that combines a deterministic model with a stochastic model. It is a variant of ordinary kriging (Falah, A.N, Hamid et al., 2021; van Beers and Kleijnen, 2003) for non-stationary conditions and is often used on data with a significant spatial trend, such as a sloping surface. It relaxes the assumption of stationarity by allowing the mean of the values to differ in a deterministic way between locations, as in the Meuse River floodplain (Hengl, Heuvelink et al., 2004; Oliver and Webster, 1990). Kriging is well suited to scenarios where spatial data are expensive to obtain and the sample size (n) is therefore small (Middelkoop, 2000). Zinc (Zn) is one of the primary metals that contaminate the floodplain of the Meuse River, so identifying the zinc-contaminated areas is essential. However, the Meuse River's zinc concentration is only partially known, so values must be predicted for unsampled regions. To obtain a prediction index of pollutants at unobserved locations with the Universal Kriging method, we process the Meuse River floodplain data with the gstat and sp libraries in R. The pollutant prediction index of GStat-R has a minimum calculated average Kriging variance, which contributes to its accuracy. Additionally, GStat-R can draw contours that show the concentration and location of pollutants.

Standard regression methods ignore the localized differences between data points and estimate a single set of aspatial trend parameters for all locations. By incorporating location data, the spatial techniques of kriging and geographically weighted regression (GWR) (Al-Shaar, Bonin et al., 2022) can be used to enhance estimates. The practice of kriging goes back to D.G. Krige, a mining engineer who made the first such predictions about soil contents in the early 1950s. In essence, GWR is a spatially weighted estimation of regression coefficients that is repeated across space, with each weighted regression centered on one point of the data set. In addition to its predictive power, GWR generates a suite of parameter estimates that spans the study area (one set for each centering data point). Consequently, the covariate effects appear as a continuously varying surface. Here, GWR and the kriging technique are compared.

The remainder of this paper is organized as follows. Section 2 outlines Kriging methods and discusses Universal Kriging and geographically weighted regression in detail. In Section 3, we apply Universal Kriging and GWR to the zinc pollutant data in the Meuse River dataset and present the experimental results, including their visualization on the Meuse map. Finally, the last section presents the main conclusions of this work.

Methods

To use Kriging or other optimal prediction techniques, we must first ascertain the structure of the spatial correlation. This is known as the structural analysis problem in geostatistics, and it is important for the subsequent Kriging procedure. The accuracy of Kriging depends on the functions that describe the spatial correlation found in the data. These functions must satisfy certain criteria in order to be valid semivariograms. Experimental semivariograms, which are computed from observed datasets, typically do not satisfy these requirements; they must therefore be fitted to one of the theoretical models that comply. After selecting a theoretical semivariogram, we can proceed with spatial prediction using Kriging techniques. Additionally, we use a method known as geographically weighted regression. The overall methodology is shown in Figure 1.

Figure 1. Proposed Work Methodology
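
The experimental semivariogram mentioned above can be sketched in a few lines. The block below is a minimal illustration using the classical (Matheron) estimator, which takes half the mean squared difference of all point pairs in each distance bin. It is not the gstat routine used in this study; the function name, the binning scheme, and the four toy points are assumptions made for the example.

```python
import math

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Classical estimator: half the mean squared difference per distance bin."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])   # separation distance of the pair
            k = int(h // lag_width)               # index of the distance bin
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    result = []
    for k in range(n_lags):
        if counts[k] > 0:                         # skip bins without any pair
            bin_centre = (k + 0.5) * lag_width
            result.append((bin_centre, sums[k] / (2 * counts[k]), counts[k]))
    return result

# Hypothetical toy data: four points on a line, not the Meuse samples.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 4.0, 7.0]
semivariogram = empirical_semivariogram(coords, values, lag_width=1.0, n_lags=4)
for bin_centre, gamma_hat, n_pairs in semivariogram:
    print(bin_centre, round(gamma_hat, 4), n_pairs)
```

Each returned tuple holds the bin centre, the estimated semivariance, and the number of pairs in the bin; a theoretical model is then fitted through these points.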

Dataset Description

The Meuse, also known as the Maas, flows through Belgium, the Netherlands, and France, stretching 925 kilometers before emptying into the North Sea via the Rhine–Meuse–Scheldt delta. Historically, the river has served as a conduit for industrial waste disposal since the onset of the Industrial Revolution. In the dataset, four heavy metals measured in the topsoil of the Meuse River floodplain highlight the distribution of contaminants, primarily accumulating in low-lying areas and along riverbanks due to the river's transport mechanisms. Collected near Stein village, the dataset comprises 155 samples with elevated soil heavy metal concentrations (ppm), along with additional soil and landscape variables shown in Table 1.

Table 1. Meuse River Contamination data

Variables: cadmium, copper, lead, zinc, elev, dist, ffreq, soil, lime, landuse, dist.m

x	y	cadmium	elev	dist	dist.m
181072	333611	11.7	7.9	0.001358	
181025	333558	8.6	6.9	0.012224	
181165	333537	6.5	7.8	0.103029	150
181298	333484	2.6	7.6	0.190094	270
181307	333330	2.8	7.4	0.27709	380
181390	333260		7.7	0.364067	470

Theoretical Semivariogram

The semivariogram displays the spatial autocorrelation of the measured sample points. It quantifies how the variance between data points changes as a function of the distance, or lag, between them. After the semivariance of each pair of locations is plotted against their separation distance, a model is fitted through the points. These models are frequently defined in terms of a handful of particular characteristics. Each semivariogram model used in this work is described below, together with its parameters:

The exponential model resembles the spherical model in that spatial variability rises gradually towards the sill. In the exponential model, the semivariance increases with distance and approaches the sill only asymptotically. The model is continuous but not differentiable at the origin, as shown in Equation (1).

$$\gamma(h) = c\left(1 - \exp\left(-\frac{h}{a}\right)\right) + \lambda \tag{1}$$

Where,

Sill (c): represents the variance of the variable

Range (a): signifies the distance over which spatial correlation is significant

Nugget (λ): represents the variance at very short distances or measurement error

The linear variogram model describes spatial dependence that results in a linear increase in semivariance with distance. It is the simplest type of model and has no plateau, meaning that the user has to select the sill and range arbitrarily. It is defined in Equation (2).

$$\gamma(h) = c \cdot h + \lambda \tag{2}$$

Note: c: sill, λ: nugget
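
As a small illustration, Equations (1) and (2) can be written directly as functions. The parameter values below are hypothetical and chosen only to show the behaviour of the two models; they are not the fitted values from this study.

```python
import math

def exponential_model(h, c, a, nugget):
    """Equation (1): gamma(h) = c * (1 - exp(-h / a)) + nugget."""
    return c * (1.0 - math.exp(-h / a)) + nugget

def linear_model(h, c, nugget):
    """Equation (2): gamma(h) = c * h + nugget."""
    return c * h + nugget

# Hypothetical parameters for illustration only.
c, a, nugget = 0.6, 500.0, 0.0
for h in (0.0, 250.0, 500.0, 2000.0):
    # The exponential model levels off near c + nugget; the linear model keeps growing.
    print(h, round(exponential_model(h, c, a, nugget), 4), round(linear_model(h, 0.001, nugget), 4))
```

At h = a the exponential model has reached about 63% of c, which is why it approaches its plateau only gradually, whereas the linear model has no plateau at all.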

The Matérn variogram model is a generalization of several theoretical variogram models. It controls the continuity of the modelled process through a smoothness (shape) parameter, which must be larger than zero. The Matérn covariance function is named after the Swedish forestry statistician Bertil Matérn. It specifies the covariance between two measurements as a function of the distance between the points at which they are taken, as shown in Equation (3).

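$$\gamma(h) = c\left(1 - \frac{1}{2^{\nu-1}} \cdot \frac{\Gamma(\nu)}{\Gamma\left(\nu + \frac{1}{2}\right)} \cdot \left(\frac{2\nu h}{a}\right)^{\nu} K_{\nu}\left(\frac{2\nu h}{a}\right)\right) + \lambda \tag{3}$$

Note: c: sill, a: range, λ: nugget, ν: smoothness parameter, which dictates the smoothness of the transition between the nugget, partial sill, and the range; K_ν: modified Bessel function of the second kind of order ν.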

Universal Kriging

Kriging is one of several methods that use a limited set of sampled data points to estimate a variable's value over a continuous spatial field. Two examples of values that vary across a random spatial field are the average monthly concentration of ozone over a city and the availability of healthy foods across neighborhoods. Kriging differs from simpler methods such as Gaussian decays, Linear Regression, and Inverse Distance Weighted Interpolation in that it uses the spatial correlation between sampled points to interpolate the values in the spatial field.

The different Kriging techniques have different levels of complexity and underlying assumptions. In Universal Kriging, the mean is allowed to vary across the field, and only the variance is assumed to stay constant. This second-order stationarity (also known as "weak stationarity") of the residuals is often a reasonable assumption when taking environmental exposures into account. Universal Kriging incorporates a deterministic trend, or spatially varying mean, into the kriging prediction model, in addition to the spatial autocorrelation modelled through the semivariogram. For the zinc concentration data in the Meuse dataset, for example, we can generate spatial predictions that integrate both the inherent spatial structure (modelled through the semivariogram) and any identifiable trend.

Under these assumptions, Universal Kriging can be expressed as a combination of the deterministic trend and the kriging predictor, Z(u) = μ(u) + ε(u), where Z(u) is the estimated value at the unsampled location u. The deterministic trend component μ(u) can take various functional forms, and the kriging residual ε(u) is obtained by applying the kriging weights to the detrended observed values, as expressed in Equation (4).

$$\varepsilon(u) = \sum_{i=1}^{n} \lambda_i(u)\,\left[ Z(u_i) - \mu(u_i) \right] \tag{4}$$

Where,

$\lambda_i(u)$: represents the kriging weights assigned to the sampled locations based on their spatial relationships with the prediction location $u$.

$Z(u_i)$: denotes the observed value at location $u_i$

$\mu(u_i)$: value of the deterministic trend at location $u_i$
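
The decomposition Z(u) = μ(u) + ε(u) in Equation (4) can be sketched as follows. This is a simplified two-step illustration, not the gstat implementation used in this study: the trend is fitted by ordinary least squares, and the residuals are then interpolated with ordinary kriging weights, whereas a full Universal Kriging solver estimates the trend and the weights jointly. The one-dimensional data, the exponential semivariogram parameters, and all function names are assumptions made for the example.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def gamma(h, c=1.0, a=2.0, nugget=0.0):
    """Exponential semivariogram with hypothetical parameters; gamma(0) = 0."""
    return 0.0 if h == 0 else c * (1.0 - math.exp(-h / a)) + nugget

def fit_linear_trend(xs, zs):
    """Least-squares line mu(x) = b0 + b1 * x (assumed trend form)."""
    n = len(xs)
    mx, mz = sum(xs) / n, sum(zs) / n
    b1 = sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / sum((x - mx) ** 2 for x in xs)
    return mz - b1 * mx, b1

def predict(u, xs, zs):
    """Return (Z(u), weights): trend at u plus kriged residual, as in Equation (4)."""
    b0, b1 = fit_linear_trend(xs, zs)
    residuals = [z - (b0 + b1 * x) for x, z in zip(xs, zs)]
    n = len(xs)
    # Ordinary kriging system: semivariances between samples plus a sum-to-one constraint.
    A = [[gamma(abs(xs[i] - xs[j])) for j in range(n)] + [1.0] for i in range(n)]
    A.append([1.0] * n + [0.0])
    rhs = [gamma(abs(u - x)) for x in xs] + [1.0]
    weights = solve(A, rhs)[:n]          # drop the Lagrange multiplier
    eps = sum(w * r for w, r in zip(weights, residuals))
    return b0 + b1 * u + eps, weights

# Hypothetical 1-D samples (location, value); not the Meuse data.
xs = [0.0, 1.0, 2.5, 4.0, 5.0]
zs = [10.0, 9.0, 7.5, 5.0, 4.5]
z_new, w_new = predict(3.0, xs, zs)
print(round(z_new, 3), [round(w, 3) for w in w_new])
```

With a zero nugget, the predictor reproduces the observed value when u coincides with a sampled location, and the weights always sum to one because of the constraint row.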

Example: Figure 2 illustrates the Universal Kriging model along the X and Y coordinates using sample data. The blue dashed line represents the deterministic trend (μ(u)), while the green points represent the Kriging predictor (Z(u)). The red points indicate the observed data, showing how they relate to the trend and the Kriging prediction. Annotations highlight specific values along the trend and prediction line.

Figure 2. Universal kriging model with Deterministic trend and Kriging Predictor

Geographically weighted regression

Geographically Weighted Regression (GWR) is an analysis method for spatial point data that also allows values missing from the data set to be interpolated. It is applied with the knowledge that the direction and strength of a relationship between a dependent variable and its predictors may vary due to contextual factors. In short, for every target location in the dataset, GWR fits a separate OLS equation that relates the dependent and explanatory variables of the locations within the bandwidth of that target location. The bandwidth can be entered manually by the user. GWR thus captures the spatial heterogeneity in the relationships by estimating regression parameters locally for different locations within a study area (Al-Shaar, Bonin et al., 2022). The general GWR model is defined in Equation (5).

$$Y_i = \beta_{0i} + \beta_{1i} X_{1i} + \beta_{2i} X_{2i} + \dots + \beta_{pi} X_{pi} + \varepsilon_i \tag{5}$$

Where,

$Y_i$: dependent variable at location i

$X_{1i}, X_{2i}, \dots, X_{pi}$: independent variables at location i

The coefficients $\beta_{0i}, \beta_{1i}, \beta_{2i}, \dots, \beta_{pi}$ are estimated locally at each location, capturing the spatially varying relationships between the dependent and independent variables, and $\varepsilon_i$ represents the error term or residual at location i. GWR is a technique that was developed for analyzing spatial point datasets in geography and related sciences (Páez, Antonio, 2005; Páez, A. and Wheeler, 2009). It has caught the attention of many researchers due to its ability to investigate nonstationary relationships in regression analysis (Selby and Kockelman, 2013). The approach behind GWR is based on the idea that contextual factors can influence the strength and direction of the relationship between dependent and independent variables (Wu, He et al., 2021). Figure 3 illustrates the GWR method using sample data points, showing how local regressions are performed at each location (blue lines) within a defined bandwidth. The red dots represent the observed data points, while the green dots show the GWR predictions, which adapt to the spatial variation in the relationships across the study area.

Figure 3. Visualization of GWR with local regressions and predictions
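
The local estimation behind Equation (5) can be sketched for a single predictor as a weighted least-squares fit whose weights decay with distance from the target location. This is a minimal illustration, not the spgwr routine used in this study; the Gaussian kernel form exp(-0.5 (d / b)^2), the function names, and the toy data are assumptions made for the example.

```python
import math

def gaussian_weight(d, bandwidth):
    """Gaussian kernel: the weight decays with distance d from the target location."""
    return math.exp(-0.5 * (d / bandwidth) ** 2)

def local_fit(target, coords, x, y, bandwidth):
    """Weighted least squares for y = b0 + b1 * x, centred on `target`."""
    w = [gaussian_weight(math.dist(target, c), bandwidth) for c in coords]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw      # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw      # weighted mean of y
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1                              # local intercept and slope

# Hypothetical data: the slope of y on x is 1 in the west and 3 in the east.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]

west = local_fit((1.0, 0.0), coords, x, y, bandwidth=1.0)
east = local_fit((4.0, 0.0), coords, x, y, bandwidth=1.0)
print("west slope:", round(west[1], 2), "east slope:", round(east[1], 2))
```

The two local slopes differ because each fit is dominated by the observations nearest to its target location. A larger bandwidth would pull both estimates towards the single global OLS slope.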

Results: Analysis of Semivariogram Models and Zinc Predictions

Semivariogram models

Theoretical semivariogram models depend on the selection of three parameters, namely sill (c), range (a), and nugget (λ). From the plot of the experimental semivariogram, we can assume that the nugget is 0, the range lies between 300 and 700, and the sill is between 0.5 and 1. Based on these parameters, the best theoretical semivariogram model is identified by fitting candidate models to the experimental semivariogram. Three theoretical models are fitted, namely Matérn, exponential, and linear. The fitted semivariogram models are presented in Figures 4, 5, and 6. The model with the smallest sum of squared errors is selected as the best model.

Figure 4. Exponential semivariogram model fitting

Figure 5. Linear semivariogram model fitting

Figure 6. Matern Semivariogram model fitting

The best semivariogram model, with the minimum sum of squared errors (SSE), is the Matérn model, as shown in Table 2.

Table 2. Minimum sum of squared error of the semivariogram models

Models	Minimum sum of squared error (SSE)
Exponential	1.628328e-05
Linear	1.494981e-05
Matern	1.093181e-05
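
The selection rule used for Table 2 (choose the model with the smallest SSE) can be sketched as follows. The experimental semivariogram values and the candidate parameters below are hypothetical, so the outcome of this toy comparison says nothing about the ranking reported in Table 2. The sketch also uses an unweighted SSE, which is an assumption; fitting routines may weight the lags differently.

```python
import math

def sse(model, lags, gammas):
    """Unweighted sum of squared errors between a model and experimental values."""
    return sum((model(h) - g) ** 2 for h, g in zip(lags, gammas))

# Hypothetical experimental semivariogram (lag distance, semivariance).
lags = [100.0, 200.0, 300.0, 400.0, 500.0]
gammas = [0.20, 0.35, 0.45, 0.52, 0.56]

# Candidate models with hypothetical, already-fitted parameters.
candidates = {
    "exponential": lambda h: 0.65 * (1.0 - math.exp(-h / 250.0)),
    "linear": lambda h: 0.0012 * h,
}

scores = {name: sse(model, lags, gammas) for name, model in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```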

After a theoretical semivariogram was selected, 3103 location points were predicted using Universal Kriging and the gstat package in R. The Universal Kriging output summarized in Table 3 shows a root mean squared error of 216.759, which indicates the predictive accuracy of the model.

Table 3. Output Summary of Universal Kriging Prediction

Figure 7 shows contour maps, generated with the ggplot function in the R programming language, of (a) the Universal Kriging predictions, which are spatial estimates of zinc based on surrounding data points, and (b) the Universal Kriging variance, which illustrates the uncertainty associated with those predictions. Together they aid in understanding the spatial patterns and the reliability of the predictions.

Figure 7. Predictions of Universal Kriging (a) and Variance of Universal Kriging (b)

Figure 8 projects the Universal Kriging predictions onto a real map. The resulting pattern appears plausible, although no verification data are available to confirm its accuracy. With the help of the leaflet library in R, we can create an interactive map that displays the Universal Kriging predictions for zinc concentrations. By utilizing color scales, tooltips, and a legend, users can easily understand the spatial distribution of zinc, which is helpful for visualization and analysis.

Figure 8. Map projection that uses colors to show areas with high and low zinc concentrations

Geographically weighted Regression analysis (GWR)

GWR can be performed with the spgwr package in R. Figure 9, produced with the ggplot function, displays (a) a contour map of the predicted zinc concentration and (b) a plot of the standard errors of the predictions, which illustrates the degree of uncertainty in the estimated zinc values across the study area.

Figure 9. The predictions of zinc by using GWR (a) and Standard errors of the GWR (b)

A GWR analysis produces a wealth of information, including a global model summary that represents the traditional regression results across the entire dataset. What sets GWR apart, however, are the local model statistics, which break the analysis into multiple local models, each tailored to a specific geographic area. Here, the GWR model predicts zinc concentrations using a Gaussian kernel function with a fixed bandwidth of 228 units and generates coefficient maps that show the spatial variation of each variable's effect. Local residuals and p-values provide information about the model's performance and significance, while local parameter estimates display the spatial variation of the regression coefficients. The quasi-global R-squared (R²) value is 0.864 (about 86%), as shown in Table 4.

Table 4. Output summary of Geographically weighted regression

Description	Value
Fixed bandwidth	228
Number of data points	155
Effective number of parameters (residual: 2traceS - traceS'S)	59.32127
Effective degrees of freedom (residual: 2traceS - traceS'S)	95.67873
Sigma (residual: 2traceS - traceS'S)	171.8668
Effective number of parameters (model: traceS)	46.7098
Effective degrees of freedom (model: traceS)	108.2902
Sigma (model: traceS)	161.5494
Sigma (ML)	135.0311
AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21)	2099.725
AIC (GWR p. 96, eq. 4.22)	2007.287
Residual sum of squares	2826179
Quasi-global R2	0.8638016
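
The entries of Table 4 are related by simple arithmetic, which the following sketch checks. The relations used here (effective degrees of freedom equal n minus the effective number of parameters, each sigma equals the square root of the residual sum of squares divided by the matching degrees of freedom, and the quasi-global R² equals one minus the ratio of residual to total sum of squares) are inferred from the reported numbers and should be read as assumptions, not as the package's documented definitions.

```python
import math

# Values copied from Table 4.
n = 155
rss = 2826179.0
enp_residual = 59.32127     # effective number of parameters (residual)
enp_model = 46.7098         # effective number of parameters (model)
r2_quasi_global = 0.8638016

# Assumed relations between the table entries.
edf_residual = n - enp_residual
edf_model = n - enp_model
sigma_residual = math.sqrt(rss / edf_residual)
sigma_model = math.sqrt(rss / edf_model)
sigma_ml = math.sqrt(rss / n)
tss_implied = rss / (1.0 - r2_quasi_global)   # total sum of squares implied by R²

print(round(edf_residual, 5), round(edf_model, 4))
print(round(sigma_residual, 4), round(sigma_model, 4), round(sigma_ml, 4))
print(round(tss_implied))
```

Under these assumptions, the computed degrees of freedom and sigma values agree with the corresponding rows of Table 4 to the reported precision.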

Discussion: Policy Implications of Zinc Contamination Predictions

This study presents an approach to predicting zinc contamination in the Meuse River floodplains using Universal Kriging and geographically weighted regression. The Matérn semivariogram model achieved the lowest sum of squared errors among the fitted models (1.093181e-05, Table 2), which supports more precise spatial predictions. This prediction accuracy is crucial for effectively managing environmental risks and developing targeted remediation strategies.

Our research builds on established methodologies in spatial data analysis, contributing new insights into the prediction of heavy metal contamination. Gunawan, Falah et al. (2016) utilized ordinary point Kriging for predicting unobserved zinc pollutants, achieving notable accuracy in their spatial predictions. Their approach, while effective, does not incorporate the techniques we used, such as geographically weighted regression and Universal Kriging, which provide a more nuanced analysis of spatial variability. Falah, A.N, Hamid et al. (2021) applied the ordinary co-Kriging method to predict coal quality variables, focusing on interpolation at unobserved locations. Our study builds on this by applying similar techniques to environmental contamination, where the Matérn model achieved the lowest sum of squared errors (1.093181e-05, Table 2). Behrens and Viscarra Rossel (2020) discussed the interpretability of predictors in spatial data science, emphasizing the importance of advanced models for accurate spatial predictions. Our study aligns with this emphasis by applying geostatistical methods to improve prediction accuracy. Paramasivam and Venkatramanan (2019) provided an overview of spatial analysis techniques, which informed our approach but did not specifically address the integration of these methods into practical environmental monitoring and policy-making.

In contrast to the study by Safaa (2023), which uses GIS techniques to monitor urban growth and land use changes in Irbid City, our research focuses specifically on predicting zinc contamination. While Alwedyan's work provides critical data for urban planning and policy-making by tracking changes in land use, our study extends the application of geostatistical methods to offer detailed spatial predictions of zinc pollution, highlighting specific high-risk areas for targeted interventions. Similarly, while Sharma, Saini et al. (2023) assess the effectiveness of land use planning in India by developing a systematic approach through expert surveys and criteria, our research applies Universal Kriging and Geographically Weighted Regression to predict environmental contamination. This allows for more precise spatial analysis, which is crucial for addressing specific pollutant issues rather than general land use management. Finally, Fatmawati, Aurora et al. (2024) examine the impact of urbanization on vegetation dynamics in the Tama River Basin, focusing on the relationship between urbanization, vegetation, and climate, whereas our research centers on the spatial distribution of zinc contamination, which informs ecological protection strategies and policy-making by identifying critical areas for intervention.

Our study's limitation is the focus on a single pollutant, which may not account for the complex interactions between different contaminants and environmental factors. Future research should expand to include multiple pollutants and consider temporal variations in contamination levels.

While our study primarily focuses on technical aspects, its findings have significant implications for policy-making and spatial planning. By providing detailed predictions of zinc contamination, our study supports the development of targeted policies and remediation strategies. This targeted approach helps address specific contamination issues, improving the effectiveness of environmental regulations and resource allocation. The findings also contribute to economic development and social equality by identifying areas where pollution mitigation can reduce health risks and economic burdens on vulnerable communities. Additionally, the data supports ecological protection efforts by highlighting regions that require conservation measures to mitigate the impact of pollution on natural habitats. In summary, this study not only advances the technical methods for predicting environmental contamination but also integrates these methods into practical applications for policy and planning. By addressing the limitations and suggesting future research directions, we ensure that our findings contribute to a more comprehensive understanding of environmental risks and support effective urban and environmental management.

Conclusion

In this research, we conducted spatial analysis and prediction tasks using several techniques, including fitting different semivariogram models, namely exponential, linear, and Matérn. The Matérn model exhibited the lowest sum of squared errors (1.093181e-05) among the models, which guided our subsequent Universal Kriging predictions. By employing Universal Kriging, we obtained a root mean squared error of 216.759, providing insight into the model's predictive accuracy. We visualized these predictions on a real map, which facilitated the identification of areas with high and low zinc concentrations. Geographically Weighted Regression (GWR) played a crucial role in spatially predicting data points based on both dependent and independent variables, yielding a quasi-global R-squared value of about 86%, which underscores the effectiveness of GWR in capturing spatially varying relationships.

While Universal Kriging assumes an overarching trend in the data, GWR provides localized insights, highlighting how geographical factors influence variable relationships. These techniques hold promise for predicting various phenomena, such as pollutants or soil characteristics, and for understanding spatial patterns in diverse fields like education and health. Our research, by employing Universal Kriging and Geographically Weighted Regression, provides precise spatial predictions of zinc contamination. This is crucial for informing ecological protection efforts, guiding policy-making with targeted remediation strategies, and enhancing spatial planning by identifying high-risk areas for focused interventions. Future research should explore the integration of additional environmental variables and socioeconomic factors to develop more comprehensive models and enhance the applicability of these techniques across different contexts.

Ethics Declaration

The authors declare that they have no conflicts of interest regarding the publication of the paper.

Author Contributions

Durga pujitha Krotha developed the theory, computations, and validated the methods. Fathimabi Shaik provided conceptualization, supervision. Jayalakshmi Gundabathina encouraged Durga pujitha Krotha to investigate the findings of this work. Suneetha Manne conceived of the presented idea. Durga pujitha Krotha wrote the manuscript. All authors reviewed the results and approved the final version of the manuscript.

References
