論文ID: D-24-00033
Machine learning (ML) techniques have been increasingly used to estimate crop yields at scales ranging from on-site to global. Since ML techniques are data-driven approaches, it is empirically known that the performance of a specific ML algorithm depends on the manner in which the training dataset is compiled. However, few studies have quantitatively evaluated the performance. In this study, global rice yields were estimated through a random forest (RF) methodology. Performance dependency of RF on training data was examined by a comparison of estimated yields using different training datasets covering different yield ranges and geographical extents. First, 14 explanatory variables collected from different sources (satellite vegetation, meteorology, and geographical location data) were used for building RF regressors. The crop calendar was determined from a combination of satellite vegetation and crop model simulation. Next, RF regressors were trained to give census-based rice yields (used as reference yields) from training datasets of the 14 explanatory variables. By applying the RF regressors to validation datasets, misfits between estimated and the reference yields were evaluated. RF reproduced rice yields, but the accuracy depended on the training data. Yields beyond the yield range of the training data could not be reproduced by RF. This indicates that the yield range of the training data determined the possible range of estimated yield. Among the 14 variables, geographical coordinates (longitude and latitude) ranked the highest importance, i.e., played a crucial role in estimating yields. The RF regressors built from the 14 variables outperformed those built only from the geographical coordinates in accuracy but with limited advantage. We concluded that (1) choosing training data to cover all possible yield ranges of the target rice-cropping areas was crucial for accurate yield estimation using RF and (2) incorporating satellite and simulation data was advantageous for building high-performance RF regressors.
Crop production is one of the most fundamental datasets in agricultural censuses because it directly affects the food supply that sustains healthy human lives and activities in good nutritional conditions. Since crop production varies from place to place in relation to meteorological conditions, environments, and cultural practices, global crop data have been widely used in agricultural studies (e.g., Monfreda et al., 2008). Currently, crop production data at high geographical resolution, finer than 10 km (e.g., Wu et al., 2023), has become important. Moreover, to achieve higher geographical resolution, understanding the spatially varying nature of crop production is necessary, raising technological challenges in the development of new analytical schemes.
Several data sources are available for understanding crop production; however, they have both merits and demerits. For example, the agricultural censuses or statistical yearbooks compiled by national/local governments provide postharvest crop production data, which was collected systematically from each administrative unit. Normally, reliable official data are provided, but the quality of the data from some developing countries is questionable (Iizumiet al., 2021; Kim et al., 2021). Since census data are not provided through the Internet in some countries, data access is another problem. Enhancing the geographical resolution to be finer than the geographical extent of administrative units becomes also difficult. Secondly, optical and radar sensors onboard Earth observation satellites routinely scan Earth’s surface and retrieve information on crop vegetation, which can be empirically converted into crop production (e.g., Lobell et al., 2015; Azzari et al., 2017; Deines et al., 2021). This approach has merits in obtaining global real-time crop growth at a fine spatial resolution (e.g., subkilometers), but optical observations are often affected by cloud cover. Moreover, numerical simulations using a crop model are feasible for obtaining the geographical distribution of crop production. The accuracy of numerical simulations has been improved with the aid of reliable meteorological forcing datasets; however, the parameters of the crop models must be well tuned prior to the analysis. Ecophysiological processes that are not incorporated into a model can emerge as an error source (Li et al., 2019b). Among all data sources, only census data are directly linked to crop production observations, whereas satellite-based and simulated data include crop production estimates.
Spatial homogeneity is also of great concern in the compilation of global data. The geographical distribution of crop fields is intrinsically heterogeneous because crops are grown under environmentally (e.g., climate, water availability, and soil fertility) and socially (e.g., labor forces, infrastructure, and proximity to food-demanding populated areas) suitable conditions. Additionally, large geographical gaps between the locations of available census data increase the heterogeneity. To fill in these data gaps, spatial interpolation or extrapolation must be relied upon; however, these techniques are at risk of magnifying unintentional errors, especially at long distances from available data sites.
Recently, cutting-edge machine learning (ML) techniques as well as traditional interpolation techniques have been used for spatial interpolation. ML is advantageous for analyzing complex systems and seeking robust solutions using a data-driven approach. Nowadays, various ML techniques run on personal computers as software packages. A random forest (RF) is one such technique that is advantageous for obtaining accurate solutions through training regressors (internal calculation units to be adjusted by building a series of regression trees from training data, i.e., supervised data). Since robust solutions can be found even in nonlinear systems, RF has widely spread among researchers in various fields of study (Prasad et al. 2006; Li et al., 2011; da Silva Júnior et al., 2019).
In estimating crop yield or production, ML has rapidly gained popularity, especially in these several years (Clark et al., 2023). Target areas range widely, from field-scale (Mariano and Mónica, 2021; Pinto et al., 2022; Amankulova et al., 2023), multiple states (Sakamoto, 2020; Schwalbert et al., 2020; Wang et al., 2020; Medina et al., 2021; Bowden et al., 2023; Dhillon et al., 2023), country-level (Cai et al., 2019; Folberth et al., 2019; Gómez et al., 2021), to the global scale (Jeong et al., 2016; Vogel et al., 2019). In most of these studies, multiple data sources (e.g., remote sensing, weather) as well as yield data were incorporated into ML by making the most of its ability of finding solutions for nonlinear systems. RF is the most frequently used ML technique for yield prediction (Clark et al., 2023) and promising in terms of high accuracy (Jeong et al., 2016; Gómez et al., 2021). RF was also utilized in agricultural studies other than crop production, e.g., cropland distribution (Phalke et al., 2020). However, the nature of RF interpolation techniques for yield estimation remains unclear, e.g., whether the accuracy depends on the choice of training data.
To summarize these circumstances, a combination of multiple data sources and an RF-based spatial interpolation methodology were considered keys to the compilation of crop datasets with higher accuracy at a finer geographical resolution than conventional datasets. To realize the compilation of yield data along this course, understanding the nature of the analysis methods is important because this has not yet been fully disclosed in the literature. Additionally, crop data are spatially heterogeneous. Therefore, it would be beneficial to provide readers with precautions and lessons learned by analyzing such data.
In this study, the crop yield was estimated using a combination of different data sources (census data, satellite data, and crop model output) and evaluated for their performance. Furthermore, the performance of RF in estimating crop yield as a complete spatial interpolation case was compared with the performance of RF incorporating other variables that affect yield. To focus on the analysis target, the spatially varying nature of crop yield was prioritized over temporally varying yields using yield data averaged over multiple years. In particular, rice yield was focused on because rice is one of the major staple food crops and paddy fields are heterogeneously distributed worldwide. This study aimed to enrich practical expertise and collect technical tips for the application of RF in agricultural studies.
The main tasks of this study were twofold: (1) to examine the feasibility of rice yield estimation using RF with a combination of multiple data sources (schematics presented in Fig. 1); and (2) to understand the applicability of RF as a spatial interpolation method. During the training process, RF regressors were built to estimate the rice yield (objective variable) from a set of candidate factors (explanatory variables) that contribute to the yield. In the validation process, regressors were applied to the validation data and compared with the estimated rice yield. By evaluating the performances of various types of regressors, expertise on building regressors was gained, and it was clarified which variables could achieve better performance.
For task (2), the performances of RF using 14 explanatory variables (i.e., solving a problem as shown in Fig. 1) and RF using only geographical coordinates (i.e., solving a problem as a complete spatial interpolation) were compared (see Section 2.9).
Fig. 1.Schematic diagram of the analysis used in the study. Abbreviations (intgr.) and (avr.) represent integrated and averaged variables, respectively, throughout the growing period of rice. See Sections 2.2-2.8 for details.
In this study, global census-based rice data compiled by Monfreda et al. (2008) were used as a reference (i.e., both as supervised data in the training process and as reference data to be compared with the estimated data in the validation process). Monfreda et al. (2008) provided global grid data on both yield and harvesting area ratio (R) for 175 distinct crops, including rice, based on census data around 2000 (average of 1997-2003; i.e., the data were treated as a single snapshot at 2000), with a spatial resolution of 5 min in longitude and latitude (hereafter, called the Monfreda cell). They used census data of local administrative units up to one level below the national level for 150 countries and up to two levels below the national level for 73 countries. Thus, a single yield value was saved for multiple Monfreda cells belonging to the same local administrative unit. Since rice was harvested multiple times per year, the harvested areas were summed up. Thus, R can sometimes exceed 100%.
In this manner, the yield data saved in adjacent Monfreda cells were not always independent of each other. In the case of an administrative unit with a vast geographical extent which covers multiple Monfreda cells, rice-growing environments (e.g., climate, soil, and water availability) largely change within the unit. This might deteriorate the RF performance because regressors would be trained to irrationally output the same yield value among cells under different environmental conditions. We considered that it would be adequate to choose only one representative Monfreda cell from each administrative unit and to train RF regressors using a collection of representative Monfreda cells over the globe.
Representative Monfreda cells were chosen in two steps: (1) First, the geographical extent (denoted as E) with the same rice yield value was identified. This E probably corresponds to an administrative unit. (2) Among Monfreda cells with various R values in each E, a cell with the highest harvest area ratio (
Global crop growth has been monitored every 16 days with an EVI observed by the satellite-board Moderate Resolution Imaging Spectroradiometer (MODIS) at approximately 500 m resolution (hereafter referred to as an EVI cell). The MOD13A1 Version 6.1 dataset (Didan et al., 2015) was used in this study.
Data from five-years (2000-2004) were used in this study. Ideally, the data period must match that of the reference (i.e., Monfreda’s) yield data (typically 1997-2003). However, the MOD13A1 Version 6.1 EVI dataset covered only the period after 2000. Additionally, occasionally missing data were observed during the initial phase of the data period. To secure enough valid observations to dampen the year-to-year variation in rice yield, the analysis period was set to 2000-2004.
A Savitzky-Golay filter (Savitzky and Golay, 1964; some coefficients were corrected by Steinieret al., 1972) was used in a seven-point (i.e., the center point and three points on one side) quadratic polynomial form to smooth the EVI time series. The Savitzky-Golay filter is advantageous for retrieving vegetation signals from observed EVI time series (Chen et al., 2004). No filter was applied for geographical extent.
2.4 Crop model dataFor the crop model simulation, the Crop Yield Growth Model with Assumption on Climate and Socioeconomy (CYGMA) (Iizumiet al., 2017) was used. CYGMA was run globally with a meteorological forcing dataset using the WATCH Forcing Data methodology applied to ERA5 reanalysis data (WFDE5) (Cucchi et al., 2020) at a 0.5° resolution. Notably, the geographical resolutions of CYGMA and WFDE5 were the coarsest among the datasets used in this study; therefore, the same CYGMA output values were used for multiple Monfreda or EVI cells belonging to the same CYGMA cell. In the model simulation, uncertainties in the crop calendar can affect the simulation results (i.e., crop growth and production); considering the uncertainties, 73 runs were conducted by changing the sowing dates from the day of year (DOY) = 1 (January 1st) to DOY = 365 (December 31st) by five days, as in Iizumiet al. (2021), and one from them was selected (see Section 2.5). The model outputs the yield, biomass, flowering date, harvest date, and five stress factor indices (cold stress, heat stress, nitrogen shortage, water shortage, and water excess). Two parallel sets of CYGMA simulations were performed for each CYGMA cell under rain-fed and irrigated conditions, and one of them was selected according to the cropland type (Section 2.6).
For comparison with the Monfreda yield data, various output variables were simply averaged over 2000-2004, which is similar to the EVI data period (Section 2.3). As some variables have nonlinearity to yield, the validity of simply averaging the indices over the years might be arguable. For example, stress indices are sometimes rather multiplicative than additive in nature. Under compound effects of multiple stresses (e.g., heat and drought) or extremely high stresses, yield loss will be larger than a simple sum of stress indices. However, strict evaluation of such nonlinear responses is very complicated, a simple arithmetic mean was adopted for simplicity in this study.
2.5 Determination of the growing periodIn principle, the growing period of rice can be determined from the EVI time series by reading the rising and falling times from or onto the base level. However, this is impractical because the rising and falling profiles cannot be sharp owing to time filtering and observation noise. In contrast, the peak timing at which the maximum EVI was recorded could be determined more precisely than the rising/falling timing, even for time-filtered profiles. In this study, with the help of multiple runs of CYGMA simulations, the growing period (the period from sowing to harvesting) was determined in a hybrid manner using both satellite observations and simulations (see Fig. 2).
First, among the 73 CYGMA runs of different sowing dates, one was sought such that the flowering date calculated with the CYGMA model was closest to the maximum EVI date, because the maximum EVI is expected to be achieved at the flowering date of rice. Second, the sowing date was read from the run settings, and the harvest date was recorded for each run. These data were determined from the EVI time series.
Fig. 2.Scheme for determining growing period with a combination of satellite observation and crop model simulation. The sowing and harvest dates of observed EVI are regarded as those of a simulation run of which the flowering date is closest to the maximum EVI date.
Training and validation datasets must be independent each other. In this study, we discriminated between training and validation data by the value of
Practically, the values of
Therefore, the discrimination of data by the value of
A training dataset was prepared in two steps: (1) screening Monfreda cells by setting the lowest threshold of
then, several training datasets were prepared by varying
Next, the non-cropland EVI cells were excluded using the dataset called Global Rainfed, Irrigated, and Paddy Croplands (GRIPC) (Salmon et al., 2015). Notably, 400 EVI cells were present in a single Monfreda cell because geographical resolutions of EVI and Monfreda cells were approximately 500 m and 10 km, respectively. In GRIPC, croplands were classified into three categories (rainfed, irrigated, and paddy croplands) at approximately 500 m resolution, matching the EVI resolution. EVI cells classified into any of the three cropland categories were selected (i.e., EVI cells classified as non-cropland were excluded) for further analysis. Ideally, selecting only paddy croplands would be suitable for rice analysis; however, in practice, perfectly discriminating paddy cropland cells from wetland/marsh cells is a difficult task in land use classification using satellite data because both land uses are inundated for a certain period of the year and have an annual cycle of plant growth. Furthermore, upland rice is grown in rain-fed environments. By considering the risk of misclassification of croplands, a safe margin was considered by using GRIPC data as a “cropland” mask to exclude non-cropland cells (such as urban buildings, forests, and water surfaces).
The GRIPC dataset was also used to discriminate between the rainfed and irrigated CYGMA runs (Section 2.4). For EVI cells classified as irrigated or paddy croplands, the outputs of irrigated CYGMA runs were used for the analysis, whereas for EVI cells classified as rainfed croplands, the outputs of rainfed CYGMA runs were used.
Fig. 3.Geographical distribution of (a) rice yield data and (b)-(e) representative cells (red dots) at different threshold levels of
The R package randomForest (Liaw and Wiener, 2022) version 4.7-1.1 for random forest was used, which was originally developed by Breiman (2001) (See also Breiman and Cutler, 2004). Regressors were built to estimate Monfreda’s yield as an objective variable from the 14 explanatory variables listed in Table 1. (A unique exception that regressors were built from two explanatory variables will be written in Sect. 2.9.) Hyperparameters used in this study were same as the default settings (i.e., ntree=500, mtry=floor(14/3)=4 as regression, nodesize=5, and nPerm=1).
Table 1. Variables used in random forest in this study. One objective and 14 explanatory variables were used. Variables averaged, summed, or integrated over the growing period determined in Section 2.5 were indicated in the second column.
Variables expected to affect rice yield were selected as explanatory variables. Although some variables might be correlated with each other, RF can find robust solutions. Two importance metrices were introduced to judge which explanatory variable played crucial roles in yield estimation. (1) IncNodePurity measures an increment of information entropy by changing node splitting of regression tree from its ideal case. (2) IncMSE measures an increment of error by changing values of each explanatory variable. If an explanatory variable of interest is important, both metrics become high for the variable because a small change in the variable gives a large impact, i.e., a large increase in the metrics.
2.8 ValidationTo check the performance of the trained regressors, validation datasets should be prepared as independent data from the training datasets (Sect. 2.6). In this study, two validation methodologies were introduced.
First, out-of-bag (OOB) validation is a common validation methodology for RF. Since RF randomly chooses data from a training data set for building regressors, a certain part (approximately, a third) of the training data set is never used for it. Subsets of training data which were used and not used in building regressors were called in-bag (IB) and OOB, respectively. OOB data satisfy independence from IB data and can be utilized for validation. The mean of squared residual and percent of variance explained are common OOB validation metrics (e.g., Liaw and Wiener, 2002). The former measures differences between prediction (averaged over regressors) and observation, whereas the latter gives the coefficient of determination R2 evaluated only from an OOB subset.
In this paper, we adopted another validation methodology using “complete strangers” to the training data because this methodology is considered to be more suitable for our study which ultimately aims compilation of global yield data under limited data availability. Aside from some data sets like Monfreda et al. (2008), it is practically difficult to collect yield data geographically uniformly from all over rice-planted areas in the world. Rather, it is common that yield data are available only for some top rice-producing countries or for some specific areas. Then, it is obliged to estimate yield at sites far apart from available training data sites. However, the OOB validation methodology, the geographical distribution of OOB data is much the same as that of IB data because both are extracted from the same original training data set. Consequently, OOB sites are distributed in the vicinity of IB. The validation score with the OOB might be higher than the actual performance. A stricter validation methodology would be preferred for our study.
This was ensured by using different ranges of
In summary, for validation, representative Monfreda cells with
In other words,
The performance of different combinations of datasets was evaluated by changing
Table 2. Experimental design matrix of this study. Geographical distributions (before applying cropland masks) are shown in designated figures. The numbers of valid data (i.e., valid EVI cells) given as training or validation data sets were shown in square brackets.
We prepared another set of RF regressors built only from the two explanatory variables of geographical coordinates (i.e., longitude and latitude) and compared the results with those for the 14 variables. RF regressors built only from geographical coordinates can be treated as a complete spatial interpolation (or extrapolation) problem. In fact, longitude and latitude have often been chosen as explanatory variables for spatial interpolation in the literature (e.g., Jeong et al., 2016; da Silva Júnior et al., 2019; Guilpartet al., 2022; Wu et al., 2023).
Figure S4 shows the fitting performance of the RF regressors at different levels of
Both the training and estimated rice yields were distributed over a broader range for lower threshold levels (Table 3 and Fig. S4). In particular, the upper bound of the estimated yield increased markedly as the threshold
Table 3. Monfreda yield range of training data. Geographical distributions (before applying cropland masks) are shown in designated figures.
The results of OOB validation (when regressors were built from the IB training data and applied to its OOB data) are shown in Table 4. Both the percent of variance explained, which approximately gives R2 (coefficient of determination), and root-mean-square errors (RMSEs) showed sufficiently practical performance. RMSEs slightly decreased with decreasing the threshold level like T30.
Table 4. Statistical summary of validation results when out-of-bag (OOB) validation data were applied. The yield range of the training data are listed in Table 3. The metrics %Var and RMSE give the percent of variance explained (which approximately gives R2, the coefficient of determination) and the root-mean-square error, respectively.
The validation of regressors’ performances (built from the training data and applied to the validation data at different levels of
Fig. 4.Validation performance of the regressors when different training and validation datasets were used. Panels (a) and (b) show the validation results among the different validation datasets when the regressors were built from the common training dataset
Table 5. Statistical summary of validation results when validation data with different levels of
The features of the scatter plots differed among the different validation datasets (Fig. 4(a) and (b)) and different training datasets (Fig. 4(b)-(d)); in particular, the latter was crucial and determined the overall accuracy. First, the training dataset (
Next, the validation data (
These results show that when the threshold levels for training
To investigate regional validation performance, we set two regions, one with low yield (near the boundary of Thailand-Cambodia) and the other with high yield (near the lower reach of the Changjiang (Yangtze) River, China) and extracted validation data from Fig. 4. The results are shown in Figs. S6 and S7. The distribution lay near the 1:1 line but stretched over a line with a gentler slope for higher threshold levels of
The importance of explanatory variables is a metric of their contribution to yield estimation. This information helps judge which explanatory variables should be incorporated in building regressors or how regressors can be simplified by reducing the number of variables without deteriorating the overall performance.
Figure 5 shows the importance of the explanatory variables with different threshold levels of
Fig. 5.Importance of explanatory variables with different threshold levels of
The regressors built only from geographical coordinates performed well (Fig. 6 and Table 5(b); see Fig. S8), with RMSEs of 1.211-1.771 t ha-1, which was slightly larger than the corresponding case of 14 variables (Fig. 4 and Table 5(a)). The data lay near the 1:1 line but were more broadly scattered than in the case of the 14 variables (Fig. 4(b) and (d)). The characteristics observed upon changing
This advantage of the case of the 14 variables over that of the geographical coordinates was also confirmed by direct comparison of absolute misfits in yield (absolute values of misfit calculated as estimated yield minus Monfreda yield). In the scatter plots (Fig. 7), data points were more densely distributed below the 1:1 line than above. The slope of the regression line fitting to the data was < 1. These results indicated that absolute misfits for the 14 variables were smaller than those for the geographical coordinates. Thus, incorporating a combination of different data sources (census, satellite, and crop model output) for building RF regressors had an advantage in accuracy. Note that the number of valid data for the 14 variables was smaller than that for the geographical coordinates because yield could not be calculated when any value of explanatory variables was missing.
Fig. 6.Validation performance of regressors built only from the longitude and latitude with different threshold levels of (a)
Fig. 7.Absolute misfits in yield (i.e., absolute values of misfit calculated as estimated yield minus Monfreda yield) using regressors built only from the geographical coordinates (horizontal axis) versus ones built from the 14 variables (vertical axis). This distribution relative to the 1:1 line gives which regressors were superior (i.e., with smaller misfits) in terms of accuracy. The validation data used here are V3010. The threshold levels for training data are (a)
An observed tip was noted when Figs. 4 and 6 were compared. Data points gathering in vertically aligned lumps were observed in Fig. 4, whereas no such lumps were observed in Fig. 6. The lumps were primarily attributable to input variables other than geographical coordinates into regressors: When the regressors built only from the geographical coordinates, the estimated yield values were similar for different EVI cells in a single Monfreda cell because differences in the geographical coordinates were too small to deviate from the yield values. Thus, the data points in a single Monfreda cell fell into the same point in Fig. 6. However, when the regressors built from the 14 variables were used, the estimates were different among EVI cells in a single Monfreda cell because of the different EVI values or growing periods. Consequently, the data in a single Monfreda cell spread vertically. Fig. 7 also showed similar vertically aligned lumps.
3.6 Performance depending on location or distance from training data sitesTo determine whether the RF performance depends on the location or distance from training data, Figs. 8 and 9 were plotted. The minimum distance from the nearest training data sites to the validation sites (Panel (a)) tended to be large for archipelagoes or near the peripheries of rice-cropping areas. The misfits for the cases of 14 variables and geographical coordinates (Panels (b) and (c), respectively) showed similar geographical patterns. Despite the similarities between Panels (b) and (c), the differences in absolute magnitude (absolute values of (c) minus absolute values of (b)) showed that positive values outnumbered negative ones (Panel (d)), indicating that regressors incorporating the 14 variables outperformed at a larger number of sites. The advantage for the case of the 14 variables was also shown in Section S1.2 of the Supplementary Material, regardless of the distance from the nearest training data sites.
Fig. 8.Geographical distribution of (a) the minimum distance from the nearest training data sites at validation sites, (b) misfits (estimated yield minus Monfreda yield) for the case of 14 variables, (c) misfits for the case of geographical coordinate, and (d) differences in absolute magnitude of misfit (i.e., absolute value of (c) minus absolute value of (b)) for T90V3010.
Fig. 9.Geographical distribution of (a) the minimum distance from the nearest training data sites at validation sites, (b) misfits (estimated yield minus Monfreda yield) for the case of 14 variables, (c) misfits for the case of geographical coordinate, and (d) differences in absolute magnitude of misfit (i.e., absolute value of (c) minus absolute value of (b)) for T30V3010.
A comparison of corresponding panels between Figs. 8 and 9 showed how differences in training datasets affected the results. The distance for T30V3010 was generally smaller than that for T90V3010 because the training data T30 were more widely distributed globally and more densely distributed over rice-cropping areas (panels (a)). Biases also geographically clustered, but with much smaller magnitude for T30V3010 than T90V3010 (panels (b)).
Areas with relatively large misfits (i.e., biases) geographically clustered (Figs. 8 and 9). In each cluster, the magnitude of bias was relatively homogeneous. For example, in Fig. 8(b), a couple of clusters with negative large biases were observed around 30°N in China, whereas clusters with positive large biases were observed in Madagascar, the Philippines, and areas around Cambodia. These geographically clustered biases were attributable to ill-performance of RF: within each cluster, estimated yield values had less variability than actual yields. In fact, Fig. S6(c) for areas around Cambodia and Fig. S7(c) for areas around 30°N in China showed that data stretched horizontally. These panels correspond to regional extraction of data for the case of T90V3010 (Fig. 8(b)). These biases also improved with decreasing
RF is a powerful analysis tool for obtaining a relatively robust solution because the resultant solution is averaged over an ensemble set of yield values, each of which is estimated using a regressor (a series of regression trees). An ensemble mean hardly produces an ‘outlier’ because it tends to lie closer to the center of the data distribution. Therefore, it is reasonable for the RF yield values to lie within the yield range of the training data (Section 3.3).
RF requires the training data to be an unbiased subset of all possible data to be representative of the overall distribution - at least, the training data to cover a possible range of yield. If the training data are biased, the estimation will also be biased, as discussed in Section 3.3. Therefore, this issue was addressed by considering the global nature of the data used in this study (Sections 3.1, 3.3, and S1.1). The training data, T90, were geographically concentrated at low latitudes in Southeast Asia, where the yield ranged from 1.3-6.8 t ha-1 (Table 3). Even if such regressors were applied to data at mid-latitudes, where the actual yield ranged more widely (e.g., 1 to over 10 t ha-1 for V3010), the estimated yield would be bound to the range of the training data (Table 5). Instead, training dataset T30, geographically distributed over both low- and mid-latitudes across continents, covering a wider yield range. Since such training data are better representative of a possible yield range, regressors built from the data are applicable to various sites on Earth. This implies that, for global yield estimation using RF, collecting a smaller amount of training data per site but at a larger number of sites will achieve a higher performance than collecting a larger number of training data per site but only at a smaller number of sites, even if the total amount of training data remains the same.
Therefore, the selection of training data plays a crucial role in determining the overall accuracy of the yield estimation using RF. This might be a pitfall of RF because one might misunderstand as if RF automatically finds a robust solution anytime and anyhow regressors were built. The geographical coverage of the training data was focused on the spatial interpolation problems. However, more attention should be paid to the range of objective variables (yield, in this case). Recent studies have reported the ability of RFs to extrapolate outside the training data range (Berezowski and Chybicki, 2018; Hengl et al., 2018; Li et al., 2019a; Goulart et al., 2021; Ikeda and Kusaka, 2021; Silva et al., 2023).
It should be kept in mind that even if the training data were improperly sampled from a possible yield range, no alerts or warnings would be issued during the RF calculation. Since the calculation would normally be terminated in appearance, one would hardly perceive a deterioration of the estimated results. In published papers, one can find results showing that estimated yields were unnaturally bounded (e.g., Fig. 4 of Wu et al., 2023). We have no confidence but guess that some of these results might be attributable to the yield range of training data. This finding on yield coverage for the RF model is essentially the same as the applicability of conventional statistical models (e.g., multiple linear regressions), which depends on the range of the training data (Lobell and Burke, 2010; Parkes et al., 2019). Special attention must be paid to the representativeness of the training data in advance.
Last but not least, appropriate validation methodologies for one’s purpose should be introduced. In this paper, validation performance was evaluated using two methodologies. As we expected, nominal values of performance metrics using the OOB data were superior to ones using data at different
Spatial interpolation has been widely used to fill the spatial gaps between available data points. Moreover, it works effectively when the yield data are densely distributed over a target area. Recently, the number of studies using the RF as an advanced type of spatial interpolation has increased in various fields (e.g., Leirvik and Yuan, 2021; Sekulićet al., 2021). The performances of RFs have also been investigated through cross-comparisons among various methodologies (Li and Heap, 2011; Ohashi and Torgo, 2012; Sekulićet al., 2020). However, their nature could not be understood completely, particularly in terms of crop production. Therefore, learning more about the nature of the examples and lessons in agricultural studies is important. Consequently, the findings of this study regarding RF spatial interpolation are discussed from this viewpoint.
Longitudes and latitudes are often used as explanatory variables. This is rational because these variables scored the highest in importance among the 14 variables (Fig. 5). In the real world, crop yield sometimes changes across geographical structures, such as topographic lineaments (e.g., mountain ranges and valleys) or geographical boundaries of soil classification. Since the geographical coordinates are independent of any geographical structure, spatial interpolation using these variables cannot reproduce the structures unless the data are sufficiently densely distributed around the structures. If some variables are sensitive to the structure, the performance improves.
To compile yield data at global to regional scales, spatial gaps in original yield data must be filled. If only the longitude and latitude are chosen as explanatory variables (Section 3.5), RF will act as spatial interpolation. However, in contrast to most classical spatial interpolation techniques (e.g., inverse-distance weighting interpolation, polynomial interpolation, spline interpolation) which smoothly fill spatial gaps with continuous functions, RF spatial interpolation discontinuously fills the gaps with some jumps in values. This nature was clearly observed when RF spatial interpolation (or extrapolation) was applied to the entire Earth as an extreme case, regardless of its validity or feasibility (Fig. S9). The yield was estimated at a finer geographical resolution for areas where the training data were densely distributed (Fig. 3), but at a coarser resolution for areas where less training data were distributed. As a result, the interpolated surface was divided by blocky artifacts sectioned parallel to the longitude and latitude lines. This was attributable to the RF algorithm because of the switching to different nodes of the regression trees at some longitudes and latitudes, as already reported (e.g., da Silva Júnior et al., 2019). Global-to-regional yield data compiled using RF are prone to produce meaningless boundary structures like Fig. S9 over areas where training data are sparsely distributed. Data producers should call users’ attention to such artifacts not to misinterpret them.
Li et al. (2011) compared various interpolation methodologies, including RF, and found that RF and its related methodologies outperformed the others. They also found that, for an extrapolation problem, the estimations of RF were better than those of other methodologies. However, their analysis settings were slightly different from those adopted in this study in the following manner: (1) As an object variable, they dealt with the mud content ratio in weighted percentage, strictly ranging from 0 to 100%, whereas this study dealt with rice yield, which was not clearly confined to any range. (2) Training data were used heterogeneously but geographically sufficiently covering the target area (Fig. 1 of Li et al. (2011)), whereas training data in this study were geographically partly covering the target area, particularly for smaller
Crop yield also changes with human factors, such as cropping maneuvers, infrastructure, and the choice of rice varieties. In fact, crop yields sometimes interrupt changes across the borders of administrative sectors. These effects were partly incorporated in the cases of 14 variables via explanatory variables (e.g., EVI, soil fertility, and water shortage). Consecutively, incorporating satellite- or simulation-based variables into spatial interpolation helps improve the performance. A hybrid data source (census data, satellite observations, and numerical simulations) in combination with RF, as in this study, also helps produce global crop yield data.
We consider that the methodology of this study can be applied to yield estimation of crops other than rice because of the advantages in (1) filling spatial gaps of data by interpolation, (2) dampening unwilling spatial instability, and (3) making spatial boundary structures. Regarding (1), collecting cultural data from all the croplands is unrealistic. Spatial interpolation methodologies will also work for other crops. For (2), any observation data contain errors. Besides, phenology independently determined at each EVI cells has no spatial linkage even between adjacent EVI cells. Such spatially high variability that does not reflect actual signals of crop yield should be removed. Finally, regarding (3), a combination of different data sources can produce geographical boundary structures, which cannot be obtained by a simple interpolation but only smoothing the surface. Since yield is a fundamental metric in agriculture, not only for food crops but also for cash crops (Saeidi et al., 2022), these advantageous factors will also be welcomed to studies on other agricultural crops.
4.3 Possible error sources and room for improvement in yield estimationSeveral possible sources of error existed in rice yield estimation. First, the contamination of the EVI signals and imperfect land masks can be sources of error. Unconsciously, regressors may be trained using cells in which less rice is planted. However, creating a perfect rice cropping map is almost impossible (Section 2.6). Since it is rare for rice to be planted exclusively on land cells, EVI signals are somewhat contaminated by reflections from other land uses within a cell. Even if only rice is cultivated in a single EVI cell, a mixture of different timings of rice sowing/transplanting can disturb peak EVI timing, leading to a misinterpretation of the rice growing period. These factors may lower the importance of satellite-derived variables, such as the EVI, in the RF (Fig. 5).
Second, the choice of explanatory variables may also deteriorate overall accuracy. Intrinsically, an RF methodology is robust against the choice of variables because of their importance in building regressors. However, Li et al. (2011) reported the possibility that the inclusion of ill-performing variables may deteriorate accuracy via node division processes in RF. Silva et al. (2023) pointed out that since RF performance varies depending on crops, countries, and seasons, a well-performed set of explanatory variables change accordingly, across time and space.
Estimation of rice production over a certain area is important for food supply, but we should consider how to handle invalid data and biases before that. Regarding the former, remind that rice production from a land cell is evaluated as the product of the yield, cell area, and harvest area ratio. If any value of these variables is invalid (e.g., missing), rice production will inevitably be invalid. EVI cells with ill-determined phenology (e.g., rice would not mature) are also invalid. Regarding the latter, RF-estimated yields have biases in some regions (Figs. 8 and 9). Spatial aggregation of rice production over a certain extent of areas will sum up not only signals but also biases. Biases with a same magnitude will never be cancelled out by summation - just only cumulate. However, since the magnitude of bias decreased by using geographically densely distributed training data (Fig. 9 with respect to Fig. 8), the latter will be solved by lowering the threshold
There is also some room for improvement in the crop models. Crop models cannot fully incorporate all the factors in conventional cultural practices, such as fertilization timing/intensity, irrigation timing/intensity, cultural techniques, rice varieties, natural disasters, crop diseases, and insect damage. These factors vary significantly from place to place, even among farmers in the same community. Moreover, multiple rice crops per year are commonly grown in tropical and subtropical regions. Different rice varieties are often chosen there for different cropping seasons due to consideration of local requirements such as water availability, temperature conditions, day length, and commercial markets (e.g., Nguyen and Kawaguchi, 2002; Biswas et al., 2019; Iwahashi et al., 2021). Different rice varieties have different growing periods owing to their different sensitivities to temperature and day length (e.g., Yoshida, 1981; van Oort et al., 2011; Fukui et al., 2015). Besides, farmers also choose different irrigation timings/intensities or fertilization timings/intensities in a suitable manner, depending on the season or site. Consequently, the actual farm-level yield reflects various factors (e.g., Silva et al., 2022). Such fine-tuned simulation settings have rarely been incorporated into global crop model simulations. If the crop model’s performance is improved, the importance of these variables in RF will increase accordingly.
The actual rice yields revealed by plot surveys sometimes differ substantially from place to place, even within an administrative unit. Since the performance was evaluated in comparison with the reference Monfreda yield by assuming a uniform yield value over the unit, this could be another source of error.
Practically, a trade-off occurs among the accuracy of the estimated yield, computational load, and labor for preparing training data. Therefore, to improve the overall accuracy, the training data should be sampled uniformly and unbiasedly using the widest possible range of data. However, an increasing amount of training data would increase the computational load, especially when building regressors in the RF. The strict selection of pure rice cells free from contamination by other land uses requires more laborious work for data qualification.
In this study, rice yield was estimated using a combination of different data products (census data, satellite observations, and crop model outputs) on a global scale using the RF methodology. Crop calendars were determined for each EVI cell (500 m resolution) using satellite data. Geographical coordinates (longitude and latitude) played crucial roles as explanatory variables in terms of their importance. However, other variables derived from satellite observations or crop model simulations also contributed to performance improvements.
Since RF is a cutting-edge methodology applicable to interpolation problems, lessons from analysis must be collected and utilized. The following remarks were deduced:
A possible yield range should be covered by the training data. Otherwise, the estimated yield will be wrongly confined to a narrow range, resulting in performance deterioration. This is because RF hardly estimates the yield values beyond the yield range of the training data. Therefore, more attention should be paid to the yield range and geographical coverage of the training data. If yield data are collected over areas where rice is densely planted, they will function less as training data for global estimation because of the limited yield range.
To further improve the yield estimation, the quality of satellite- or simulation-related variables should be improved to increase their importance (Fig. 5). By incorporating these variables, more benefit from their intrinsic nature can be achieved, i.e., the high geographical resolution of satellite-based variables, global coverage of numerical simulation, and determination of the crop calendar, which are all useful in agriculture. Accurate land-use data can also reduce errors by improving the training process.
To build versatile RF regressors, they should obtain high generalization performance. This is confirmed through validation. In agricultural application, it sometimes requires that yield must be estimated over a wide geographical extent from observation data over a limited geographical extent. However, a common OOB validation methodology might overrate the generalization performance because the geographical distribution of OOB validation data is much the same as that of in-bag training data. Our validation methodology based on
Although the rice yield was the focus of this study, the analytical method can be applied to other crops. Given the growing global population and the risk of unstable agricultural production due to meteorological factors and climate change, it is important to acquire accurate information on global crop production. Since yield estimation involves a very complex science linking nature (e.g., meteorological, geographical, and edaphic factors) and human activities (crop maneuvers, infrastructure, markets, and farmers’ choices), ML techniques have high potential for analyzing and solving agricultural problems of the present era.
This study was financially supported by JSPS KAKENHI (grant number: 22H00577). We acknowledge two anonymous reviewers for improving the manuscript. MODIS satellite data were downloaded from the United States Geological Survey (USGS) within the framework of Distributed Active Archive Centers directed by the National Aeronautics and Space Administration (NASA). Rice yield data compiled by Monfreda et al. (2008) were provided via EarthStat (http://www.earthstat.org/) operated by the Global Landscapes Initiative at the University of Minnesota’s Institute on the Environment and the Land Use and Global Environment lab at the University of British Columbia.