農業気象
Online ISSN : 1881-0136
Print ISSN : 0021-8588
ISSN-L : 0021-8588

この記事には本公開記事があります。本公開記事を参照してください。
引用する場合も本公開記事を引用してください。

Toward improving global rice yield reference dataset compilation through machine learning: Insights from training data selection and random forest analysis
Yoshimitsu MASAKIToshichika IIZUMIToru SAKAIKei OYOSHI
著者情報
ジャーナル オープンアクセス HTML 早期公開
電子付録

論文ID: D-24-00033

この記事には本公開記事があります。
詳細
Abstract

 Machine learning (ML) techniques have been increasingly used to estimate crop yields at scales ranging from on-site to global. Since ML techniques are data-driven approaches, it is empirically known that the performance of a specific ML algorithm depends on the manner in which the training dataset is compiled. However, few studies have quantitatively evaluated the performance. In this study, global rice yields were estimated through a random forest (RF) methodology. Performance dependency of RF on training data was examined by a comparison of estimated yields using different training datasets covering different yield ranges and geographical extents. First, 14 explanatory variables collected from different sources (satellite vegetation, meteorology, and geographical location data) were used for building RF regressors. The crop calendar was determined from a combination of satellite vegetation and crop model simulation. Next, RF regressors were trained to give census-based rice yields (used as reference yields) from training datasets of the 14 explanatory variables. By applying the RF regressors to validation datasets, misfits between estimated and the reference yields were evaluated. RF reproduced rice yields, but the accuracy depended on the training data. Yields beyond the yield range of the training data could not be reproduced by RF. This indicates that the yield range of the training data determined the possible range of estimated yield. Among the 14 variables, geographical coordinates (longitude and latitude) ranked the highest importance, i.e., played a crucial role in estimating yields. The RF regressors built from the 14 variables outperformed those built only from the geographical coordinates in accuracy but with limited advantage. We concluded that (1) choosing training data to cover all possible yield ranges of the target rice-cropping areas was crucial for accurate yield estimation using RF and (2) incorporating satellite and simulation data was advantageous for building high-performance RF regressors.

1. Introduction

Crop production is one of the most fundamental datasets in agricultural censuses because it directly affects the food supply that sustains healthy human lives and activities in good nutritional conditions. Since crop production varies from place to place in relation to meteorological conditions, environments, and cultural practices, global crop data have been widely used in agricultural studies (e.g., Monfreda et al., 2008). Currently, crop production data at high geographical resolution, finer than 10 km (e.g., Wu et al., 2023), has become important. Moreover, to achieve higher geographical resolution, understanding the spatially varying nature of crop production is necessary, raising technological challenges in the development of new analytical schemes.

Several data sources are available for understanding crop production; however, they have both merits and demerits. For example, the agricultural censuses or statistical yearbooks compiled by national/local governments provide postharvest crop production data, which was collected systematically from each administrative unit. Normally, reliable official data are provided, but the quality of the data from some developing countries is questionable (Iizumiet al., 2021; Kim et al., 2021). Since census data are not provided through the Internet in some countries, data access is another problem. Enhancing the geographical resolution to be finer than the geographical extent of administrative units becomes also difficult. Secondly, optical and radar sensors onboard Earth observation satellites routinely scan Earth’s surface and retrieve information on crop vegetation, which can be empirically converted into crop production (e.g., Lobell et al., 2015; Azzari et al., 2017; Deines et al., 2021). This approach has merits in obtaining global real-time crop growth at a fine spatial resolution (e.g., subkilometers), but optical observations are often affected by cloud cover. Moreover, numerical simulations using a crop model are feasible for obtaining the geographical distribution of crop production. The accuracy of numerical simulations has been improved with the aid of reliable meteorological forcing datasets; however, the parameters of the crop models must be well tuned prior to the analysis. Ecophysiological processes that are not incorporated into a model can emerge as an error source (Li et al., 2019b). Among all data sources, only census data are directly linked to crop production observations, whereas satellite-based and simulated data include crop production estimates.

Spatial homogeneity is also of great concern in the compilation of global data. The geographical distribution of crop fields is intrinsically heterogeneous because crops are grown under environmentally (e.g., climate, water availability, and soil fertility) and socially (e.g., labor forces, infrastructure, and proximity to food-demanding populated areas) suitable conditions. Additionally, large geographical gaps between the locations of available census data increase the heterogeneity. To fill in these data gaps, spatial interpolation or extrapolation must be relied upon; however, these techniques are at risk of magnifying unintentional errors, especially at long distances from available data sites.

Recently, cutting-edge machine learning (ML) techniques as well as traditional interpolation techniques have been used for spatial interpolation. ML is advantageous for analyzing complex systems and seeking robust solutions using a data-driven approach. Nowadays, various ML techniques run on personal computers as software packages. A random forest (RF) is one such technique that is advantageous for obtaining accurate solutions through training regressors (internal calculation units to be adjusted by building a series of regression trees from training data, i.e., supervised data). Since robust solutions can be found even in nonlinear systems, RF has widely spread among researchers in various fields of study (Prasad et al. 2006; Li et al., 2011; da Silva Júnior et al., 2019).

In estimating crop yield or production, ML has rapidly gained popularity, especially in these several years (Clark et al., 2023). Target areas range widely, from field-scale (Mariano and Mónica, 2021; Pinto et al., 2022; Amankulova et al., 2023), multiple states (Sakamoto, 2020; Schwalbert et al., 2020; Wang et al., 2020; Medina et al., 2021; Bowden et al., 2023; Dhillon et al., 2023), country-level (Cai et al., 2019; Folberth et al., 2019; Gómez et al., 2021), to the global scale (Jeong et al., 2016; Vogel et al., 2019). In most of these studies, multiple data sources (e.g., remote sensing, weather) as well as yield data were incorporated into ML by making the most of its ability of finding solutions for nonlinear systems. RF is the most frequently used ML technique for yield prediction (Clark et al., 2023) and promising in terms of high accuracy (Jeong et al., 2016; Gómez et al., 2021). RF was also utilized in agricultural studies other than crop production, e.g., cropland distribution (Phalke et al., 2020). However, the nature of RF interpolation techniques for yield estimation remains unclear, e.g., whether the accuracy depends on the choice of training data.

To summarize these circumstances, a combination of multiple data sources and an RF-based spatial interpolation methodology were considered keys to the compilation of crop datasets with higher accuracy at a finer geographical resolution than conventional datasets. To realize the compilation of yield data along this course, understanding the nature of the analysis methods is important because this has not yet been fully disclosed in the literature. Additionally, crop data are spatially heterogeneous. Therefore, it would be beneficial to provide readers with precautions and lessons learned by analyzing such data.

In this study, the crop yield was estimated using a combination of different data sources (census data, satellite data, and crop model output) and evaluated for their performance. Furthermore, the performance of RF in estimating crop yield as a complete spatial interpolation case was compared with the performance of RF incorporating other variables that affect yield. To focus on the analysis target, the spatially varying nature of crop yield was prioritized over temporally varying yields using yield data averaged over multiple years. In particular, rice yield was focused on because rice is one of the major staple food crops and paddy fields are heterogeneously distributed worldwide. This study aimed to enrich practical expertise and collect technical tips for the application of RF in agricultural studies.

2. Methodology

2.1 Framework

The main tasks of this study were twofold: (1) to examine the feasibility of rice yield estimation using RF with a combination of multiple data sources (schematics presented in Fig. 1); and (2) to understand the applicability of RF as a spatial interpolation method. During the training process, RF regressors were built to estimate the rice yield (objective variable) from a set of candidate factors (explanatory variables) that contribute to the yield. In the validation process, regressors were applied to the validation data and compared with the estimated rice yield. By evaluating the performances of various types of regressors, expertise on building regressors was gained, and it was clarified which variables could achieve better performance.

For task (2), the performances of RF using 14 explanatory variables (i.e., solving a problem as shown in Fig. 1) and RF using only geographical coordinates (i.e., solving a problem as a complete spatial interpolation) were compared (see Section 2.9).

Fig. 1.Schematic diagram of the analysis used in the study. Abbreviations (intgr.) and (avr.) represent integrated and averaged variables, respectively, throughout the growing period of rice. See Sections 2.2-2.8 for details.

2.2 Global yield data

In this study, global census-based rice data compiled by Monfreda et al. (2008) were used as a reference (i.e., both as supervised data in the training process and as reference data to be compared with the estimated data in the validation process). Monfreda et al. (2008) provided global grid data on both yield and harvesting area ratio (R) for 175 distinct crops, including rice, based on census data around 2000 (average of 1997-2003; i.e., the data were treated as a single snapshot at 2000), with a spatial resolution of 5 min in longitude and latitude (hereafter, called the Monfreda cell). They used census data of local administrative units up to one level below the national level for 150 countries and up to two levels below the national level for 73 countries. Thus, a single yield value was saved for multiple Monfreda cells belonging to the same local administrative unit. Since rice was harvested multiple times per year, the harvested areas were summed up. Thus, R can sometimes exceed 100%.

In this manner, the yield data saved in adjacent Monfreda cells were not always independent of each other. In the case of an administrative unit with a vast geographical extent which covers multiple Monfreda cells, rice-growing environments (e.g., climate, soil, and water availability) largely change within the unit. This might deteriorate the RF performance because regressors would be trained to irrationally output the same yield value among cells under different environmental conditions. We considered that it would be adequate to choose only one representative Monfreda cell from each administrative unit and to train RF regressors using a collection of representative Monfreda cells over the globe.

Representative Monfreda cells were chosen in two steps: (1) First, the geographical extent (denoted as E) with the same rice yield value was identified. This E probably corresponds to an administrative unit. (2) Among Monfreda cells with various R values in each E, a cell with the highest harvest area ratio ( R h i g h ) was chosen as a representative Monfreda cell. This means that rice was expected to be most densely planted in the representative Monfreda cell among the cells in E. Thanks to this selection, the Enhanced Vegetation Index (EVI) signals (see Section 2.3 in detail) recorded for the representative cell were expected to be least contaminated by other plants or other land uses. As a side note on (1), cells containing slightly different values were found along the boundary of an administrative unit. They were treated as cells in different administrative units in this study because it was unclear whether the values were actually recorded in original census data or artificially produced by weighted geographical interpolation of two yield values of adjoining administrative units.

2.3 Satellite data

Global crop growth has been monitored every 16 days with an EVI observed by the satellite-board Moderate Resolution Imaging Spectroradiometer (MODIS) at approximately 500 m resolution (hereafter referred to as an EVI cell). The MOD13A1 Version 6.1 dataset (Didan et al., 2015) was used in this study.

Data from five-years (2000-2004) were used in this study. Ideally, the data period must match that of the reference (i.e., Monfreda’s) yield data (typically 1997-2003). However, the MOD13A1 Version 6.1 EVI dataset covered only the period after 2000. Additionally, occasionally missing data were observed during the initial phase of the data period. To secure enough valid observations to dampen the year-to-year variation in rice yield, the analysis period was set to 2000-2004.

A Savitzky-Golay filter (Savitzky and Golay, 1964; some coefficients were corrected by Steinieret al., 1972) was used in a seven-point (i.e., the center point and three points on one side) quadratic polynomial form to smooth the EVI time series. The Savitzky-Golay filter is advantageous for retrieving vegetation signals from observed EVI time series (Chen et al., 2004). No filter was applied for geographical extent.

2.4 Crop model data

For the crop model simulation, the Crop Yield Growth Model with Assumption on Climate and Socioeconomy (CYGMA) (Iizumiet al., 2017) was used. CYGMA was run globally with a meteorological forcing dataset using the WATCH Forcing Data methodology applied to ERA5 reanalysis data (WFDE5) (Cucchi et al., 2020) at a 0.5° resolution. Notably, the geographical resolutions of CYGMA and WFDE5 were the coarsest among the datasets used in this study; therefore, the same CYGMA output values were used for multiple Monfreda or EVI cells belonging to the same CYGMA cell. In the model simulation, uncertainties in the crop calendar can affect the simulation results (i.e., crop growth and production); considering the uncertainties, 73 runs were conducted by changing the sowing dates from the day of year (DOY) = 1 (January 1st) to DOY = 365 (December 31st) by five days, as in Iizumiet al. (2021), and one from them was selected (see Section 2.5). The model outputs the yield, biomass, flowering date, harvest date, and five stress factor indices (cold stress, heat stress, nitrogen shortage, water shortage, and water excess). Two parallel sets of CYGMA simulations were performed for each CYGMA cell under rain-fed and irrigated conditions, and one of them was selected according to the cropland type (Section 2.6).

For comparison with the Monfreda yield data, various output variables were simply averaged over 2000-2004, which is similar to the EVI data period (Section 2.3). As some variables have nonlinearity to yield, the validity of simply averaging the indices over the years might be arguable. For example, stress indices are sometimes rather multiplicative than additive in nature. Under compound effects of multiple stresses (e.g., heat and drought) or extremely high stresses, yield loss will be larger than a simple sum of stress indices. However, strict evaluation of such nonlinear responses is very complicated, a simple arithmetic mean was adopted for simplicity in this study.

2.5 Determination of the growing period

In principle, the growing period of rice can be determined from the EVI time series by reading the rising and falling times from or onto the base level. However, this is impractical because the rising and falling profiles cannot be sharp owing to time filtering and observation noise. In contrast, the peak timing at which the maximum EVI was recorded could be determined more precisely than the rising/falling timing, even for time-filtered profiles. In this study, with the help of multiple runs of CYGMA simulations, the growing period (the period from sowing to harvesting) was determined in a hybrid manner using both satellite observations and simulations (see Fig. 2).

First, among the 73 CYGMA runs of different sowing dates, one was sought such that the flowering date calculated with the CYGMA model was closest to the maximum EVI date, because the maximum EVI is expected to be achieved at the flowering date of rice. Second, the sowing date was read from the run settings, and the harvest date was recorded for each run. These data were determined from the EVI time series.

Fig. 2.Scheme for determining growing period with a combination of satellite observation and crop model simulation. The sowing and harvest dates of observed EVI are regarded as those of a simulation run of which the flowering date is closest to the maximum EVI date.

2.6 Preparation of training data for random forest

Training and validation datasets must be independent each other. In this study, we discriminated between training and validation data by the value of R h i g h of representative Monfreda cell. That is, a range of R h i g h for validation data should be chosen not to overlap the range for the training data to guarantee the independence.

Practically, the values of R h i g h closely link to the yields and geographical locations (see details for Section S1.1 in Supplementary Information). The values of R h i g h determined for each E ranged almost from zero (no rice is planted in a Monfreda cell) to unity (rice is planted all over a Monfreda cell) worldwide. Generally, low latitudes including monsoon Asia tended to have high R h i g h values, whereas other areas tended to have low R h i g h values (Figs. S1(c) and S3). On the other hand, rice yields were generally low at low latitudes and gradually increased with the latitude (Fig. S1(e)). As a result, datasets with different ranges of R h i g h covered different yield ranges (Fig. S1(a)).

Therefore, the discrimination of data by the value of R h i g h meant two-fold in this study. First, this discrimination was roughly equivalent to the discrimination by the yield values. Second, the threshold level R h i g h T for R h i g h was set. If a higher R h i g h T value was set, the training data were almost entirely composed of rice fields; i.e., EVI signals were mostly from rice fields and less contaminated. On the contrary, if a lower R h i g h T value was set, areas where rice is less cultivated might be included at the risk of contamination of EVI signals from non-rice land surfaces.

A training dataset was prepared in two steps: (1) screening Monfreda cells by setting the lowest threshold of R h i g h , and (2) excluding EVI cells that were not croplands. First, representative Monfreda cells satisfying Eq. (1) were selected worldwide,

R h i g h R h i g h T                (1)

then, several training datasets were prepared by varying R h i g h T (Fig. 3).

Next, the non-cropland EVI cells were excluded using the dataset called Global Rainfed, Irrigated, and Paddy Croplands (GRIPC) (Salmon et al., 2015). Notably, 400 EVI cells were present in a single Monfreda cell because geographical resolutions of EVI and Monfreda cells were approximately 500 m and 10 km, respectively. In GRIPC, croplands were classified into three categories (rainfed, irrigated, and paddy croplands) at approximately 500 m resolution, matching the EVI resolution. EVI cells classified into any of the three cropland categories were selected (i.e., EVI cells classified as non-cropland were excluded) for further analysis. Ideally, selecting only paddy croplands would be suitable for rice analysis; however, in practice, perfectly discriminating paddy cropland cells from wetland/marsh cells is a difficult task in land use classification using satellite data because both land uses are inundated for a certain period of the year and have an annual cycle of plant growth. Furthermore, upland rice is grown in rain-fed environments. By considering the risk of misclassification of croplands, a safe margin was considered by using GRIPC data as a “cropland” mask to exclude non-cropland cells (such as urban buildings, forests, and water surfaces).

The GRIPC dataset was also used to discriminate between the rainfed and irrigated CYGMA runs (Section 2.4). For EVI cells classified as irrigated or paddy croplands, the outputs of irrigated CYGMA runs were used for the analysis, whereas for EVI cells classified as rainfed croplands, the outputs of rainfed CYGMA runs were used.

Fig. 3.Geographical distribution of (a) rice yield data and (b)-(e) representative cells (red dots) at different threshold levels of R h i g h T : (b) R h i g h T =90%, (c) R h i g h T =50%, (d) R h i g h T =30%, and (e) R h i g h T =10%. All the data were originally compiled by Monfreda et al. (2008). Training data T90, T50, T30, and T10 were obtained by applying a croplands mask (Section 2.6) to (b)-(e), respectively. See Section S1.1 in the Supplementary Materials.

2.7 Random forest

The R package randomForest (Liaw and Wiener, 2022) version 4.7-1.1 for random forest was used, which was originally developed by Breiman (2001) (See also Breiman and Cutler, 2004). Regressors were built to estimate Monfreda’s yield as an objective variable from the 14 explanatory variables listed in Table 1. (A unique exception that regressors were built from two explanatory variables will be written in Sect. 2.9.) Hyperparameters used in this study were same as the default settings (i.e., ntree=500, mtry=floor(14/3)=4 as regression, nodesize=5, and nPerm=1).

Table 1. Variables used in random forest in this study. One objective and 14 explanatory variables were used. Variables averaged, summed, or integrated over the growing period determined in Section 2.5 were indicated in the second column.

Variables expected to affect rice yield were selected as explanatory variables. Although some variables might be correlated with each other, RF can find robust solutions. Two importance metrices were introduced to judge which explanatory variable played crucial roles in yield estimation. (1) IncNodePurity measures an increment of information entropy by changing node splitting of regression tree from its ideal case. (2) IncMSE measures an increment of error by changing values of each explanatory variable. If an explanatory variable of interest is important, both metrics become high for the variable because a small change in the variable gives a large impact, i.e., a large increase in the metrics.

2.8 Validation

To check the performance of the trained regressors, validation datasets should be prepared as independent data from the training datasets (Sect. 2.6). In this study, two validation methodologies were introduced.

First, out-of-bag (OOB) validation is a common validation methodology for RF. Since RF randomly chooses data from a training data set for building regressors, a certain part (approximately, a third) of the training data set is never used for it. Subsets of training data which were used and not used in building regressors were called in-bag (IB) and OOB, respectively. OOB data satisfy independence from IB data and can be utilized for validation. The mean of squared residual and percent of variance explained are common OOB validation metrics (e.g., Liaw and Wiener, 2002). The former measures differences between prediction (averaged over regressors) and observation, whereas the latter gives the coefficient of determination R2 evaluated only from an OOB subset.

In this paper, we adopted another validation methodology using “complete strangers” to the training data because this methodology is considered to be more suitable for our study which ultimately aims compilation of global yield data under limited data availability. Aside from some data sets like Monfreda et al. (2008), it is practically difficult to collect yield data geographically uniformly from all over rice-planted areas in the world. Rather, it is common that yield data are available only for some top rice-producing countries or for some specific areas. Then, it is obliged to estimate yield at sites far apart from available training data sites. However, the OOB validation methodology, the geographical distribution of OOB data is much the same as that of IB data because both are extracted from the same original training data set. Consequently, OOB sites are distributed in the vicinity of IB. The validation score with the OOB might be higher than the actual performance. A stricter validation methodology would be preferred for our study.

This was ensured by using different ranges of R h i g h for training and validation data; choosing R h i g h less than the lower limit of the training data ( R h i g h < R h i g h T ). Thus, the validation data were “complete strangers” to the training data and guaranteed independence from them. As a result, regressors could be validated more critically because we validated over cells where less rice was planted. Besides, the validation data were geographically more widely distributed than the training data (Fig. S3). This mimicked the situation of where RF regressors were built from a limited number of training data. Simultaneously, the validation data were expected to cover a broader yield range than the training data (Fig. S1(a) and Section 2.6).

In summary, for validation, representative Monfreda cells with R h i g h satisfying Eq. (2) were selected.

R h i g h V 1 R h i g h < R h i g h V 2 R h i g h T             (2)

In other words, R h i g h is bounded by both the lower R h i g h V 1 and upper R h i g h V 2 limits. Similar to the training dataset (Section 2.6), GRIPC cropland masks were used to retrieve cropland EVI cells from 400 EVI cells belonging to a representative Monfreda cell. The yield was estimated by applying trained regressors to the retrieved cropland EVI cells, and the performance of the estimated yields was determined by comparing them with reference to Monfreda’s yields.

The performance of different combinations of datasets was evaluated by changing R h i g h T for the training data and R h i g h V 1 and R h i g h V 2 for the validation data. In this study, four experimental cases, as listed in Table 2, were established. The experiment name was given in the form of TxxVyyzz, where T denotes the threshold level of the training data, V denotes the range limits of the validation data, and the two-digit numbers xx, yy, and zz following T and V indicate the harvest area ratio (in percentage). The numbers of valid data (i.e., valid EVI cells) were also shown in Table 2. For example, T90V3010 indicates an experiment with R h i g h T =90%, R h i g h V 2 =30%, and R h i g h V 1 =10%; i.e., R h i g h ≥ 90% was used as training data (N=31200) and 10% ≤ R h i g h < 30% as validation data (N=172734). Among the experiment cases, T90V9010 and T90V3010 were challenging ones to achieve high accuracy because regressors were trained by the dataset T90 covering only a limited geographical extent (Fig. 3(b)) but estimated sites for validation were distributed over wide geographical extents (Fig. S3(b) for V9010, Fig. S3(c) for V3010). We believed such tough experiment settings would help to critically reveal the veiled nature of random forest methodology.

Table 2. Experimental design matrix of this study. Geographical distributions (before applying cropland masks) are shown in designated figures. The numbers of valid data (i.e., valid EVI cells) given as training or validation data sets were shown in square brackets.

2.9 Using random forest as spatial interpolation

We prepared another set of RF regressors built only from the two explanatory variables of geographical coordinates (i.e., longitude and latitude) and compared the results with those for the 14 variables. RF regressors built only from geographical coordinates can be treated as a complete spatial interpolation (or extrapolation) problem. In fact, longitude and latitude have often been chosen as explanatory variables for spatial interpolation in the literature (e.g., Jeong et al., 2016; da Silva Júnior et al., 2019; Guilpartet al., 2022; Wu et al., 2023).

3. Results

3.1 The training performance of regressors with different levels of R h i g h

Figure S4 shows the fitting performance of the RF regressors at different levels of R h i g h T and whether the yield estimated by the regressor reproduced the Monfreda yields (training data). The data points lay along the 1:1 line, implying that the regressors were almost perfectly trained to estimate the yield. Although training data with lower threshold levels, such as T30, were anticipated to increase the risk of being contaminated with non-rice fields, the results showed successful performance.

Both the training and estimated rice yields were distributed over a broader range for lower threshold levels (Table 3 and Fig. S4). In particular, the upper bound of the estimated yield increased markedly as the threshold R h i g h T decreased. This reflects the geographical tendencies of rice yield; from low latitudes to mid-latitudes, the Monfreda yield increases (Fig. S1(e)(f)), whereas R h i g h decreases (Fig. S1(c)(d)). See also Fig. 3 for geographical distribution.

Table 3. Monfreda yield range of training data. Geographical distributions (before applying cropland masks) are shown in designated figures.

3.2 Validating the performance of regressors with OOB data

The results of OOB validation (when regressors were built from the IB training data and applied to its OOB data) are shown in Table 4. Both the percent of variance explained, which approximately gives R2 (coefficient of determination), and root-mean-square errors (RMSEs) showed sufficiently practical performance. RMSEs slightly decreased with decreasing the threshold level like T30.

Table 4. Statistical summary of validation results when out-of-bag (OOB) validation data were applied. The yield range of the training data are listed in Table 3. The metrics %Var and RMSE give the percent of variance explained (which approximately gives R2, the coefficient of determination) and the root-mean-square error, respectively.

3.3 Validating the performance of regressors at different levels of R h i g h

The validation of regressors’ performances (built from the training data and applied to the validation data at different levels of R h i g h ) is shown in Fig. 4 and Table 5(a). The geographic distribution of estimated yield is shown in Fig. S5. For all the panels of Fig. 4, the data lay along the 1:1 with RMSEs ranging from 1.064 to 1.592 t ha-1, but the distribution was more broadly scattered than that for the training performance (Fig. S4 or Table 3), especially stretching horizontally.

Fig. 4.Validation performance of the regressors when different training and validation datasets were used. Panels (a) and (b) show the validation results among the different validation datasets when the regressors were built from the common training dataset R h i g h T =90% (T90). Panels (b)-(d) show the validation results among the different training datasets when the common validation dataset R h i g h V 2 =30% and R h i g h V 1 =10% (V3010) was used. Experiment names are (a) T90V9010, (b) T90V3010, (c) T50V3010, and (d) T30V3010, respectively (Table 2). Superposed color tiles indicate the relative density (≥0.02%) of data points.

Table 5. Statistical summary of validation results when validation data with different levels of R h i g h were applied. The yield range of the training data are listed in Table 3. The metrics R2 and RMSE give the coefficient of determination and the root-mean-square error, respectively.

The features of the scatter plots differed among the different validation datasets (Fig. 4(a) and (b)) and different training datasets (Fig. 4(b)-(d)); in particular, the latter was crucial and determined the overall accuracy. First, the training dataset ( R h i g h T =90%) and the lower bound of the validation datasets ( R h i g h V 1 =10%) were fixed and validation performance was checked by changing R h i g h V 2 (Fig. 4(a) and (b)). For the T90V3010, the data stretched horizontally rather than diagonally. The upper bound of the estimated yield (Table 5(a)) was comparable to that of the training yield data T90 (Fig. S4(a) and Table 3). The amount of validation data changed with the change in the value of R h i g h V 2 ; however, the upper bound of the estimated yield remained almost unchanged.

Next, the validation data ( R h i g h V 1 =30% and R h i g h V 2 =10%; V3010) were fixed and validation performance was checked by changing the threshold level of R h i g h T (Fig. 4(b)-(d)). By using lower threshold levels of R h i g h T , the data points lay more closely along the 1:1 line, increasing the R2 metric. Note that the horizontal range of the data (i.e., Monfreda yield) was unchanged for all cases because the common validation data V3010 were used. However, the range of the estimated yield varied with R h i g h T ; the higher the adopted R h i g h T , the narrower the range of the estimated yield. Consequently, for a higher R h i g h T , the estimated yield was underestimated for cases with higher Monfreda yields and overestimated for cases with lower Monfreda yields (i.e., the slope was gentler than the 1:1 line). The range of the estimated yield (Table 5(a)), which changed with the threshold levels of R h i g h T , was comparable to (and never beyond) the training data yield range (Fig. S4(a) and Table 3).

These results show that when the threshold levels for training R h i g h T decrease, the range of the estimated yield gradually broadens (in particular, the upper bound increases), resulting in an improvement in the validation performance. In other words, even if yield data for training were collected from areas where rice was densely planted, the training data would have been inadequate for global yield estimation. As we already mentioned in Sections 2.6 and S1.1, the mutual relationship between three variables (Monfreda yield, R h i g h T , and the latitude) existed. Thus, the yield range of training data depended on the threshold levels of R h i g h T . Therefore, to achieve high performance in yield estimation using RF, training data should be collected from the area where the yield values are as broadly distributed as possible to reproduce the actual yield values. In other words, training data should be collected from as wide a latitude range as possible.

To investigate regional validation performance, we set two regions, one with low yield (near the boundary of Thailand-Cambodia) and the other with high yield (near the lower reach of the Changjiang (Yangtze) River, China) and extracted validation data from Fig. 4. The results are shown in Figs. S6 and S7. The distribution lay near the 1:1 line but stretched over a line with a gentler slope for higher threshold levels of R h i g h T . Typically, regressors built from T90 returned almost same yield values despite a wide range of Monfreda yield (Fig. S7(b)(c)). The performance improved with decreasing R h i g h T (Fig. S7(d)(e)).

3.4 Importance of the explanatory variables

The importance of explanatory variables is a metric of their contribution to yield estimation. This information helps judge which explanatory variables should be incorporated in building regressors or how regressors can be simplified by reducing the number of variables without deteriorating the overall performance.

Figure 5 shows the importance of the explanatory variables with different threshold levels of R h i g h T using the two metrics. Counterintuitively, yield-related variables (e.g., CYGMA yield, CYGMA biomass, and EVI integrated over the growing period) were less important. The variables describing geographical coordinates (longitude and latitude) showed the highest importance. The growing period and stress index of nitrogen shortage followed them for the node purity metric, whereas meteorological variables (precipitation and air temperature) followed them for the mean squared error metric. For higher thresholds of R h i g h T , the geographical coordinates scored far higher importance than the other variables.

Fig. 5.Importance of explanatory variables with different threshold levels of R h i g h T of the training data: (a) and (d) R h i g h T =90% (T90); (b) and (e) 50% (T50); and (c) and (f) 30% (T30). Two types of importance metrics, (a)-(c) node purity and (d)-(f) mean squared error, were evaluated. Importance increases rightward. Abbreviations of variables on the vertical axis are defined in Table 1.

3.5 Contribution of geographical coordinates to the estimated rice yield performance

The regressors built only from geographical coordinates performed well (Fig. 6 and Table 5(b); see Fig. S8), with RMSEs of 1.211-1.771 t ha-1, which was slightly larger than the corresponding case of 14 variables (Fig. 4 and Table 5(a)). The data lay near the 1:1 line but were more broadly scattered than in the case of the 14 variables (Fig. 4(b) and (d)). The characteristics observed upon changing R h i g h T were similar to the case of the 14 variables.

This advantage of the case of the 14 variables over that of the geographical coordinates was also confirmed by direct comparison of absolute misfits in yield (absolute values of misfit calculated as estimated yield minus Monfreda yield). In the scatter plots (Fig. 7), data points were more densely distributed below the 1:1 line than above. The slope of the regression line fitting to the data was < 1. These results indicated that absolute misfits for the 14 variables were smaller than those for the geographical coordinates. Thus, incorporating a combination of different data sources (census, satellite, and crop model output) for building RF regressors had an advantage in accuracy. Note that the number of valid data for the 14 variables was smaller than that for the geographical coordinates because yield could not be calculated when any value of explanatory variables was missing.

Fig. 6.Validation performance of regressors built only from the longitude and latitude with different threshold levels of (a) R h i g h T =90% (T90V3010) and (b) R h i g h T =30% (T30V3010). The validation data considered here are V3010. Superposed color tiles indicate the relative density (≥0.02%) of data points.

Fig. 7.Absolute misfits in yield (i.e., absolute values of misfit calculated as estimated yield minus Monfreda yield) using regressors built only from the geographical coordinates (horizontal axis) versus ones built from the 14 variables (vertical axis). This distribution relative to the 1:1 line gives which regressors were superior (i.e., with smaller misfits) in terms of accuracy. The validation data used here are V3010. The threshold levels for training data are (a) R h i g h T =90% (T90V3010) and (b) R h i g h T =30% (T30V3010). Superposed color tiles indicate the relative density (≥0.02%) of data points. When a regression line passing through the origin fits, the slope and its standard deviation (in parentheses) are also indicated. See also Section S1.2 for distance-dependent characteristics.

An observed tip was noted when Figs. 4 and 6 were compared. Data points gathering in vertically aligned lumps were observed in Fig. 4, whereas no such lumps were observed in Fig. 6. The lumps were primarily attributable to input variables other than geographical coordinates into regressors: When the regressors built only from the geographical coordinates, the estimated yield values were similar for different EVI cells in a single Monfreda cell because differences in the geographical coordinates were too small to deviate from the yield values. Thus, the data points in a single Monfreda cell fell into the same point in Fig. 6. However, when the regressors built from the 14 variables were used, the estimates were different among EVI cells in a single Monfreda cell because of the different EVI values or growing periods. Consequently, the data in a single Monfreda cell spread vertically. Fig. 7 also showed similar vertically aligned lumps.

3.6 Performance depending on location or distance from training data sites

To determine whether the RF performance depends on the location or distance from training data, Figs. 8 and 9 were plotted. The minimum distance from the nearest training data sites to the validation sites (Panel (a)) tended to be large for archipelagoes or near the peripheries of rice-cropping areas. The misfits for the cases of 14 variables and geographical coordinates (Panels (b) and (c), respectively) showed similar geographical patterns. Despite the similarities between Panels (b) and (c), the differences in absolute magnitude (absolute values of (c) minus absolute values of (b)) showed that positive values outnumbered negative ones (Panel (d)), indicating that regressors incorporating the 14 variables outperformed at a larger number of sites. The advantage for the case of the 14 variables was also shown in Section S1.2 of the Supplementary Material, regardless of the distance from the nearest training data sites.

Fig. 8.Geographical distribution of (a) the minimum distance from the nearest training data sites at validation sites, (b) misfits (estimated yield minus Monfreda yield) for the case of 14 variables, (c) misfits for the case of geographical coordinate, and (d) differences in absolute magnitude of misfit (i.e., absolute value of (c) minus absolute value of (b)) for T90V3010.

Fig. 9.Geographical distribution of (a) the minimum distance from the nearest training data sites at validation sites, (b) misfits (estimated yield minus Monfreda yield) for the case of 14 variables, (c) misfits for the case of geographical coordinate, and (d) differences in absolute magnitude of misfit (i.e., absolute value of (c) minus absolute value of (b)) for T30V3010.

A comparison of corresponding panels between Figs. 8 and 9 showed how differences in training datasets affected the results. The distance for T30V3010 was generally smaller than that for T90V3010 because the training data T30 were more widely distributed globally and more densely distributed over rice-cropping areas (panels (a)). Biases also geographically clustered, but with much smaller magnitude for T30V3010 than T90V3010 (panels (b)).

Areas with relatively large misfits (i.e., biases) geographically clustered (Figs. 8 and 9). In each cluster, the magnitude of bias was relatively homogeneous. For example, in Fig. 8(b), a couple of clusters with negative large biases were observed around 30°N in China, whereas clusters with positive large biases were observed in Madagascar, the Philippines, and areas around Cambodia. These geographically clustered biases were attributable to ill-performance of RF: within each cluster, estimated yield values had less variability than actual yields. In fact, Fig. S6(c) for areas around Cambodia and Fig. S7(c) for areas around 30°N in China showed that data stretched horizontally. These panels correspond to regional extraction of data for the case of T90V3010 (Fig. 8(b)). These biases also improved with decreasing R h i g h T (in comparison of Fig. 9 with Fig. 8).

4. Discussion

4.1 Pros and cons of using random forest

RF is a powerful analysis tool for obtaining a relatively robust solution because the resultant solution is averaged over an ensemble set of yield values, each of which is estimated using a regressor (a series of regression trees). An ensemble mean hardly produces an ‘outlier’ because it tends to lie closer to the center of the data distribution. Therefore, it is reasonable for the RF yield values to lie within the yield range of the training data (Section 3.3).

RF requires the training data to be an unbiased subset of all possible data to be representative of the overall distribution - at least, the training data to cover a possible range of yield. If the training data are biased, the estimation will also be biased, as discussed in Section 3.3. Therefore, this issue was addressed by considering the global nature of the data used in this study (Sections 3.1, 3.3, and S1.1). The training data, T90, were geographically concentrated at low latitudes in Southeast Asia, where the yield ranged from 1.3-6.8 t ha-1 (Table 3). Even if such regressors were applied to data at mid-latitudes, where the actual yield ranged more widely (e.g., 1 to over 10 t ha-1 for V3010), the estimated yield would be bound to the range of the training data (Table 5). Instead, training dataset T30, geographically distributed over both low- and mid-latitudes across continents, covering a wider yield range. Since such training data are better representative of a possible yield range, regressors built from the data are applicable to various sites on Earth. This implies that, for global yield estimation using RF, collecting a smaller amount of training data per site but at a larger number of sites will achieve a higher performance than collecting a larger number of training data per site but only at a smaller number of sites, even if the total amount of training data remains the same.

Therefore, the selection of training data plays a crucial role in determining the overall accuracy of the yield estimation using RF. This might be a pitfall of RF because one might misunderstand as if RF automatically finds a robust solution anytime and anyhow regressors were built. The geographical coverage of the training data was focused on the spatial interpolation problems. However, more attention should be paid to the range of objective variables (yield, in this case). Recent studies have reported the ability of RFs to extrapolate outside the training data range (Berezowski and Chybicki, 2018; Hengl et al., 2018; Li et al., 2019a; Goulart et al., 2021; Ikeda and Kusaka, 2021; Silva et al., 2023).

It should be kept in mind that even if the training data were improperly sampled from a possible yield range, no alerts or warnings would be issued during the RF calculation. Since the calculation would normally be terminated in appearance, one would hardly perceive a deterioration of the estimated results. In published papers, one can find results showing that estimated yields were unnaturally bounded (e.g., Fig. 4 of Wu et al., 2023). We have no confidence but guess that some of these results might be attributable to the yield range of training data. This finding on yield coverage for the RF model is essentially the same as the applicability of conventional statistical models (e.g., multiple linear regressions), which depends on the range of the training data (Lobell and Burke, 2010; Parkes et al., 2019). Special attention must be paid to the representativeness of the training data in advance.

Last but not least, appropriate validation methodologies for one’s purpose should be introduced. In this paper, validation performance was evaluated using two methodologies. As we expected, nominal values of performance metrics using the OOB data were superior to ones using data at different R h i g h levels beyond the geographical extent of the training data (Tables 4 and 5). In case regressors which were built under some restrictions (e.g., limited data availability, geographical distribution) apply to “complete stranger” data under less restrictions, the actual performance might be lower than the nominal OOB performance.

4.2 Application of random forest to spatial interpolation problems

Spatial interpolation has been widely used to fill the spatial gaps between available data points. Moreover, it works effectively when the yield data are densely distributed over a target area. Recently, the number of studies using the RF as an advanced type of spatial interpolation has increased in various fields (e.g., Leirvik and Yuan, 2021; Sekulićet al., 2021). The performances of RFs have also been investigated through cross-comparisons among various methodologies (Li and Heap, 2011; Ohashi and Torgo, 2012; Sekulićet al., 2020). However, their nature could not be understood completely, particularly in terms of crop production. Therefore, learning more about the nature of the examples and lessons in agricultural studies is important. Consequently, the findings of this study regarding RF spatial interpolation are discussed from this viewpoint.

Longitudes and latitudes are often used as explanatory variables. This is rational because these variables scored the highest in importance among the 14 variables (Fig. 5). In the real world, crop yield sometimes changes across geographical structures, such as topographic lineaments (e.g., mountain ranges and valleys) or geographical boundaries of soil classification. Since the geographical coordinates are independent of any geographical structure, spatial interpolation using these variables cannot reproduce the structures unless the data are sufficiently densely distributed around the structures. If some variables are sensitive to the structure, the performance improves.

To compile yield data at global to regional scales, spatial gaps in original yield data must be filled. If only the longitude and latitude are chosen as explanatory variables (Section 3.5), RF will act as spatial interpolation. However, in contrast to most classical spatial interpolation techniques (e.g., inverse-distance weighting interpolation, polynomial interpolation, spline interpolation) which smoothly fill spatial gaps with continuous functions, RF spatial interpolation discontinuously fills the gaps with some jumps in values. This nature was clearly observed when RF spatial interpolation (or extrapolation) was applied to the entire Earth as an extreme case, regardless of its validity or feasibility (Fig. S9). The yield was estimated at a finer geographical resolution for areas where the training data were densely distributed (Fig. 3), but at a coarser resolution for areas where less training data were distributed. As a result, the interpolated surface was divided by blocky artifacts sectioned parallel to the longitude and latitude lines. This was attributable to the RF algorithm because of the switching to different nodes of the regression trees at some longitudes and latitudes, as already reported (e.g., da Silva Júnior et al., 2019). Global-to-regional yield data compiled using RF are prone to produce meaningless boundary structures like Fig. S9 over areas where training data are sparsely distributed. Data producers should call users’ attention to such artifacts not to misinterpret them.

Li et al. (2011) compared various interpolation methodologies, including RF, and found that RF and its related methodologies outperformed the others. They also found that, for an extrapolation problem, the estimations of RF were better than those of other methodologies. However, their analysis settings were slightly different from those adopted in this study in the following manner: (1) As an object variable, they dealt with the mud content ratio in weighted percentage, strictly ranging from 0 to 100%, whereas this study dealt with rice yield, which was not clearly confined to any range. (2) Training data were used heterogeneously but geographically sufficiently covering the target area (Fig. 1 of Li et al. (2011)), whereas training data in this study were geographically partly covering the target area, particularly for smaller R h i g h (Fig. 3). Thus, the estimated yield of this study could be outside the yield range of the training data, whereas the estimated mud content ratios tended to be within the ratio range of the training data in Li et al. Thus, the current statement does not contradict theirs.

Crop yield also changes with human factors, such as cropping maneuvers, infrastructure, and the choice of rice varieties. In fact, crop yields sometimes interrupt changes across the borders of administrative sectors. These effects were partly incorporated in the cases of 14 variables via explanatory variables (e.g., EVI, soil fertility, and water shortage). Consecutively, incorporating satellite- or simulation-based variables into spatial interpolation helps improve the performance. A hybrid data source (census data, satellite observations, and numerical simulations) in combination with RF, as in this study, also helps produce global crop yield data.

We consider that the methodology of this study can be applied to yield estimation of crops other than rice because of the advantages in (1) filling spatial gaps of data by interpolation, (2) dampening unwilling spatial instability, and (3) making spatial boundary structures. Regarding (1), collecting cultural data from all the croplands is unrealistic. Spatial interpolation methodologies will also work for other crops. For (2), any observation data contain errors. Besides, phenology independently determined at each EVI cells has no spatial linkage even between adjacent EVI cells. Such spatially high variability that does not reflect actual signals of crop yield should be removed. Finally, regarding (3), a combination of different data sources can produce geographical boundary structures, which cannot be obtained by a simple interpolation but only smoothing the surface. Since yield is a fundamental metric in agriculture, not only for food crops but also for cash crops (Saeidi et al., 2022), these advantageous factors will also be welcomed to studies on other agricultural crops.

4.3 Possible error sources and room for improvement in yield estimation

Several possible sources of error existed in rice yield estimation. First, the contamination of the EVI signals and imperfect land masks can be sources of error. Unconsciously, regressors may be trained using cells in which less rice is planted. However, creating a perfect rice cropping map is almost impossible (Section 2.6). Since it is rare for rice to be planted exclusively on land cells, EVI signals are somewhat contaminated by reflections from other land uses within a cell. Even if only rice is cultivated in a single EVI cell, a mixture of different timings of rice sowing/transplanting can disturb peak EVI timing, leading to a misinterpretation of the rice growing period. These factors may lower the importance of satellite-derived variables, such as the EVI, in the RF (Fig. 5).

Second, the choice of explanatory variables may also deteriorate overall accuracy. Intrinsically, an RF methodology is robust against the choice of variables because of their importance in building regressors. However, Li et al. (2011) reported the possibility that the inclusion of ill-performing variables may deteriorate accuracy via node division processes in RF. Silva et al. (2023) pointed out that since RF performance varies depending on crops, countries, and seasons, a well-performed set of explanatory variables change accordingly, across time and space.

Estimation of rice production over a certain area is important for food supply, but we should consider how to handle invalid data and biases before that. Regarding the former, remind that rice production from a land cell is evaluated as the product of the yield, cell area, and harvest area ratio. If any value of these variables is invalid (e.g., missing), rice production will inevitably be invalid. EVI cells with ill-determined phenology (e.g., rice would not mature) are also invalid. Regarding the latter, RF-estimated yields have biases in some regions (Figs. 8 and 9). Spatial aggregation of rice production over a certain extent of areas will sum up not only signals but also biases. Biases with a same magnitude will never be cancelled out by summation - just only cumulate. However, since the magnitude of bias decreased by using geographically densely distributed training data (Fig. 9 with respect to Fig. 8), the latter will be solved by lowering the threshold R h i g h T .

There is also some room for improvement in the crop models. Crop models cannot fully incorporate all the factors in conventional cultural practices, such as fertilization timing/intensity, irrigation timing/intensity, cultural techniques, rice varieties, natural disasters, crop diseases, and insect damage. These factors vary significantly from place to place, even among farmers in the same community. Moreover, multiple rice crops per year are commonly grown in tropical and subtropical regions. Different rice varieties are often chosen there for different cropping seasons due to consideration of local requirements such as water availability, temperature conditions, day length, and commercial markets (e.g., Nguyen and Kawaguchi, 2002; Biswas et al., 2019; Iwahashi et al., 2021). Different rice varieties have different growing periods owing to their different sensitivities to temperature and day length (e.g., Yoshida, 1981; van Oort et al., 2011; Fukui et al., 2015). Besides, farmers also choose different irrigation timings/intensities or fertilization timings/intensities in a suitable manner, depending on the season or site. Consequently, the actual farm-level yield reflects various factors (e.g., Silva et al., 2022). Such fine-tuned simulation settings have rarely been incorporated into global crop model simulations. If the crop model’s performance is improved, the importance of these variables in RF will increase accordingly.

The actual rice yields revealed by plot surveys sometimes differ substantially from place to place, even within an administrative unit. Since the performance was evaluated in comparison with the reference Monfreda yield by assuming a uniform yield value over the unit, this could be another source of error.

Practically, a trade-off occurs among the accuracy of the estimated yield, computational load, and labor for preparing training data. Therefore, to improve the overall accuracy, the training data should be sampled uniformly and unbiasedly using the widest possible range of data. However, an increasing amount of training data would increase the computational load, especially when building regressors in the RF. The strict selection of pure rice cells free from contamination by other land uses requires more laborious work for data qualification.

5. Conclusion

In this study, rice yield was estimated using a combination of different data products (census data, satellite observations, and crop model outputs) on a global scale using the RF methodology. Crop calendars were determined for each EVI cell (500 m resolution) using satellite data. Geographical coordinates (longitude and latitude) played crucial roles as explanatory variables in terms of their importance. However, other variables derived from satellite observations or crop model simulations also contributed to performance improvements.

Since RF is a cutting-edge methodology applicable to interpolation problems, lessons from analysis must be collected and utilized. The following remarks were deduced:

A possible yield range should be covered by the training data. Otherwise, the estimated yield will be wrongly confined to a narrow range, resulting in performance deterioration. This is because RF hardly estimates the yield values beyond the yield range of the training data. Therefore, more attention should be paid to the yield range and geographical coverage of the training data. If yield data are collected over areas where rice is densely planted, they will function less as training data for global estimation because of the limited yield range.

To further improve the yield estimation, the quality of satellite- or simulation-related variables should be improved to increase their importance (Fig. 5). By incorporating these variables, more benefit from their intrinsic nature can be achieved, i.e., the high geographical resolution of satellite-based variables, global coverage of numerical simulation, and determination of the crop calendar, which are all useful in agriculture. Accurate land-use data can also reduce errors by improving the training process.

To build versatile RF regressors, they should obtain high generalization performance. This is confirmed through validation. In agricultural application, it sometimes requires that yield must be estimated over a wide geographical extent from observation data over a limited geographical extent. However, a common OOB validation methodology might overrate the generalization performance because the geographical distribution of OOB validation data is much the same as that of in-bag training data. Our validation methodology based on R h i g h is advantageous in geographically separating validation data from training data, resulting in adequate evaluation of the generalization performance.

Although the rice yield was the focus of this study, the analytical method can be applied to other crops. Given the growing global population and the risk of unstable agricultural production due to meteorological factors and climate change, it is important to acquire accurate information on global crop production. Since yield estimation involves a very complex science linking nature (e.g., meteorological, geographical, and edaphic factors) and human activities (crop maneuvers, infrastructure, markets, and farmers’ choices), ML techniques have high potential for analyzing and solving agricultural problems of the present era.

Acknowledgments

This study was financially supported by JSPS KAKENHI (grant number: 22H00577). We acknowledge two anonymous reviewers for improving the manuscript. MODIS satellite data were downloaded from the United States Geological Survey (USGS) within the framework of Distributed Active Archive Centers directed by the National Aeronautics and Space Administration (NASA). Rice yield data compiled by Monfreda et al. (2008) were provided via EarthStat (http://www.earthstat.org/) operated by the Global Landscapes Initiative at the University of Minnesota’s Institute on the Environment and the Land Use and Global Environment lab at the University of British Columbia.

References
 
© Author (s).

This article is licensed under a Creative Commons [Attribution 4.0 International] license.
https://creativecommons.org/licenses/by/4.0/
feedback
Top