Intraseasonal and interseasonal applicability of a neural network model for real-time estimation of the number of air exchanges per hour of a naturally ventilated greenhouse

Neural network (NN) models with environmental data and the extent of ventilator openings as inputs have the potential to estimate the number of air exchanges per hour (N) of a naturally ventilated greenhouse in real time. In this study, the intraseasonal and interseasonal applicability of an NN model was verified: whether the model trained in a specific period can be applied to different periods of the same and other seasons. First, the effect of data collection periods for model training and testing within the same season on the estimation accuracy of N was examined. The estimation accuracy was lowered even when the model was applied to a period immediately following that used for model training. Adjusting the training dataset so that the relative distribution of the temperature difference between the inside and outside of the greenhouse (∆T) approached that of the test dataset slightly improved the estimation accuracy. However, when the model was applied to interseasonal data, such training data adjustments did not improve the estimation accuracy. This indicates that the NN model needs to be further improved for practical use to estimate N of naturally ventilated greenhouses.


Introduction
The number of air exchanges per hour (N) is used to quantify the extent of ventilation of a greenhouse. It is known that the N of a naturally ventilated greenhouse is dynamically affected by the current internal and external environment and the extent of ventilator openings. If N is estimated in real time, the exchange rates of CO2, water vapor, and energy/enthalpy between the greenhouse and the outside via ventilation can be calculated in real time. This can contribute to cost-effective and prompt environmental control, such as CO2 enrichment, even when the ventilators are open. However, unlike for a greenhouse with forced ventilation, there is no practical method for estimating N of a naturally ventilated greenhouse in real time with sufficient accuracy for practical use.
There are several methods to calculate N of a naturally ventilated greenhouse (Boulard, 2006). The methods include the tracer gas method with CO2 (Okada and Takakura, 1973; Nederhoff et al., 1985; Boulard and Draoui, 1995), the energy balance method (Kozai et al., 1980; Fernandez and Bailey, 1993), the water balance method (Takakura et al., 2009, 2017), and the methods based on aerodynamic (Kozai et al., 1980; Sase et al., 1980; Boulard and Baille, 1995; Boulard et al., 1996; Kittas et al., 1997) or neural network (Seginer et al., 1994; Boulard et al., 1997) models. Among them, we have recently shown the potential of NN models with internal and external environmental data and the extent of ventilator openings as the model inputs for real-time N estimation for a naturally ventilated uncropped greenhouse (Matsuda et al., 2019). Using data collected in an uncropped experimental greenhouse in the winter, NN models were trained and validated by 10-fold cross-validation. The NN models tended to exhibit a higher accuracy in estimating N of the greenhouse than the aerodynamic and empirical models. On the other hand, several important points remained uninvestigated in that study. One of them was the applicability of the model to a period different from that used for model training and validation. If real-time N estimation based on an NN model is intended to be used for a target greenhouse throughout the year, the model must first be trained with data collected over a short period in an uncropped season. Then, the model can be applied to the same greenhouse with plants to estimate N in real time.
In this study, we tested the intraseasonal and interseasonal applicability of an NN model: whether the model trained in a specific period can be applied to different periods of the same and other seasons. First, the intraseasonal applicability was examined. The NN model was trained with and tested against data from different periods within the same season (winter). Second, the training data for the NN model were selected and the entire training dataset was adjusted to improve the estimation accuracy. The training data were sampled so that the relative distribution of the dataset resembled that of a target test dataset, and the effect of this sampling on the estimation accuracy was verified. Third, we investigated the interseasonal applicability of the NN model trained with a dataset with or without relative distribution adjustment. The results suggested that the NN model can be applied to different periods in the same season after adjusting the training dataset. However, there are still issues to be addressed before the model can be applied to other seasons.

Greenhouse specifications
The experimental greenhouse used in this study was located at the head office of Seiwa Co., Ltd. in Shimotsuke, Tochigi, Japan (36°22′N). A multi-span Venlo-type greenhouse was divided into eight single-span compartments, and one compartment was used for the measurements. The roof covering material was an ethylene-tetrafluoroethylene film (F-CLEAN™ Clear; AGC Green-Tech Co., Ltd., Tokyo, Japan). For the outer side wall, acrylic multi-layer panels were used for the lower portion (up to 2 m from the ground) and polycarbonate panels for the remaining upper portion. The compartments were partitioned with figured glass for the lower portion and transparent glass for the upper portion, and were spatially isolated from each other. The compartment was oriented northeast-southwest. The length, width, eave height, and ridge height of the compartment were 20.0 m, 8.0 m, 5.8 m, and 8.0 m, respectively. The floor area and volume (V) of the compartment were approximately 160 m² and 1,100 m³, respectively. There were ventilators on each side of the roof, and the extents of the ventilator openings were independently adjusted based on the levels of environmental factors inside and outside the greenhouse using a greenhouse environmental measurement system (Priva Connext; Priva B.V., De Lier, The Netherlands). On the southwest wall, there were four fans with louvers. The fans were off and the louvers were open during the experiment. The entire floor inside the greenhouse was covered with concrete or impermeable mulch. There were no crop plants in the compartment during measurements.

Measurements
Air temperature inside (T_i) and outside (T_o) the greenhouse and solar irradiance outside the greenhouse (I) were measured and recorded by the greenhouse environmental measurement system. The extent of the windward (R_w) and leeward (R_l) roof ventilator openings was evaluated as the vertical height of the openings relative to the maximum height (1.16 m). The wind speed (U) and direction outside the greenhouse were measured with an anemometer (PGWS-100; Gill Instruments Limited, Hampshire, UK). The wind speed components orthogonal and parallel to the ridgepole (U_x and U_y, respectively) were calculated. To calculate N, pure CO2 gas was released into the greenhouse at a constant rate (S) of 10 L min⁻¹ (at 20 °C, 101 kPa) using a mass flow controller (3655; KOFLOC Corp., Kyoto, Japan) throughout the daytime. The outlet of the gas tube for pure CO2 was fixed in front of an oscillating fan placed at the center of the greenhouse compartment so as to homogenize the CO2 concentration in it. The concentration of CO2 inside (C_i) and outside (C_o) the greenhouse was measured using two nondispersive infrared CO2 analyzers (ZFP9; Fuji Electric Co., Ltd., Kanagawa, Japan). The air inside and outside the greenhouse was passed through a thermoelectric dehumidifier (DH-209C-1-R; KELK Ltd., Kanagawa, Japan) and supplied to the analyzers at a constant flow rate of 0.5 L min⁻¹ (at 20 °C, 101 kPa) using mass flow meters (3810DS-V; KOFLOC Corp.). The data were recorded using a data logger (GL-220; Graphtec Corp., Yokohama, Japan). The environmental data were recorded every 10 s, except for I, R_w, and R_l, which were recorded every 5 min.
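As a minimal sketch of how the ventilator opening ratio and the wind components described above might be computed, the following Python functions can be used. The ridge bearing of 45° (taken from the northeast-southwest orientation), the convention that wind direction is given clockwise from north, and the sign conventions are assumptions made for illustration; the text does not state them, and the function names are hypothetical.

```python
import math

MAX_OPENING_M = 1.16  # maximum vertical height of the roof ventilator opening


def relative_opening(height_m, max_height_m=MAX_OPENING_M):
    """Extent of a ventilator opening as a fraction of the maximum height."""
    return height_m / max_height_m


def wind_components(speed, direction_deg, ridge_azimuth_deg=45.0):
    """Split the wind into components orthogonal (u_x) and parallel (u_y) to the ridge.

    direction_deg: wind direction, clockwise from north (assumed convention).
    ridge_azimuth_deg: compass bearing of the ridge line (assumed 45 deg).
    """
    theta = math.radians(direction_deg - ridge_azimuth_deg)
    u_x = speed * math.sin(theta)  # component across the ridge
    u_y = speed * math.cos(theta)  # component along the ridge
    return u_x, u_y
```

For example, `wind_components(2.0, 135.0)` describes a 2 m s⁻¹ wind blowing at right angles to the assumed ridge line, so almost all of it appears in `u_x`.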
The experiments were conducted in the winter, summer, and fall of 2019 (Table 1). In summer, we excluded the whole day's data for any day that included hours when T_i exceeded 50 °C, because such temperatures were outside the operating range of the mass flow controller for the pure CO2 gas released into the greenhouse.

Calculation of the number of air exchanges per hour
The N of a greenhouse is defined by a differential equation (Eq. 1) in which P_n is the amount of net photosynthetic CO2 absorption of all crops in the greenhouse per unit of time. Eq. 2 is obtained from Eq. 1 by assuming that P_n is 0 for an uncropped greenhouse. To solve Eq. 2 to obtain N, the differential coefficient of the left-hand differential term at time t = t_a [min] was approximated using C̄_i(t_a), the centered moving average of C_i between t_a − 15 and t_a + 15.
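The calculation can be illustrated with a short Python sketch. It assumes the common tracer-gas balance V dC_i/dt = S − N V (C_i − C_o) for an uncropped greenhouse, with V in mol of air, S in mol h⁻¹, and concentrations in mol mol⁻¹; this form is an assumption, chosen because it is consistent with the symbols used here and with the coefficient 27.3 = V (C_i − C_o) that appears in the Discussion. The use of 1-min data, a ±15-min window, and a central difference of the smoothed series are likewise assumptions, and all function names are hypothetical.

```python
import numpy as np

R_GAS = 8.314  # universal gas constant, J mol-1 K-1


def litres_per_min_to_mol_per_h(flow_l_min, temp_c=20.0, pressure_pa=101e3):
    """Convert a gas flow in L min-1 at the stated conditions to mol h-1 (ideal gas)."""
    molar_volume_l = R_GAS * (temp_c + 273.15) / pressure_pa * 1000.0
    return flow_l_min / molar_volume_l * 60.0


def centered_moving_average(x, half_window):
    """Mean of x over [i - half_window, i + half_window]; NaN where the window is incomplete."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    width = 2 * half_window + 1
    if x.size >= width:
        kernel = np.ones(width) / width
        out[half_window:x.size - half_window] = np.convolve(x, kernel, mode="valid")
    return out


def air_exchange_rate(c_in, c_out, s_mol_h, v_mol, dt_min=1.0, half_window=15):
    """Estimate N [h-1] from CO2 records (mol mol-1), assuming P_n = 0."""
    c_in = np.asarray(c_in, dtype=float)
    c_out = np.asarray(c_out, dtype=float)
    c_smooth = centered_moving_average(c_in, half_window)
    dcdt = np.gradient(c_smooth, dt_min / 60.0)  # mol mol-1 h-1
    # Rearranged balance: N = (S - V dC_i/dt) / (V (C_i - C_o))
    return (s_mol_h - v_mol * dcdt) / (v_mol * (c_in - c_out))
```

Values near the start and end of a record come out as NaN because the centered window is incomplete there.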

NN model
The NN model used to estimate N was the same as that reported by Matsuda et al. (2019), with some exceptions. The measured variables used for the input layer (independent variables, or the feature vector) were U_x, U_y, R_w, R_l, T_i, T_o, and I. These were selected because an NN model with these variables and the wind direction as the input layer showed the highest N estimation accuracy in our previous study (Matsuda et al., 2019). As the effect of wind direction on N was not necessarily large (Matsuda et al., 2019), it was removed here. The input layer consisted of these variables measured at the time of N estimation and at 1, 2, 3, 4, and 5 min before it, assuming that the effects of environmental conditions on N can be delayed (Matsuda et al., 2019). For I, R_w, and R_l, linear interpolation was used to estimate the 1-min data from the 5-min data. The output layer (dependent variable) was N. There were two hidden layers between the input and output layers, each with 20 units. The hyperparameters were the same as those reported by Matsuda et al. (2019), except that the batch size was 1,024. The NN model was trained with mini-batch gradient descent using backpropagation.
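The construction of the input layer can be sketched as follows. This is only an illustration of the lagging and interpolation described above, not the authors' code: the function names and the column ordering are hypothetical. With seven variables and six time points (0 to 5 min before estimation), each input row has 7 × 6 = 42 values.

```python
import numpy as np

N_LAGS = 6  # the estimation time plus 1, 2, 3, 4, and 5 min before it


def interpolate_to_1min(t_5min, values_5min, t_1min):
    """Linearly interpolate 5-min records (e.g. I, R_w, R_l) onto a 1-min time axis."""
    return np.interp(t_1min, t_5min, values_5min)


def build_lagged_inputs(series_1min):
    """Stack every variable at lags of 0 to 5 min into one row per estimation time.

    series_1min: array of shape (n_minutes, n_variables), one row per minute.
    Returns an array of shape (n_minutes - 5, n_variables * 6); the first
    n_variables columns hold lag 0 and the last n_variables columns hold lag 5.
    """
    x = np.asarray(series_1min, dtype=float)
    n = x.shape[0]
    lagged = [x[N_LAGS - 1 - lag:n - lag] for lag in range(N_LAGS)]
    return np.hstack(lagged)
```

The first five minutes of a record produce no input row, because their 5-min history is not available.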

Intraseasonal applicability of NN model
The data measured between January 23 and February 4 were randomly divided into two datasets. Their sizes were 75% and 25% of the original dataset, and the former and latter were used as the training and validation datasets, respectively. This validation dataset is referred to as the "simultaneous" validation dataset. The data measured between February 6 and 10 were used as a test dataset, which is referred to as the "non-simultaneous" test dataset. The term indicates that the data were acquired in a different period from that of the training dataset. The root mean square error (RMSE) and the coefficient of determination (r²) for the 1:1 relationship between the measured and the estimated N were compared.
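A minimal sketch of the random 75/25 split and the two accuracy measures is given below. The definition of r² used here, 1 − SS_res/SS_tot with residuals taken about the 1:1 line, is an assumption about what "r² for the 1:1 relationship" means; the function names and the fixed random seed are hypothetical.

```python
import numpy as np


def random_split(n_samples, train_fraction=0.75, seed=0):
    """Return shuffled index arrays for the training and validation datasets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]


def rmse(measured, estimated):
    """Root mean square error between measured and estimated N."""
    measured = np.asarray(measured, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.sqrt(np.mean((estimated - measured) ** 2)))


def r2_one_to_one(measured, estimated):
    """Coefficient of determination about the 1:1 line (assumed definition)."""
    measured = np.asarray(measured, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    ss_res = np.sum((measured - estimated) ** 2)
    ss_tot = np.sum((measured - measured.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Under this definition r² can be negative when the estimates are farther from the 1:1 line than the measurements are from their own mean.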

Sampling of training data to adjust the relative frequency distribution of temperature difference
The relative frequency distribution of the temperature difference between the inside and outside of the greenhouse (T_i − T_o, hereafter referred to as ∆T) in the dataset used to train the NN model was adjusted to approximate that in the test dataset by the following steps:
(i) Two feature vectors (i.e., combinations of U_x, U_y, R_w, R_l, T_i, T_o, and I) having the maximum ∆T (∆T_max) and minimum ∆T (∆T_min) values in the test dataset, respectively, were selected. The difference between ∆T_max and ∆T_min (∆T_max − ∆T_min) was divided into 500 equal intervals, d_k (k = 1, 2, ..., 500).
(ii) The number of feature vectors with ∆T in each d_k of the test dataset (c_k) was counted. Then, the relative frequency distribution of ∆T, p_k = c_k / Σ_{k=1}^{500} c_k (0 ≤ p_k ≤ 1), was obtained for the test dataset.
(iii) In the training dataset, the feature vectors with ∆T within each d_k were defined as the set X_k. The number of elements of X_k was n_k, and the total number of feature vectors of the entire training dataset was n_t = Σ_{k=1}^{500} n_k. For the adjusted training dataset, the feature vectors were randomly sampled from each X_k without replacement so that the number of samples became n_t p_k for each d_k. If n_k < n_t p_k for a given d_k, the number of samples became n_k. If p_k = 0 or n_k = 0 (i.e., X_k was the empty set), the number of samples became 0.
(iv) Finally, the relative frequency distribution of ∆T of the adjusted training dataset was obtained, approximating that of the test dataset.
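The sampling steps above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name is hypothetical, n_t p_k is rounded to the nearest integer, the last interval is treated as closed on the right, and training vectors whose ∆T falls outside the test range are taken to belong to no X_k.

```python
import numpy as np


def adjust_training_set(dt_train, dt_test, n_bins=500, seed=0):
    """Return indices of training samples whose dT histogram mimics that of the test set."""
    rng = np.random.default_rng(seed)
    dt_train = np.asarray(dt_train, dtype=float)
    dt_test = np.asarray(dt_test, dtype=float)

    # Step i: equal-width intervals d_k spanning dT_min to dT_max of the test dataset.
    edges = np.linspace(dt_test.min(), dt_test.max(), n_bins + 1)

    # Step ii: counts c_k and relative frequencies p_k for the test dataset.
    c_k, _ = np.histogram(dt_test, bins=edges)
    p_k = c_k / c_k.sum()

    # Step iii: assign training vectors to the sets X_k, then sample
    # min(n_k, n_t * p_k) vectors from each X_k without replacement.
    inside = (dt_train >= edges[0]) & (dt_train <= edges[-1])
    candidates = np.flatnonzero(inside)
    bin_of = np.searchsorted(edges, dt_train[candidates], side="right") - 1
    bin_of = np.clip(bin_of, 0, n_bins - 1)
    n_t = candidates.size
    selected = []
    for k in range(n_bins):
        x_k = candidates[bin_of == k]
        n_take = min(x_k.size, int(round(n_t * p_k[k])))
        if n_take > 0:
            selected.append(rng.choice(x_k, size=n_take, replace=False))

    # Step iv: the returned indices define the adjusted training dataset.
    if not selected:
        return np.array([], dtype=int)
    return np.sort(np.concatenate(selected))
```

Because sampling is capped at n_k in each interval, the adjusted dataset is generally smaller than the original training dataset, and its distribution only approximates that of the test dataset.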

Interseasonal applicability of NN model
Training and test datasets were taken from winter, summer, and fall, with or without adjustment of the training datasets (Table 1). RMSE and r² between the measured and the estimated N for the test datasets were compared. Table 2 shows a summary of the environmental conditions in winter, summer, and fall. Daily mean values of all environmental factors, except for T_o and T_i, did not differ much among the seasons.

Results and Discussion
First, we tested the effect of the data collection period for model training and validation/testing during the same season (winter) on N estimation accuracy (Fig. 1). The RMSE was higher for the non-simultaneous test dataset than for the simultaneous validation dataset. Similarly, r² for the 1:1 relationship between the measured and the estimated N was lower for the non-simultaneous test dataset. Therefore, the overall estimation accuracy of N using the NN model was reduced by applying the model to a period different from that used for model training.
Let us discuss here how accurate the estimation of N needs to be for practical greenhouse environmental control, although this largely depends on the grower's purpose. Assume that one intends to use the estimated N to quantify the net photosynthetic rate of the whole crop in a greenhouse in real time, in order to account for the effect of CO2 enrichment. The required accuracy of N estimation for that purpose may be roughly calculated from Eq. 1.
As a realistic condition for the greenhouse used in this study, assuming that plants are grown therein, we assign V = 1,100 m³ (≈ 45,500 mol at 20 °C, 101 kPa), C_o = 400 μmol mol⁻¹, and C_i = 1,000 μmol mol⁻¹, and σ_Pn = 27.3 σ_N (Eq. 7) is obtained. Because the standard deviation and RMSE will have the same order of magnitude, we substitute a value of 0.63 h⁻¹, which was observed as the RMSE for the "non-simultaneous" test dataset (Fig. 1), for σ_N, and finally a σ_Pn of 17.1 mol h⁻¹ is obtained. Next, let us assume that plants grown in the greenhouse have an average leaf net photosynthetic rate of 10 μmol m⁻² s⁻¹ and a leaf area index of 4 m² m⁻². Given that the floor area of the greenhouse was 160 m², P_n is computed as 23.0 mol h⁻¹ (10 × 10⁻⁶ × 3,600 × 4 × 160 = 23.0). Collectively, the σ_Pn value of 17.1 mol h⁻¹ indicates that P_n can often be over- or underestimated by 74%. Conversely, if one wants to quantify P_n within a lower relative error of 50% (i.e., 23.0 ± 11.5 mol h⁻¹) for this hypothetical greenhouse, for example, the RMSE for N estimation must be 0.42 h⁻¹ or lower, according to Eq. 7. Furthermore, if the measurement uncertainties of V, C_o, and C_i, which have been ignored in this discussion, are taken into consideration, the required accuracy of N estimation will be much higher. Thus, although the above calculations are a rough estimate for a particular case and not necessarily quantitatively strict, the estimation accuracy of the NN model applied to the "non-simultaneous" test dataset appears quite insufficient for monitoring P_n in a greenhouse.
Fig. 1. The relationship between the measured number of air exchanges per hour (N) and N estimated using the neural network (NN) model for the "simultaneous" validation dataset (white circles) and the "non-simultaneous" test dataset (dark grey circles). The solid line is the 1:1 line. For the "simultaneous" validation dataset, RMSE = 0.48 h⁻¹ and r² = 0.75; for the "non-simultaneous" test dataset, RMSE = 0.63 h⁻¹ and r² = 0.52.
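The rough accuracy requirement discussed for the hypothetical cropped greenhouse can be retraced with a few lines of Python. The sketch assumes that the uncertainty of P_n comes from N alone and scales as σ_Pn = V (C_i − C_o) σ_N, which reproduces the stated coefficient of 27.3; the function and variable names are hypothetical. With the rounded RMSE of 0.63 h⁻¹ the product is about 17.2 mol h⁻¹, close to the 17.1 mol h⁻¹ given in the text, which may have been computed from unrounded values.

```python
V_MOL = 45_500   # greenhouse air volume, mol (1,100 m3 at 20 degC, 101 kPa)
C_OUT = 400e-6   # outside CO2 concentration, mol mol-1
C_IN = 1000e-6   # inside CO2 concentration, mol mol-1


def sigma_pn(sigma_n, v_mol=V_MOL, c_in=C_IN, c_out=C_OUT):
    """Uncertainty of P_n [mol h-1] caused by an uncertainty sigma_n [h-1] in N (assumed relation)."""
    return v_mol * (c_in - c_out) * sigma_n


def canopy_pn(leaf_rate_umol_m2_s=10.0, lai=4.0, floor_area_m2=160.0):
    """Whole-canopy net photosynthesis [mol h-1] from leaf rate, leaf area index, and floor area."""
    return leaf_rate_umol_m2_s * 1e-6 * 3600 * lai * floor_area_m2


coefficient = sigma_pn(1.0)                 # about 27.3
p_n = canopy_pn()                           # about 23.0 mol h-1
relative_error = sigma_pn(0.63) / p_n       # roughly three quarters of P_n
required_sigma_n = 0.5 * p_n / coefficient  # about 0.42 h-1 for a 50% relative error
```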
Nevertheless, looking at the diurnal time courses of measured and estimated N, the agreement between them in the non-simultaneous test dataset varied from day to day. For example, the estimated N on February 6 fluctuated, whereas the measured N was fairly constant (Fig. 2a). On the other hand, the measured and estimated N on February 8 agreed relatively well except at 11:00-12:00 and 14:00-15:00 (Fig. 2b). No clear correlation was observed between the daily environmental conditions and the estimation accuracy (Fig. 2c-f). The factors underlying the variation in estimation accuracy have not yet been determined.
One of the reasons for the low estimation accuracy for a non-simultaneous test dataset can be the "covariate shift" that is common in supervised learning (Shimodaira, 2000; Sugiyama, 2006; Sugiyama et al., 2014). Covariate shift means that the distribution of covariates, or model input variables, differs between the training and test datasets, while the input-output relationship does not change. We hypothesized that the estimation accuracy can be improved if the distribution of model-related parameters in the training dataset is similar to that in the test dataset. As candidate parameters, we adjusted the relative distribution of U, T_i, T_o, ∆T, or I. The data used for training were sampled so that the distribution resembled that of the test dataset (Fig. 3; see Materials and Methods for details). When we adjusted the distribution of ∆T in the training dataset, the estimation accuracy of the NN model trained with the adjusted dataset was slightly improved for the non-simultaneous test dataset compared with the model trained with the raw dataset (Fig. 4). Adjusting the training dataset thus appears to improve the estimation accuracy of N. In practice, the relative distribution of ∆T in the test dataset is not known a priori at the time the model is trained. Instead, the relative distribution of ∆T acquired in the target season of previous years can be used as a reference.
Next, we applied this method to interseasonal data. In contrast to the intraseasonal data (Fig. 4), the estimation accuracy was not improved by relative distribution adjustment (Table 3). Therefore, relative distribution adjustment worked for intraseasonal data, but not for interseasonal data.
In this work, the number of data points (the size of the data) used for model training may have been small (Table 1) relative to the input-layer dimension of 42. This could lead, at least in part, to the observed lower accuracy of N estimation through insufficient model generalization. Indeed, overfitting might have occurred on February 6: although the environmental conditions fluctuated less on that day (Fig. 2c, e), a small fluctuation in one or more input variables or in their covariances might have led to the fluctuation of estimated N. On the other hand, measured and estimated N agreed better on February 8 (Fig. 2b), although T_i, T_o, U, and I fluctuated more (Fig. 2d, f) than on February 6. Thus, the reason for the possible overfitting was not identified. We limited the data size for model training in consideration of realistic year-round greenhouse production; data collection for model training must be carried out in a cropless greenhouse between crop seasons, and the period should generally be limited to a month or shorter. To improve the estimation accuracy with the limited data size, applying feature extraction or feature reduction techniques, such as principal component analysis, to the model inputs might help. Nevertheless, the estimation accuracy of the NN model in interseasonal applications was too low (Table 3), strongly suggesting that a completely different approach is needed to improve it, in addition to the input variable adjustments. Possible solutions toward interseasonal NN model application include continuous model correction while the model is running in a cropped greenhouse. Although it may be considered somewhat unrealistic, we can consider a miniature model greenhouse without plants, built alongside the target greenhouse and equipped with an environmental measurement and control system similar to that used for the target greenhouse. The vacant model greenhouse is operated similarly to the target greenhouse where crops are cultivated, while actual N is continuously measured for the model greenhouse. By utilizing such data obtained from the model greenhouse, the NN model under operation in the target greenhouse could be dynamically corrected.
In summary, the estimation accuracy of the NN model for N of a naturally ventilated greenhouse was low when the dataset used to train the model differed in time from that used to test it. Adjusting the relative distribution of ∆T in the training dataset slightly improved the estimation accuracy when the model was applied to the period immediately following training in the same season. However, the estimation accuracy was not improved when the model was applied to periods in different seasons. The method proposed here can only be used to estimate N if the periods used for model training and testing are close in time. A novel dataset adjustment or a fundamental improvement of the estimation procedure is needed for year-round practical use.