Journal of the Meteorological Society of Japan. Ser. II
Online ISSN : 2186-9057
Print ISSN : 0026-1165
ISSN-L : 0026-1165
Article
Nonlinear Data Assimilation by Deep Learning Embedded in an Ensemble Kalman Filter
Tadashi TSUYUKI, Ryosuke TAMURA

2022 Volume 100 Issue 3 Pages 533-553

Abstract

Recent progress in the particle filter has made it possible to use it for nonlinear or non-Gaussian data assimilation in high-dimensional systems, but a relatively large ensemble is still needed to outperform the ensemble Kalman filter (EnKF) in terms of accuracy. An alternative ensemble data assimilation method based on deep learning is presented, in which deep neural networks are locally embedded in the EnKF. This method is named the deep learning-ensemble Kalman filter (DL-EnKF). The DL-EnKF analysis ensemble is generated from the DL-EnKF analysis and the EnKF analysis deviation ensemble. The performance of the DL-EnKF is investigated through data assimilation experiments in both perfect and imperfect model scenarios using three versions of the Lorenz 96 model and a deterministic EnKF with an ensemble size of 10. Nonlinearity in data assimilation is controlled by changing the time interval between observations. Results demonstrate that, despite such a small ensemble size, the DL-EnKF is superior to the EnKF in terms of accuracy in strongly nonlinear regimes and that the DL-EnKF analysis is more accurate than the output of deep learning because of positive feedback in assimilation cycles. Even if the target of training is an EnKF analysis with a large ensemble or a simulation by an imperfect model, the improvement introduced by the DL-EnKF is not very different from the case where the target of training is the true state.

1. Introduction

Data assimilation in nonlinear or non-Gaussian systems has been a challenge in meteorology and other geosciences (Bocquet et al. 2010). For instance, it is well known that cumulus convection exhibits strong non-Gaussianity in data assimilation (e.g., Kondo and Miyoshi 2019; Kawabata and Ueno 2020). The ensemble Kalman filter (EnKF) is formulated under the Gaussian assumption and is close to optimal in weakly nonlinear regimes, but it does not work well if nonlinearity is strong. On the other hand, the particle filter (PF) does not need the Gaussian assumption, but the weight degeneracy had been preventing the use of the PF for high-dimensional data assimilation (Snyder et al. 2008; van Leeuwen 2009). However, this limitation is disappearing owing to recent developments in the PF, including the use of localization and hybrids with the EnKF (Farchi and Bocquet 2018; van Leeuwen et al. 2019). Despite this progress, a relatively large ensemble is still needed for the PF to outperform the EnKF (e.g., Penny and Miyoshi 2016). This may be plausible because non-Gaussian data assimilation needs some information on higher-order moments of probability density functions (PDFs). As for the four-dimensional variational method (4D-Var), Tsuyuki (2014) showed that the 4D-Var with a conventional cost function implicitly used a non-Gaussian prior PDF that evolved according to the Liouville equation (Ehrendorfer 1994) if a certain condition was satisfied, and that the difficulty caused by multiple minima could be alleviated by combining with the EnKF. The iterative ensemble Kalman filter/smoother (IEnKF/IEnKS) has been shown to be the missing link between the PF and the EnKF and 4D-Var, and can work very well with mild nonlinearity and generate a much better analysis than the abovementioned data assimilation methods (Sakov et al. 2012; Bocquet and Sakov 2013, 2014; Bocquet 2016). However, the IEnKF/IEnKS needs a much larger computational cost because of the iterative application of the EnKF/EnKS and the use of a long assimilation window.

Recent developments in machine learning, in particular in deep learning (LeCun et al. 2015), have demonstrated impressive skills in various fields. Data-driven modeling, including data-driven parametrizations, based on machine learning has been extensively explored for improving simulations and predictions of nonlinear dynamical systems. Dueben and Bauer (2018) discussed the question of whether models based on deep learning and trained on atmospheric data could compete with weather and climate models based on physical principles. Reichstein et al. (2019) advocated a hybrid modeling approach in which physical process models were coupled with machine learning to further improve understanding and predictive ability in earth system science. Abarbanel et al. (2018) and Geer (2020) showed an equivalence in formulation between data assimilation and deep learning. Lists of literature of recent studies are available in Reichstein et al. (2019) and Chattopadhyay et al. (2020), for instance. Quite recently, the combination of data assimilation and machine learning has been explored to address sparse and noisy observations in data-driven modeling (Brajard et al. 2020a; Bocquet et al. 2020; Tomizawa and Sawada 2021; Gottwald and Reich 2021; Wikner et al. 2021), data-driven parametrizations (Brajard et al. 2020b), and model error correction (Farchi et al. 2021).

Research on the application of deep learning to data assimilation itself has also started. Arcucci et al. (2021) proposed a method for integrating variational data assimilation with deep learning, in which a recurrent neural network is trained on the state of a dynamical model and the result of data assimilation. Silva et al. (2021) proposed the use of a generative adversarial network to make predictions and to assimilate observations by using a low-dimensional space for the spatial distribution of the simulated state. However, it is difficult to directly apply those methods to data assimilation in high-dimensional systems, such as atmospheric models for numerical weather prediction.

In this study, we present an ensemble data assimilation method combining the EnKF and deep learning as an alternative to the PF for high-dimensional systems. The additional computational cost to assimilate observations is a small fraction of that of the EnKF. As a deep neural network (DNN) can learn a data assimilation method for a specific dynamical system and a specific observing system by training, we could expect this method to work with a relatively small ensemble size even in strongly nonlinear regimes. However, data assimilation in meteorology is generally a large-scale problem, and the background error covariance and the distributions of radar and satellite data change with the analysis time. The EnKF, as well as the PF and 4D-Var, can properly deal with this nonstationarity in data assimilation. On the other hand, deep learning is based on the minimization of the sum of errors over many samples. In addition, it would be difficult to provide sufficient information on the forecast error covariance to a DNN, because the feasible size of a DNN is limited, where we define the size of a DNN as the total number of weights including bias parameters to be optimized by training. If the output of a DNN is not well optimized for each analysis time, the analysis accuracy may deteriorate in assimilation cycles. However, as the EnKF does not work very well in strongly nonlinear regimes, we could expect data assimilation by deep learning to outperform the EnKF in such regimes.

The purpose of this study is to propose a nonlinear data assimilation method based on deep learning that is locally embedded in an EnKF and to investigate its performance through data assimilation experiments in both perfect and imperfect model scenarios using toy models. By applying deep learning in combination with an EnKF, we can reduce the size of a DNN and address the nonstationarity in data assimilation. This method is named the deep learning-ensemble Kalman filter (DL-EnKF).

The remainder of this paper is organized as follows. Section 2 introduces the DL-EnKF method. Section 3 describes the design of experiments in both perfect and imperfect model scenarios. Section 4 presents the results of these experiments. A summary and discussion are presented in Section 5.

2. Method

As data assimilation is generally a large-scale problem, it is desirable to keep the size of a DNN as small as possible. For instance, the size of a feed-forward neural network with m layers and n nodes per layer is about n²(m − 1), and a larger network would require a greater number of samples for training. If we directly apply deep learning to data assimilation, the number of input nodes is at least the sum of the number of observations and the degrees of freedom of a numerical model, and the number of output nodes is the degrees of freedom of the model, whereas the number of nodes of a hidden layer is usually required to be larger than the number of input or output nodes. For high-dimensional systems such as atmospheric models, the size of a DNN would become too large to be stored in the memory of a computer and to prepare sufficient training samples. To apply deep learning to data assimilation for atmospheric models, we need to introduce a localization procedure and to train a DNN to have some versatility so that it is applicable to each grid point in a certain range of geographical areas.

Figure 1a shows the workflow of the DL-EnKF, in which deep learning is locally embedded in an EnKF. The “EnKF” box in this figure represents the analysis step of the EnKF, and the “Deep Learning” box consists of an ensemble of several DNNs. The inputs of DNNs to create the DL-EnKF analysis at a grid point are the EnKF analysis, forecast, observations, availability of observations in binary, and pseudo-observations that supplement missing observations. The EnKF analysis and forecast are the ensemble means of each ensemble. We do not explicitly use the information contained in the forecast ensemble except for the ensemble mean to reduce the size of DNNs. As observational data for which the DNN has input nodes may sometimes be missing, it is necessary to provide the information on the availability of observations to DNNs. The pseudo-observations are prepared by using the EnKF analysis and the observation operators of the missing observations. Those input data are extracted from a small domain centered at the analysis grid point. The radius of this domain is hereafter referred to as the input radius, and this value is assumed to be smaller than the covariance localization radius of the EnKF. According to Hsieh and Tang (1998), the DL-EnKF analysis is the average of outputs from the ensemble of DNNs. The individual outputs from DNNs would be scattered in phase space because of multiple minima of a loss function of deep learning, and we would probably obtain a better DL-EnKF analysis by averaging those individual outputs.
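As an illustration of this workflow, the following sketch shows how the local input vector for one analysis grid point could be assembled and how the DL-EnKF analysis is obtained by averaging the outputs of the trained DNNs. The helper names and the identity observation operator are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def local_window(field, k, r):
    """Values of a 1-D periodic field within +/- r grid intervals of grid point k."""
    n = field.shape[0]
    return field[np.arange(k - r, k + r + 1) % n]

def dl_enkf_analysis_point(k, r, enkf_analysis, forecast_mean, obs, obs_mask, dnns):
    """DL-EnKF analysis at grid point k as the average of DNN outputs (Fig. 1a).
    Missing observations are supplemented by pseudo-observations taken from the
    EnKF analysis; direct (identity) observations of grid points are assumed."""
    y = np.where(obs_mask, obs, enkf_analysis)      # observations + pseudo-observations
    avail = np.where(obs_mask, 1.0, -1.0)           # availability flag: 1 observed, -1 missing
    x_in = np.concatenate([local_window(f, k, r)
                           for f in (enkf_analysis, forecast_mean, y, avail)])
    return np.mean([dnn(x_in) for dnn in dnns])     # average over the ensemble of DNNs
```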

Fig. 1.

(a) Workflow of DL-EnKF and (b) workflow to train a DNN for DL-EnKF. See text for details.

The analysis ensemble $\{x^a_m\}_{m=1}^{M}$, where M is the ensemble size, is created by modifying the EnKF analysis ensemble $\{x^a_{\text{EnKF},m}\}_{m=1}^{M}$ such that its ensemble mean is equal to the DL-EnKF analysis $x^a_{\text{DL-EnKF}}$ as follows:

$$x^a_m = x^a_{\text{DL-EnKF}} + \alpha \left( x^a_{\text{EnKF},m} - x^a_{\text{EnKF}} \right), \quad m = 1, \ldots, M, \qquad (1)$$

where $x^a_{\text{EnKF}}$ is the EnKF analysis (the ensemble mean of the EnKF analysis ensemble) and α is a parameter for adjusting the spread of the analysis ensemble. If adaptive covariance inflation is used in the EnKF, we can set α to 1 as the effect of tuning α is almost canceled by this procedure. However, if we conduct ensemble forecasts using the analysis ensemble, we may need to adjust the value of α. The members of the analysis ensemble thus generated are evolved by the time integration of a numerical model to prepare the forecast ensemble for the next analysis time.
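A minimal sketch of this recentering step (Eq. 1), assuming the ensemble is stored as an array of shape (M, K):

```python
import numpy as np

def dl_enkf_analysis_ensemble(enkf_ensemble, dl_enkf_analysis, alpha=1.0):
    """Recenter the EnKF analysis ensemble on the DL-EnKF analysis (Eq. 1).
    enkf_ensemble: EnKF analysis ensemble, shape (M, K);
    dl_enkf_analysis: DL-EnKF analysis, shape (K,); alpha: spread-adjustment parameter."""
    enkf_mean = enkf_ensemble.mean(axis=0)           # EnKF analysis (ensemble mean)
    deviations = enkf_ensemble - enkf_mean           # EnKF analysis deviation ensemble
    return dl_enkf_analysis + alpha * deviations     # ensemble mean equals the DL-EnKF analysis
```

With adaptive covariance inflation in the EnKF part, alpha is simply left at 1, as described above.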

For the training of a DNN, we use the EnKF analysis and forecast provided by an EnKF run, as shown in Fig. 1b. The weights, including bias parameters, are optimized by minimizing a loss function that measures the difference between the output of the DNN and the target of training. We prepare several DNNs by randomly initializing the weights before the training.

One of the reasons for including the EnKF analysis in the inputs of DNNs is that this analysis at a grid point contains some information on the forecast, observations, and forecast error covariance in a domain within the covariance localization radius, so that we can reduce the input radius of DNNs and implicitly utilize some information of the forecast error covariance. In addition, even if DNNs cannot deal with some observational data because the input nodes for these observations are absent, they are assimilated by the EnKF part of DL-EnKF and their information is partly provided to the deep learning part through the EnKF analysis.

We can prepare pseudo-observations by other methods. However, it is easily shown by the sequential assimilation method (e.g., Houtekamer and Mitchell 2001) that if observation errors are independent of each other, the additional assimilation of pseudo-observations thus generated does not change the EnKF analysis. Therefore, it can be considered that the pseudo-observations are assimilated in the EnKF part of DL-EnKF along with the real observations, and that the same observations, including the pseudo-observations, are provided to the deep learning part. In this sense, the method adopted in this study may be a natural choice, although it may not be optimal and spurious correlations and biases will be generated. We could expect that DNNs will learn to properly deal with this problem by training. Note that the pseudo-observations also provide some information contained in the EnKF analysis to the deep learning part.

Lawson and Hansen (2004) showed that an analysis ensemble generated by a deterministic EnKF tends to retain multi-modality that may appear in a forecast ensemble, whereas this is not the case for a stochastic EnKF. Therefore, a stochastic EnKF is better than a deterministic EnKF for generating an analysis ensemble of the DL-EnKF. However, it is well known that if the ensemble size is relatively small, a stochastic EnKF is inferior in terms of the accuracy of the ensemble mean because of random perturbations that are added to observations (e.g., Sakov and Oke 2008; Bowler et al. 2013), so we adopt a deterministic EnKF for the EnKF part in this study.

3. Design of experiments

3.1 Outline

The performance of the DL-EnKF is investigated through both perfect and imperfect model experiments using three versions of the 40-variable Lorenz 96 models (Lorenz 1996) and the serial ensemble square root filter (EnSRF; Whitaker and Hamill 2002), which is one of the deterministic EnKFs. The ensemble size of the serial EnSRF is set to 10, because we are interested in the performance of the DL-EnKF with a relatively small ensemble. In this and the next sections, the EnKF means the serial EnSRF unless otherwise stated. The purpose of the perfect model experiments is to clarify the basic performance of the DL-EnKF, whereas that of the imperfect model experiments is to gain insights into the performance of the DL-EnKF when applied to data assimilation in the real atmosphere.
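For reference, the analysis update of a serial EnSRF in the Whitaker and Hamill (2002) form can be sketched as below. The sketch assumes direct observations of single grid points with error variance obs_var and a caller-supplied localization weight for each state variable; it illustrates the algorithm only and is not the authors' implementation.

```python
import numpy as np

def ensrf_assimilate_one(ens, obs_value, obs_idx, obs_var, rho):
    """Assimilate one observation of grid point obs_idx into the ensemble ens (M, K).
    rho: localization weights for every state variable, shape (K,)."""
    m = ens.shape[0]
    x_mean = ens.mean(axis=0)
    x_pert = ens - x_mean
    hx_pert = x_pert[:, obs_idx]                        # perturbations of the observed variable
    hph = hx_pert @ hx_pert / (m - 1)                   # forecast variance in observation space
    pht = x_pert.T @ hx_pert / (m - 1)                  # covariance of the state with it
    gain = rho * pht / (hph + obs_var)                  # localized Kalman gain
    x_mean = x_mean + gain * (obs_value - x_mean[obs_idx])
    beta = 1.0 / (1.0 + np.sqrt(obs_var / (hph + obs_var)))   # square root filter gain factor
    x_pert = x_pert - np.outer(hx_pert, beta * gain)    # deterministic perturbation update
    return x_mean + x_pert

# Observations are processed serially, one at a time, each update using the
# ensemble produced by the previous one.
```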

The experiments consist of two phases: the training phase of DNNs and the test phase using data assimilation experiments. In the training phase, we run the models and the EnKF to prepare training and validation datasets, which are used to train DNNs and to verify the accuracy of the output of DNNs, respectively. The length of the period and the time interval of these datasets are 1000 and 1, respectively. This large time interval is taken to ensure that the data points are almost independent of one another. In the test phase, we run the models to prepare a test dataset for the data assimilation experiments. The length of the period of this dataset is also 1000. The accuracy of the DL-EnKF analysis is compared with that of the deep learning and EnKF analyses to evaluate the performance of the DL-EnKF. The workflow to create the deep learning analysis is the same as that of the DL-EnKF analysis, except for the absence of feedback from the deep learning part to the EnKF part (Fig. 2). The analysis accuracy is evaluated by the root mean square error (RMSE), which is the square root of the squared error averaged over the grid points and the period of the test dataset at a time interval of 1.
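The verification metric can be written compactly; a sketch, assuming analyses and true states stored as arrays of shape (T, K) sampled at a time interval of 1:

```python
import numpy as np

def rmse(analysis, truth):
    """Square root of the squared error averaged over grid points and verification times."""
    return float(np.sqrt(np.mean((analysis - truth) ** 2)))
```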

Fig. 2.

Workflow to generate deep learning analysis.

In the perfect model experiments, we use the original 40-variable Lorenz 96 model and conduct two types of experiments, Exp-PA and Exp-PB, in which the targets of training are different. The target in Exp-PA is the true state generated by the model, whereas the target in Exp-PB is an analysis by the stochastic EnKF (Evensen 1994; Burgers et al. 1998) with an ensemble size of 1000. This analysis is hereafter referred to as the EnKF1000 analysis. Although this ensemble size may be unrealistic for the 40-variable model, the purpose of Exp-PB is to examine the performance of the DL-EnKF when an analysis with a high accuracy is used as a target.

In the imperfect model experiments, the two-scale Lorenz 96 model with 40 large-scale variables and 400 small-scale variables is used as a substitute for the real atmosphere, whereas a parameterized Lorenz 96 model with a parameterization of large-scale forcing by small-scale variables is used as a substitute for a numerical model of the real atmosphere. We conduct two types of experiments, Exp-IA and Exp-IB. In Exp-IA, we train DNNs using the simulation data generated by the parameterized model and conduct the data assimilation experiment using observations generated by the two-scale model. The idea behind Exp-IA is that if the dynamical system and the observing system used for the training of a DNN resemble real-world systems, we could expect that the data assimilation method the DNN has learned by training will also work in real-world applications. In Exp-IB, the two-scale Lorenz 96 model is used for the training and data assimilation experiments in a perfect model scenario for comparison with Exp-IA.

Table 1 summarizes the models used in the training and test phases of the experiments. The following subsections describe further details of the experimental design.

3.2 Models

The governing equations of the Lorenz 96 model for the perfect model experiments are

$$\frac{dX_k}{dt} = (X_{k+1} - X_{k-2})\,X_{k-1} - X_k + F, \qquad (2)$$

for k = 1, ..., K, satisfying periodic boundary conditions: $X_{-1} = X_{K-1}$, $X_0 = X_K$, and $X_{K+1} = X_1$. The number of variables K and the forcing parameter F are set to 40 and 8, respectively. Note that as the number of positive Lyapunov exponents of the model is 13 for those parameter values (Lorenz and Emanuel 1998), the ensemble size of 10 is not very small. The leading Lyapunov exponent corresponds to a doubling time of 0.42 (Lorenz and Emanuel 1998). When the nonlinearity in data assimilation is controlled by changing the time interval between observations, as in this study, this value can be used as a reference for estimating the degree of nonlinearity. The fourth-order Runge–Kutta scheme is adopted for the time integration of the model with a time step of 0.01.
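The tendency and the RK4 stepping can be sketched as follows, using numpy.roll for the periodic boundary conditions:

```python
import numpy as np

def lorenz96_tendency(x, forcing=8.0):
    """dX_k/dt = (X_{k+1} - X_{k-2}) X_{k-1} - X_k + F on a periodic domain (Eq. 2)."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def rk4_step(x, dt=0.01, forcing=8.0):
    """One fourth-order Runge-Kutta step of length dt."""
    k1 = lorenz96_tendency(x, forcing)
    k2 = lorenz96_tendency(x + 0.5 * dt * k1, forcing)
    k3 = lorenz96_tendency(x + 0.5 * dt * k2, forcing)
    k4 = lorenz96_tendency(x + dt * k3, forcing)
    return x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
```

The same stepping applies to the other two models with their own tendencies and time steps.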

The governing equations of the two-scale Lorenz 96 model for the imperfect model experiments are

$$\frac{dX_k}{dt} = (X_{k+1} - X_{k-2})\,X_{k-1} - X_k + F - \frac{hc}{b}\sum_{j=1}^{J} Y_{j,k}, \qquad (3)$$

$$\frac{dY_{j,k}}{dt} = cb\,(Y_{j-1,k} - Y_{j+2,k})\,Y_{j+1,k} - c\,Y_{j,k} + \frac{hc}{b}\,X_k, \qquad (4)$$

for k = 1, ..., K and j = 1, ..., J, where $\{X_k\}$ are large-scale variables and $\{Y_{j,k}\}$ are small-scale variables, satisfying periodic boundary conditions: $X_{-1} = X_{K-1}$, $X_0 = X_K$, $X_{K+1} = X_1$, $Y_{0,1} = Y_{J,K}$, $Y_{J+1,K} = Y_{1,1}$, and $Y_{J+2,K} = Y_{2,1}$. To make Eq. (4) meaningful, we further define $Y_{0,k} = Y_{J,k-1}$, $Y_{J+1,k} = Y_{1,k+1}$, and $Y_{J+2,k} = Y_{2,k+1}$. Large- and small-scale variables interact with each other through the last terms on the right-hand side of Eqs. (3) and (4). We set the parameters as follows: K = 40, J = 10, F = 10, h = 1, c = 10, and b = 10. These values are the same as those used by Lorenz (1996) except for K. Note that the forcing parameter F is larger than that in the perfect model experiments. The fourth-order Runge–Kutta scheme is adopted for the time integration of the model with a time step of 0.005.

We also need the parameterized Lorenz 96 model in the imperfect model experiments. Although advanced parametrization methods, such as stochastic parametrization (e.g., Wilks 2005) and machine learning-based parametrization (e.g., Schneider et al. 2017), are available, a simple function fitting is adopted in this study; the last term on the right-hand side of Eq. (3) is approximated by a linear function of $X_k$. The reason we adopt such a simple approach is that we intend to demonstrate that even an unsophisticated imperfect model works well for the training of a DNN. Then, the governing equations of the parameterized Lorenz 96 model are

$$\frac{dX_k}{dt} = (X_{k+1} - X_{k-2})\,X_{k-1} - X_k + F + a_1 X_k + a_0, \qquad (5)$$

for k = 1, ..., K, satisfying periodic boundary conditions: $X_{-1} = X_{K-1}$, $X_0 = X_K$, and $X_{K+1} = X_1$. The number of variables K and the forcing parameter F are set to 40 and 10, respectively, to be consistent with the two-scale Lorenz 96 model. The constants $a_1$ and $a_0$ are to be determined by the function fitting. The fourth-order Runge–Kutta scheme is adopted for the time integration of the model with a time step of 0.01.
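The constants could be obtained, for example, by an ordinary least-squares fit of the subgrid forcing diagnosed from a two-scale model run against the large-scale variables; a sketch, assuming the coupling term of Eq. (3) and samples X of shape (N, K) and Y of shape (N, J, K):

```python
import numpy as np

def fit_parameterization(X, Y, h=1.0, c=10.0, b=10.0):
    """Fit the large-scale forcing -(h*c/b) * sum_j Y_{j,k} as a1 * X_k + a0 (Eq. 5)."""
    forcing = -(h * c / b) * Y.sum(axis=1)               # forcing by small-scale variables
    a1, a0 = np.polyfit(X.ravel(), forcing.ravel(), 1)   # ordinary least squares, degree 1
    return a1, a0
```

Section 4.2 reports the values of $a_1$ and $a_0$ used in the experiments.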

The three models are integrated from t = 0 to t = 2050 for preparing the training and validation datasets. The initial condition at each grid point is F plus an independent random number drawn from the normal distribution with mean 0 and variance 1, except that the small-scale variables of the two-scale Lorenz 96 model are set to 0 at the initial time. The data from t = 51 to t = 1050 are used for preparing the training dataset, and those from t = 1051 to t = 2050 for the validation dataset. Another time integration of the models, from t = 0 to t = 1050 with initial conditions generated by using a different random number sequence, is conducted for preparing the test dataset, and the state variables from t = 51 to t = 1050 are used as the true state (target) for computing the analysis error.

3.3 Observations

Observations are generated by adding random errors to the results of the time integration of the models. The observation errors are independent random draws from the normal distribution with mean 0 and variance 1. Observations used in the imperfect model experiments are large-scale variables of the two-scale Lorenz 96 model, except for the training phase of Exp-IA, in which observations are prepared by the parameterized Lorenz 96 model.

Nonlinearity in data assimilation is controlled by changing the time interval between observations Δt. All experiments are performed for three values of Δt : 0.05, 0.20, and 0.50. The case of Δt = 0.05 corresponds to a weakly nonlinear case, and that of Δt = 0.50 corresponds to a strongly nonlinear case. Note that the latter value is close to the doubling time 0.42 mentioned in Section 3.2, and that Penny and Miyoshi (2016) used Δt = 0.50 for their experiments of a local PF. All observations are prepared such that observations at the same analysis time are the same, regardless of the time interval between observations.

For the spatial distribution of observations, two cases are examined. In one case, observations are available at all grid points, and the number of observations is always 40. In other words, observations are available at each grid point with a probability of 1. In the other case, observations are available at each grid point with a probability of 1/2. It is assumed that events at which an observation exists are independent of each other in space and time so that the spatial distribution of observations randomly changes at every observation time. The average number of observations is 20, and the standard deviation of the number of observations is $\sqrt{40 \times (1/2) \times (1/2)} = \sqrt{10} \approx 3.2$. Hence, the number of pseudo-observations used by deep learning is about the same as that of observations. The probability of observations is hereafter denoted by p.
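Observations for one analysis time could thus be generated as in the following sketch (the helper name and the use of NaN for missing values are illustrative assumptions; rng is a NumPy random generator):

```python
import numpy as np

def generate_observations(truth, p, rng):
    """Add N(0, 1) observation errors and keep each grid point with probability p."""
    obs = truth + rng.standard_normal(truth.shape)   # observation error variance 1
    mask = rng.random(truth.shape) < p               # True where an observation exists
    return np.where(mask, obs, np.nan), mask

rng = np.random.default_rng(0)
```

The mask is also what is passed to the deep learning part as the availability of observations, and the masked-out values are later replaced by pseudo-observations computed from the EnKF analysis.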

3.4 Data assimilation by EnKF

Covariance localization and covariance inflation are needed to optimize the performance of the EnKF. The correlation function defined by Eq. (4.10) of Gaspari and Cohn (1999) is taken for covariance localization. The parameter c in this equation is regarded as the localization radius rL (unit: grid intervals) in this study, at which radius the correlation coefficient decreases to 5/24. An adaptive inflation method proposed by Li et al. (2009) is used for multiplicative covariance inflation. This method is based on the innovation statistics by Desroziers et al. (2005). Li et al. (2009) imposed lower and upper limits in the “observed” inflation factor Δ̃o before applying a smoothing procedure: 0.9 ≤ Δ̃o ≤ 1.2. As we conduct data assimilation over a much wider range of the time interval between observations Δt, we optimize the upper limit of Δ̃o for each set of parameters (rL, Δt, p), leaving the lower limit at 0.9. The candidates of the upper limit are 1.2, 1.3, 1.4, 1.5, 2.0, 3.0, 5.0, and no limit. In addition, although Li et al. (2009) set the error growth parameter κ to 1.03, we adopt a larger value κ = 1.1 because this value leads to a better analysis in this study. A set of values of rL and the upper limit of Δ̃o with the best analysis accuracy is hereafter referred to as the optimal parameters. We determine the optimal parameters for each pair of (Δt, p) in Exp-PA and Exp-IA by data assimilation experiments using the target and observations in each training dataset. The optimal parameters thus determined are also used in Exp-PB and Exp-IB, respectively, unless otherwise stated.
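The fifth-order piecewise rational function of Gaspari and Cohn (1999, their Eq. 4.10) used for covariance localization can be coded as below; with the separation scaled by c, it equals 5/24 at one localization radius and vanishes beyond two radii, consistent with the definition of rL above.

```python
import numpy as np

def gaspari_cohn(dist, c):
    """Gaspari-Cohn (1999, Eq. 4.10) correlation for separation dist and length scale c."""
    r = np.abs(np.asarray(dist, dtype=float)) / c
    w = np.zeros_like(r)
    inner = r <= 1.0
    outer = (r > 1.0) & (r < 2.0)
    ri, ro = r[inner], r[outer]
    w[inner] = (-0.25 * ri**5 + 0.5 * ri**4 + 0.625 * ri**3
                - 5.0 / 3.0 * ri**2 + 1.0)
    w[outer] = (ro**5 / 12.0 - 0.5 * ro**4 + 0.625 * ro**3
                + 5.0 / 3.0 * ro**2 - 5.0 * ro + 4.0 - 2.0 / (3.0 * ro))
    return w
```

On a periodic domain of K grid points, the localization weights for grid point k are, for example, gaspari_cohn(np.minimum(d, K - d), rL) with d = np.abs(np.arange(K) - k).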

In Exp-PB, the target of training is the EnKF1000 analysis yielded by the stochastic EnKF with an ensemble size of 1000, as mentioned in Section 3.1. The reason we adopt the stochastic EnKF is that when the ensemble size is very large, the serial EnSRF tends to deteriorate and to become less accurate than the stochastic EnKF. We can avoid this problem with the serial EnSRF by applying the mean-preserving random rotation of an analysis ensemble (Sakov and Oke 2008), but the additional computational cost is very large for the rotation of a 1000-member ensemble. Figure 3 compares the analysis accuracy of the two EnKFs for ensemble sizes of 10 (cold colors) and 1000 (warm colors), plotting the RMSEs averaged over the period from t = 51 to t = 1050. This result is obtained by using the target and observations in the training datasets of Exp-PA. The localization radius and the upper limit of Δ̃o are optimized in the case of the ensemble size 10, whereas no covariance localization is applied and the upper limit of Δ̃o is set to 1.2 in the case of the ensemble size 1000. Figure 3 reveals that in the latter case the RMSE of the stochastic EnKF is smaller than that of the serial EnSRF for all values of the time interval between observations Δt and probability of observations p. Note that the serial EnSRF with an ensemble size of 10 outperforms the serial EnSRF with an ensemble size of 1000 in three cases: (Δt, p) = (0.05, 1), (0.05, 1/2), and (0.20, 1).

Fig. 3.

Comparison of RMSEs between the serial EnSRF (abscissa) and stochastic EnKF (ordinate) for the training dataset of Exp-PA. Dots in cold colors are for an ensemble size 10 and dots in warm colors are for an ensemble size 1000. Dots in dark colors are for the probability of observations 1 and dots in light colors are for the probability of observations 1/2. The three dots in the same color correspond to the observation time intervals 0.05, 0.20, and 0.50 from left to right.

In Exp-IB, the two-scale Lorenz 96 model is used to assimilate observations of large-scale variables. As noted by Tsuyuki (2019), when the ensemble size is relatively small, forecast correlations between large- and small-scale variables are not reliable. Hence, these forecast correlations are neglected in the EnKF, and the analysis ensemble of small-scale variables is left unchanged from the forecast ensemble at each analysis time.

3.5 Deep learning

A simple feedforward neural network with the same number of nodes for all hidden layers is adopted for the deep learning part of the DL-EnKF. As we assume that the input radius of DNNs is relatively small, a convolutional neural network would not be needed. We could use a recurrent neural network instead of the feedforward neural network to utilize the information obtained by the previous processing in deep learning, but it is not adopted in this study for simplicity.

The inputs of a DNN are the EnKF analysis using the optimal parameters, forecast, observations, availability of observations, and pseudo-observations in a small domain centered on an analysis grid point within the input radius rI (unit: grid intervals). The availability of observation at a grid point is set to 1 if the observation is available and to −1 if not available. The DNN assumes that observations are always available at all grid points, and we supplement missing observations with pseudo-observations. As the availability of observations is not necessary in the case of p = 1, the input layer of the DNN has 3(2rI + 1) nodes for p = 1 and 4(2rI + 1) nodes for p = 1/2. The number of hidden layers is set to 5 and the number of nodes per hidden layer is optimized, as mentioned later. As the balance of analysis (Kalnay 2002) is not a serious issue in Lorenz 96 models, we let the output of the DNN be the analysis value at the analysis grid point only. For Exp-IB, in which the two-scale Lorenz 96 model is used in the EnKF part of DL-EnKF, the input and output of the DNN are large-scale variables only.

Table 2 summarizes the architecture and training of the DNN. All input and output data except for the availability of observations are normalized by using the mean and standard deviation of the target state in the training dataset of Exp-PA for the perfect model experiments and of Exp-IB for the imperfect model experiments. As the statistical behavior of the models does not depend on the location of a grid point, the data at all grid points are used to prepare the training and validation datasets. Hence, the number of samples of each dataset is 40 × 1000 = 40000. The training dataset is split into small batches called mini-batches that are used to compute the loss function and update the weights of the DNN. Learning rate decay is adopted in the training to avoid the situation in which the DNN converges toward the minima in a noisy manner and ends up oscillating far away from the actual minima. The number of epochs is the number of times each element in the training dataset is used by the DNN for optimizing the weights. For most cases, iterations almost converge within several epochs. We use PyTorch (Paszke et al. 2019) as the deep learning software.
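A sketch of such a feedforward network and its mini-batch training in PyTorch follows. The layer width, activation function, optimizer, number of epochs, and decay rate are placeholders standing in for the settings of Table 2, which is not reproduced here.

```python
import torch
import torch.nn as nn

def build_dnn(r_in, p, n_nodes, n_hidden=5):
    """Feedforward DNN with 3(2*r_in+1) inputs for p = 1 or 4(2*r_in+1) for p = 1/2,
    n_hidden hidden layers of n_nodes each, and a single output (the analysis value)."""
    n_in = (3 if p == 1 else 4) * (2 * r_in + 1)
    layers, width = [], n_in
    for _ in range(n_hidden):
        layers += [nn.Linear(width, n_nodes), nn.ReLU()]
        width = n_nodes
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

def train(dnn, loader, n_epochs=10, lr=1e-3, decay=0.9):
    """Mini-batch training with a mean-square-error loss and learning rate decay."""
    opt = torch.optim.Adam(dnn.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=decay)
    loss_fn = nn.MSELoss()
    for _ in range(n_epochs):
        for x, target in loader:           # loader yields (input, target) mini-batches
            opt.zero_grad()
            loss_fn(dnn(x), target).backward()
            opt.step()
        sched.step()                       # decay the learning rate after each epoch
    return dnn
```

Several such networks, each trained from a different random initialization of the weights, form the ensemble of DNNs used in the DL-EnKF.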

To determine the optimal number of nodes per hidden layer, we train two DNNs with 5 and 10 hidden layers by changing the number of nodes using the training and validation datasets of Exp-PA. Figure 4 plots the RMSEs of the two DNNs against the input radius rI for the case of Δt = 0.50 and p = 1. The RMSEs are computed by using the validation dataset for this case. For the DNN with 10 hidden layers and 5 nodes per hidden layer, the training fails for six values of the input radius, so the RMSE of this case is not plotted in Fig. 4b. As the RMSE is not very different between 5 and 10 hidden layers, we adopt the DNN with 5 hidden layers. Figure 4 shows that when the input radius and number of nodes are large to some extent, the RMSE tends to increase because of the generalization error of deep learning. As the DNN with 20 nodes per hidden layer has the smallest RMSE for most of the input radii, we set the optimal number of nodes to 20 for the case of Δt = 0.50 and p = 1.

Fig. 4.

Comparison of RMSEs of a DNN with 5 (light green), 10 (orange), 20 (red), 30 (green), 40 (cyan), and 50 (blue) nodes per hidden layer for the validation dataset of Exp-PA. The RMSEs are plotted against the input radius. The number of hidden layers is (a) 5 and (b) 10. The observation time interval is 0.50 and the probability of observations is 1.

The optimal numbers of nodes of the DNN with 5 hidden layers are summarized in Fig. 5 by blue bars for p = 1 and by cyan bars for p = 1/2. They tend to increase as the time interval between observations increases because the estimation of state variables becomes more difficult as nonlinearity increases. Although this result is obtained for Exp-PA, those numbers of nodes are used in all experiments. We also compute the optimal numbers of nodes for the case where the EnKF analysis is not included in the inputs of the DNN. These numbers are plotted in Fig. 5 by red bars for p = 1 and by orange bars for p = 1/2. The inclusion of the EnKF analysis tends to reduce the optimal number of nodes. This is probably because the EnKF analysis plays the role of a first guess and makes it easier to estimate state variables.

Fig. 5.

The optimal number of nodes per hidden layer of a DNN with 5 hidden layers. The abscissa is the observation time interval. Blue and cyan bars are for the case of including EnKF analysis in input for the probability of observations 1 and 1/2, respectively. Red and orange bars are for the case of not including EnKF analysis in input for the probability of observations 1 and 1/2, respectively.

The appendix discusses the impacts of the increase in the ensemble size of EnKF and the sample size for training on the accuracy of output of a DNN. The result of the experiments shows that when the ensemble size of EnKF is increased to 40, the improvement by deep learning is considerably reduced, and that we need to increase the sample size much further to obtain a larger improvement.

3.6 Data assimilation by DL-EnKF

In the data assimilation experiments with the DL-EnKF, the EnKF part of DL-EnKF adopts the optimal parameters, and the DL-EnKF analysis is the average of outputs of 5 or 10 DNNs. As adaptive covariance inflation is used in the EnKF part, the parameter α in Eq. (1) is set to 1. In the test phase of Exp-IB, the deep learning part receives only large-scale variables from the EnKF part and generates the DL-EnKF analysis of large-scale variables. The EnKF analysis ensemble of large-scale variables is modified by using this analysis, whereas the analysis ensemble of small-scale variables is left unchanged from the one generated by the EnKF part. The RMSEs of the DL-EnKF, deep learning, and EnKF analysis in Exp-IB are computed by using large-scale variables only.

4. Results

4.1 Perfect model experiments

The first issue to be clarified is whether deep learning can outperform the EnKF in terms of analysis accuracy. The EnKF is close to optimal in weakly nonlinear regimes, and the deep learning part of DL-EnKF does not explicitly utilize the forecast error covariance. Figure 6a compares the analysis accuracy between deep learning and the EnKF in Exp-PA for all values of Δt and p, in which the RMSEs are plotted against the RMSE of EnKF for the input radius of 2 grid intervals. The dots indicate the RMSEs of a single DNN, and the short horizontal bars indicate the RMSE of the average of outputs from 5 DNNs. This figure shows that all RMSEs of a single DNN are the same for each case and that deep learning outperforms the EnKF when Δt = 0.50. Note that as the EnKF analysis is included in the inputs of DNNs, the accuracy of deep learning analysis does not become worse than that of the EnKF analysis if sufficient training samples are available.

Fig. 6.

Comparison of RMSEs between (a) EnKF (abscissa) and deep learning with 5 DNNs (ordinate), (b) EnKF and DL-EnKF with 5 DNNs, (c) EnKF and deep learning with 10 DNNs, and (d) EnKF and DL-EnKF with 10 DNNs for Exp-PA. The probability of observations 1 is in blue and 1/2 in red, and the input radius is 2 grid intervals. Dots indicate RMSEs based on a single DNN, and short horizontal bars indicate RMSEs based on an ensemble of DNNs. The three groups of dots and a horizontal bar in the same color correspond to the observation time intervals 0.05, 0.20, and 0.50 from left to right.

The second issue is whether the accuracy of the DL-EnKF analysis is better than that of the EnKF and deep learning analyses. As mentioned in the introduction, the analysis by deep learning is based on the minimization of the sum of errors over many samples and not optimized for each analysis time. Hence, the analysis accuracy may deteriorate during assimilation cycles. Figure 6b compares the analysis accuracy between the DL-EnKF and EnKF for Exp-PA. The dots indicate the RMSEs of DL-EnKF when the output of a single DNN is used as the DL-EnKF analysis, and the horizontal bars indicate the RMSEs when the average of outputs from 5 DNNs is used as the DL-EnKF analysis. When Δt = 0.05 the RMSEs of DL-EnKF based on a single DNN are scattered and larger than that of EnKF. In other words, the accuracy of the deep learning analysis shown in Fig. 6a is not maintained during assimilation cycles in a weakly nonlinear case. Taking the average over 5 DNNs does not improve the accuracy very much. When Δt = 0.20 and 0.50, on the other hand, the RMSEs of DL-EnKF based on a single DNN become almost the same for each case and taking the average over 5 DNNs slightly improves the accuracy in the case of Δt = 0.50. In addition, a comparison of Figs. 6a and 6b shows that the RMSE of DL-EnKF is smaller than that of deep learning when Δt = 0.50 because of positive feedback in assimilation cycles.

We conduct additional experiments in which the ensemble size of DNNs is increased to 10 in Exp-PA. The initial conditions of weights used for the training are different from those used in the case of 5 DNNs. Results are presented in Figs. 6c and 6d, and the former figure looks the same as Fig. 6a. The benefit of taking the average over 10 DNNs for the DL-EnKF is evident when Δt = 0.05, although its analysis accuracy is still lower than that of EnKF. When Δt = 0.20 and Δt = 0.50, the RMSEs of DL-EnKF are almost the same as in the case of 5 DNNs. This suggests that an ensemble size of 5 is sufficient except for a weakly nonlinear case. Then, all the results shown below are based on the average of outputs from 5 DNNs, because our interest is primarily in the performance of the DL-EnKF in strongly nonlinear regimes.

Figure 7 compares the time sequences of RMSEs of the EnKF (red lines) and the DL-EnKF (blue lines) in the case of p = 1. When Δt = 0.05 (Fig. 7a), the DL-EnKF is outperformed by the EnKF during the whole period. When Δt = 0.20 (Fig. 7b), the analysis accuracy of the two methods is close; the correlation coefficient between the two RMSEs computed for the period from t = 51 to t = 1050 is 0.761. When Δt = 0.50 (Fig. 7c), the EnKF sometimes exhibits a significant deterioration of accuracy, but the DL-EnKF does not show such a tendency. This result demonstrates an excellent performance of the DL-EnKF in strongly nonlinear regimes.

Fig. 7.

Time sequences of RMSEs of EnKF (red lines) and DL-EnKF (blue lines) for observation time intervals (a) 0.05, (b) 0.20, and (c) 0.50 for Exp-PA. The probability of observations is 1 and the input radius is 2 grid intervals.

The third issue is whether the optimal input radius of deep learning is smaller than the optimal localization radius of the EnKF. Figure 8 plots the RMSEs of EnKF (orange lines), deep learning (green lines), and DL-EnKF (blue lines) analysis against the input radius for all cases of Exp-PA (solid lines) and Exp-PB (broken lines). The RMSE of EnKF in Exp-PB is the same as that in Exp-PA. An orange broken line indicates the RMSE of the EnKF1000 analysis used for the training in Exp-PB. The optimal localization radius is plotted by a red arrow, except for the case of (Δt, p) = (0.05, 1/2) where the optimal localization radius is 11 grid intervals. The RMSEs of EnKF and deep learning overlap in Figs. 8a and 8b, and the RMSEs except for the EnKF1000 analysis almost overlap in Figs. 8c and 8d.

Fig. 8.

Comparison of RMSEs of EnKF (orange lines), deep learning (green lines), and DL-EnKF (blue lines) for Exp-PA (solid lines) and Exp-PB (broken lines). An orange broken line indicates the RMSE of EnKF1000 analysis used for training in Exp-PB. The RMSEs are plotted against the input radius, and a red arrow indicates the optimal localization radius of EnKF. The observation time interval and the probability of observations are (a) 0.05 and 1, (b) 0.05 and 1/2, (c) 0.20 and 1, (d) 0.20 and 1/2, (e) 0.50 and 1, and (f) 0.50 and 1/2, respectively.

When Δt = 0.05 (Figs. 8a, b), the DL-EnKF is outperformed by the EnKF. Reflecting that an ensemble of 5 DNNs is not sufficient in this regime (see Fig. 6b), the graphs of the DL-EnKF are not smooth because of large sampling errors. When Δt = 0.20 (Figs. 8c, d), the two data assimilation methods exhibit almost the same accuracy, but when Δt = 0.50 (Figs. 8e, f), the DL-EnKF outperforms the EnKF irrespective of the input radius. The accuracy of the DL-EnKF analysis is higher than that of the deep learning analysis for the latter case because of positive feedback in assimilation cycles. We can conclude that the input radius of 2 grid intervals is sufficient to attain the best accuracy of the DL-EnKF analysis. This value is smaller than the optimal localization radii for both p = 1 and p = 1/2. Even if the input radius is further increased, the accuracy of the DL-EnKF and deep learning analyses remains almost the same, although slight degradations are seen because of the generalization error of deep learning. This small sensitivity of the RMSEs to the input radius indicates that the information at distant grid points contributes little to the DL-EnKF analysis even within the localization radius of the EnKF. The inclusion of the EnKF analysis in the inputs of DNNs may also contribute to this insensitivity.

Another point to be noted in Figs. 8e and 8f is that even if DNNs are trained on the EnKF1000 analysis, the accuracy of the DL-EnKF analysis is not very different from the one trained on the true state. Given the large errors of the EnKF1000 analysis shown in Figs. 8e and 8f, this result may look surprising. That is probably because this analysis well represents the basic dynamics of the Lorenz 96 model, despite the large errors. If the difference in the ensemble size between the DL-EnKF and the target analysis is decreased, the accuracy of the DL-EnKF analysis in Exp-PB deteriorates more. For instance, according to an additional experiment in which the ensemble size of the DL-EnKF is set to 40 for the case of Δt = 0.50 and p = 1 (see the appendix), the RMSEs of the EnKF analysis, DL-EnKF analysis in Exp-PA, and DL-EnKF analysis in Exp-PB are 0.682, 0.617, and 0.638, respectively, for the input radius of 2 grid intervals. The corresponding values for the ensemble size of 10 are 0.798, 0.675, and 0.689 (see Fig. 8e), so the deterioration of accuracy in Exp-PB is still not very large.

Finally, we examine the impact of including the EnKF analysis in the inputs of DNNs on the accuracy of the deep learning analysis using the test datasets of Exp-PA. Figure 9 plots the RMSE of deep learning in which the EnKF analysis is not included (cyan lines) and the RMSE in which the EnKF analysis is included (green lines) against the input radius. The green lines are the same as in Fig. 8, and the two RMSEs overlap in Fig. 9f. For comparison, the RMSE of EnKF of which the localization radius is not optimized is also plotted by an orange line against the localization radius with the upper limit of Δ̃o optimized for each localization radius. Note that the RMSE of EnKF does not always attain the minimum at the optimal localization radius, indicated by a red arrow, because the values of the optimal parameters are determined by using the training datasets.

Fig. 9.

Comparison of RMSEs of EnKF (orange line), deep learning not including EnKF analysis in inputs (cyan line), and deep learning including the EnKF analysis in inputs (green line) for Exp-PA. The RMSE of EnKF is computed for each localization radius. The abscissa is the input radius for deep learning and the localization radius for the EnKF. A red arrow indicates the optimal localization radius. The observation time interval and the probability of observations are (a) 0.05 and 1, (b) 0.05 and 1/2, (c) 0.20 and 1, (d) 0.20 and 1/2, (e) 0.50 and 1, and (f) 0.50 and 1/2, respectively.

We can see from Fig. 9 that the accuracy of the deep learning analysis is improved by including the EnKF analysis in the inputs of DNNs except for Fig. 9f, in which the EnKF analysis is too inaccurate to be useful. Figures 9c and 9e also show that this procedure reduces the dependence of the analysis accuracy on the input radius. This is because the EnKF analysis contains some information on the forecast ensemble and observations in a domain within the localization radius. Such a reduction in the dependence brought about by including the EnKF analysis is not clearly seen in Figs. 9d and 9f for p = 1/2, as deep learning partly utilizes the EnKF analysis through pseudo-observations.

4.2 Imperfect model experiments

The parametrization procedure for the parameterized Lorenz 96 model is described in Section 3.2. Figure 10 is the scatter plot between the large-scale variables and the large-scale forcing by small-scale variables. The initial condition is the same as that used for preparing the training dataset of Exp-IB and those data from t = 51 to t = 1050 at a time interval of 1 are plotted. The number of samples is 40000 and the result of linear function fitting is plotted by a straight line. The values of constants in Eq. (5) are a1 = −0.320 and a0 = −0.165. As the slope of this line is negative, the forcing acts on large-scale variables as negative feedback. Figure 11 compares the Hovmöller diagrams of the Lorenz 96 model, parameterized Lorenz 96 model, and large-scale variables of the two-scale Lorenz 96 model. The initial condition is the same as that used in Fig. 10. Note that the forcing parameter F is larger than that in the perfect model experiments. A comparison of the three panels in Fig. 11 shows that the parameterization works well, although the parametrized Lorenz 96 model evolves a little more regularly than the two-scale Lorenz 96 model. Stochastic parameterizations could remedy this defect (Wilks 2005).

Fig. 10.

Scatter plot between large-scale variables (abscissa) and large-scale forcing by small-scale variables (ordinate) of the two-scale Lorenz 96 model. A solid line is the result of linear function fitting.

Fig. 11.

Hovmöller diagrams of (a) the Lorenz 96 model, (b) the parametrized Lorenz 96 model, and (c) large-scale variables of the two-scale Lorenz 96 model.

Figure 12 plots the RMSEs of the EnKF (orange lines), deep learning (green lines), and DL-EnKF (blue lines) analyses against the input radius for all cases of Exp-IA (solid lines) and Exp-IB (broken lines). Note that Exp-IA is conducted in an imperfect model scenario, whereas Exp-IB is conducted in a perfect model scenario for comparison. Unlike Fig. 8, an orange broken line indicates the RMSE of the EnKF using the two-scale Lorenz 96 model. The optimal localization radius of the EnKF is indicated by a red arrow. The RMSEs of EnKF and deep learning for each case overlap in Figs. 12a–d. The RMSE of the EnKF using the two-scale Lorenz 96 model is found to be smaller than the RMSE using the parameterized Lorenz 96 model. We can confirm that the basic performance of the DL-EnKF is the same as that in the perfect model experiments; the DL-EnKF is inferior to the EnKF in a weakly nonlinear case (Figs. 12a, b), whereas the opposite is true in a strongly nonlinear case (Figs. 12e, f), in which the optimal input radius is smaller than the optimal localization radius. A difference from the perfect model experiments is that when Δt = 0.20 (Figs. 12c, d), the accuracy of the DL-EnKF analysis is a little worse than that of the EnKF analysis for p = 1 and a little better for small values of the input radius for p = 1/2.

Fig. 12.

Same as Fig. 8, except for Exp-IA (solid lines) and Exp-IB (broken lines) and that an orange broken line indicates the RMSE of EnKF using the two-scale Lorenz 96 model.

An important point to be noted in Figs. 12e and 12f is that even if DNNs are trained on the training dataset prepared by the parameterized Lorenz 96 model, the improvement in analysis accuracy introduced by the DL-EnKF is not very different from the case where the training dataset is prepared by the two-scale Lorenz 96 model. The former model runs in the data assimilation experiments in the test phase of Exp-IA without any trouble, implying that this model well represents the basic dynamics of the large-scale variables of the two-scale Lorenz 96 model. In contrast, when we use the Lorenz 96 model with F = 10, whose evolution is shown in Fig. 11a, in the above data assimilation experiments, we often experience failures.

5. Summary and discussion

An ensemble data assimilation method based on deep learning was presented, in which an ensemble of DNNs was locally embedded in an EnKF. This method was named the DL-EnKF. The inputs of a DNN were the EnKF analysis, forecast, observations, availability of observations, and pseudo-observations in a small domain centered on an analysis grid point. Missing observations were supplemented with the pseudo-observations created from the EnKF analysis. The DL-EnKF analysis was the average of outputs from an ensemble of DNNs. The DL-EnKF analysis ensemble was generated from the DL-EnKF analysis and the EnKF analysis deviation ensemble. The members of the DL-EnKF analysis ensemble thus generated were evolved by the time integration of a numerical model to prepare the forecast ensemble for the next analysis time.

The performance of the DL-EnKF was investigated through data assimilation experiments in both perfect and imperfect model scenarios using three versions of the Lorenz 96 model and the serial EnSRF with an ensemble size of 10. The target of training in the perfect model experiments was the true state generated by the Lorenz 96 model or the EnKF1000 analysis prepared by the stochastic EnKF with an ensemble size of 1000. In the imperfect model experiments, the true state and observations were provided by the two-scale Lorenz 96 model, whereas the training dataset was prepared by using the parameterized Lorenz 96 model. Nonlinearity in data assimilation was controlled by changing the time interval between observations.

The DL-EnKF was outperformed by the serial EnSRF in a weakly nonlinear case, but it was superior to the serial EnSRF in terms of analysis accuracy in a strongly nonlinear case, despite such a small ensemble size. The DL-EnKF analysis was more accurate than the output of deep learning because of positive feedback in assimilation cycles in the latter case. Even if the target of training was the EnKF1000 analysis or the simulation by the parametrized Lorenz 96 model, the improvement introduced by the DL-EnKF was not very different from the case where the target of training was the true state. The inclusion of EnKF analysis in the inputs of a DNN not only improved the accuracy of the deep learning analysis but also reduced the optimal number of nodes per hidden layer and the dependence of the accuracy on the input radius.

Although the above results were obtained from experiments using toy models, they suggest that the DL-EnKF may be a promising method for data assimilation in strongly nonlinear regimes. The DL-EnKF works with a smaller ensemble size than the PF, and we can prepare a training dataset for deep learning from simulation data by a numerical model used in data assimilation. Observational data and EnKF analysis data generated with a large ensemble could be used for this purpose, but a huge computational cost may be needed to obtain sufficient samples and the period when observations are available is limited.

The DL-EnKF may be suitable for data assimilation in cloudy or convective regions in the atmosphere to assimilate radar and satellite observations. We need to extend the inputs and output of a DNN in the vertical to assimilate satellite radiance data, as they are nonlocal observations. As for radial wind data by a Doppler radar, the direction and distance of a radar site differ depending on a grid point. However, if the radar site position relative to the grid point is included in the inputs of a DNN, we can train the DNN collectively regardless of the grid point as in this study.

A couple of issues need to be addressed before applying the DL-EnKF to data assimilation in the atmosphere. In the data assimilation experiments using Lorenz 96 models, the analysis value at a single grid point is sufficient for the output of a DNN, but we need to take the balance of analysis into account for atmospheric data assimilation. One of the methods for ensuring the balance is to extend the output of a DNN to include analysis values at surrounding grid points. Then, the target of training consists of the target state in a small domain centered on an analysis grid point. As the target state is usually well balanced, the DNN could learn the balance. Adding a penalty term for suppressing imbalance to a loss function of the DNN may help enhance the balance. In addition, taking a weighting average of the outputs in adjacent domains may be effective in improving analysis accuracy.

This study demonstrates that the DL-EnKF is inferior to the EnKF in a weakly nonlinear case. It is found that an increase in the ensemble size of DNNs can mitigate this problem, but it would be difficult to increase the ensemble size sufficiently, given the computational cost needed for the training of DNNs. We may need a criterion for replacing the EnKF analysis with the corresponding DL-EnKF analysis in DL-EnKF assimilation cycles. An advanced DNN, such as a recurrent neural network, and some information on the forecast error covariance for input would be useful for improving the performance of DL-EnKF in weakly nonlinear regimes as well as in strongly nonlinear ones.

Data Availability Statement

The Python programs used for Exp-PA in this study are available on J-STAGE Data, https://doi.org/10.34474/data.jmsj.19136891.

Acknowledgments

The authors thank Dr. Arata Amemiya (RIKEN) for his insightful suggestion on the target of training. The comments of two anonymous reviewers were very helpful to improve the manuscript. This work was supported by the Ministry of Education, Culture, Sports, Science and Technology through the program for Promoting Research on the Supercomputer Fugaku (Large Ensemble Atmospheric and Environmental Prediction for Disaster Prevention and Mitigation; hp200128, hp210166), and by the Japan Society for the Promotion of Science through KAKENHI (Study on Uncertainty of Cumulonimbus Initiation and Development Using Particle Filter; JP17H02962).

Appendix

In this study, we perform the experiments using the EnKF with 10 members and 40000 training samples. It may be of interest to examine how the accuracy of output of a DNN changes when the ensemble size and sample size are increased. This appendix presents some results of additional experiments in which the ensemble sizes of EnKF are set to 10 and 40 and the sample sizes are set to 40000, 160000 and 640000. These experiments correspond to Exp-PA for Δt = 0.50. The ensemble size of 40 is the same as the degrees of freedom of the Lorenz 96 model and, according to Fig. 5 of Penny and Miyoshi (2016), an EnKF still outperforms a local PF with this ensemble size. The periods of time integration of the model for preparing the training and validation datasets for the three sample sizes are 2050, 8050, and 32050, respectively, with the first periods of 50 in length discarded.

Figure A1 shows the optimal numbers of nodes per hidden layer of a DNN, obtained by using the validation datasets. The maximum number of nodes is limited to 100. Although we choose the number of nodes that performs best over the various input radii, the determination of the optimal number becomes difficult as the sample size increases. When the number of training samples is increased, the generalization error of deep learning tends to decrease and, therefore, the optimal number of nodes per hidden layer tends to increase.

Fig. A1.

The optimal number of nodes per hidden layer of a DNN with 5 hidden layers for the time interval between observations of 0.50. The abscissa is the number of samples. Blue and cyan bars are for the EnKF ensemble size of 10 for the probability of observations 1 and 1/2, respectively. Red and orange bars are for the EnKF ensemble size of 40 for the probability of observations 1 and 1/2, respectively.

Figure A2 compares the RMSEs between the EnKF and the output of a DNN obtained by using the test datasets of Exp-PA. Note that they are not the average over 5 DNNs, so the RMSEs shown by green lines in Figs. A2a and A2b are different from those in Figs. 8e and 8f, respectively. The optimal localization radius of the serial EnSRF in Fig. A2c is 12 grid intervals. It is found from this figure that the improvement by deep learning is considerably reduced for the ensemble size of 40. When the sample size is increased, the RMSE of the output of a DNN is reduced, but many more training samples are needed to obtain a large improvement.

Fig. A2.

Comparison of the RMSE between EnKF (orange lines) and the output of a DNN with the number of samples of 40000 (green lines), 160000 (blue), and 640000 (cyan) for the observation time interval of 0.50. The ensemble size of EnKF and the probability of observations are (a) 10 and 1, (b) 10 and 1/2, (c) 40 and 1, and (d) 40 and 1/2, respectively. The RMSEs are plotted against the input radius, and a red arrow indicates the optimal localization radius of EnKF.

© The Author(s) 2022. This is an open access article published by the Meteorological Society of Japan under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
https://creativecommons.org/licenses/by/4.0/