Application of Ridge Estimation to Growth Functions

Growth functions are often used to describe longitudinal growth processes. In growth analysis, it is first necessary to estimate the parameters included in the growth function from real-life data. In a numerical estimation process, this is usually accomplished with an iterative algorithm, and the initial-value setting is key to the iterative process. An improper initial value is risky because it may lead to non-convergence. Another risk is "convergence to an improper solution" without an error message; in such a scenario, because the estimation process completes normally, it is difficult to identify the source of the error. To resolve this problem, we turn our attention to ridge estimation and how it applies to growth analysis. The ridge estimation method stabilizes the estimation process by imposing an ℓ2-norm condition as a constraint. In this paper we evaluate the performance of the ridge estimation technique in two cases of numerical experiments. Under several settings of the initial value, we evaluate the behavior of the estimate, especially its dispersion. We conclude that the ridge estimate is more stable than the estimate based on the residual sum of squares (RSS); in fact, the mean squared error of the ridge estimate is smaller than that of the RSS estimate. Although the ridge estimation method is not universal, this paper shows that it can extend the range of initial-value settings that converge to a proper estimate.


Introduction
Growth curves are found in a wide range of disciplines, such as biology, forestry and fishery research. Most living matter grows through successive lag, growth, and asymptotic phases; examples of quantities that follow such growth curves are tree size and the extent of a fish population. The role of the growth function is not only to describe the growth process, but also to remove random fluctuations from the actual growth behavior and aggregate its characteristics into a function. The growth function is expected to reflect the underlying growth process as well as to have high predictive ability. With the understanding that most growth curves are sigmoidal in shape, various growth functions have been proposed depending on the assumptions and constraints on longitudinal behavior. Zeide (1993) compared and investigated 12 types of growth functions. Each of these growth functions has parameters which characterize the properties of the function. In growth analysis, it is first necessary to determine the parameters from real-life growth data. Basic statistical approaches such as the least squares method or the maximum likelihood method are employed in parameter estimation. In a numerical estimation process, the results are obtained through an iterative algorithm (Nelder and Mead 1965). These iterative algorithms need a proper initial-value setting. If an improper initial value is set, there is a high risk that the iterative process will not converge. When this happens, the problem can easily be recognized because it gives rise to an error message. However, there is also the possibility of an unfortunate situation in which there is "convergence to an improper solution" without an error message, even though the estimation process has completed normally.
In a small-scale analysis, such an error or defect can be identified visually through careful observation; in a large-scale analysis, however, there is a high risk of overlooking the error, because the solution appears to have converged. This scenario is explained in the following example. Table 1 shows two types of virtual growth data for ages 2, 3, ..., 10. One series is generated from the Bertalanffy growth function (Bertalanffy 1938), and the other from the Gompertz growth function (Gompertz 1825). In this example, our focus is on the observations for the Bertalanffy growth function (see the plot in Figure 1). We tried to fit the Bertalanffy growth function f(t | θ) = α(1 − e^{−βt})^3 to the observations in Table 1, where t is the independent variable related to time, and θ = (α, β)′ is the parameter vector that characterizes the properties of the function. The parameter α corresponds to the upper bound of the growth curve and β corresponds to its growth speed. Setting the initial value properly, for example (α, β) = (90, 0.3), leads to a proper (true) estimate of (α̂, β̂) = (102.258, 0.393) (dotted line "Proper estimate" in Figure 1).
Here the hat symbol denotes an estimate. On the other hand, an improper setting of the initial value, for example (α, β) = (20, 5), leads to an improper estimate of (α̂, β̂) = (67.152, 12.657) (dotted line "Improper estimate" in Figure 1). It should be noted that even for the latter estimate, the iterative process converges and no error is reported (we use the "optim" function in R; when the "nls" function is used, an error does occur). In other words, the fact that the estimates have converged improperly is troublesome because it is not revealed without careful observation. This is an extreme and easy-to-understand example, but other initial values may lead to local convergence that is not so far from the true estimate. In this paper, we look at this problem through the lens of growth analysis. The above observations were made using R, and the related script is shown in the Appendix. We set two situations: one is the Bertalanffy function y = 100(1 − e^{−0.4t})^3 and the other is the Gompertz function y = 100 exp(−5e^{−0.5t}). From these true settings, we create the virtual growth observations at ages 2, 3, ..., 10 by adding random errors from N(0, 10) to the expected values.
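The scenario above can be reproduced in outline. The following sketch is in Python rather than the R script of the Appendix; the random seed, the interpretation of N(0, 10) as variance 10, and therefore the exact estimates are illustrative assumptions. It fits the Bertalanffy function by Nelder-Mead minimization of the RSS from a proper and an improper start:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
t = np.arange(2, 11)  # ages 2, 3, ..., 10
# Bertalanffy: y = alpha * (1 - exp(-beta * t))**3, true (alpha, beta) = (100, 0.4),
# with N(0, 10) errors (variance 10 assumed)
y = 100 * (1 - np.exp(-0.4 * t))**3 + rng.normal(0, np.sqrt(10), t.size)

def rss(theta):
    alpha, beta = theta
    return np.sum((y - alpha * (1 - np.exp(-beta * t))**3) ** 2)

proper = minimize(rss, x0=[90, 0.3], method="Nelder-Mead")   # proper initial value
improper = minimize(rss, x0=[20, 5], method="Nelder-Mead")   # improper initial value
print(proper.x, proper.success)
print(improper.x, improper.success)  # may report convergence even at a poor local minimum
```

As in the R experiment, both calls report convergence; only inspection of the estimates (or the fitted curve) reveals whether the improper start has been trapped.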
In order to solve this problem, we focus our attention on the ridge estimate proposed by Hoerl and Kennard (1970). In this paper, the ridge estimate is a concept that attempts to stabilize the estimation process by imposing the ℓ2-norm condition as a constraint. Ridge estimation is often applied when the number of samples is small or when the explanatory variables are highly correlated with each other, so that the estimate tends to be unstable. For example, in the linear regression model y = Xβ + ε, the estimate of the parameter β can be obtained from the expression β̂ = (X′X)^{−1}X′y. In ridge estimation, on the other hand, we adopt a ridge parameter λ > 0 and the estimate is obtained from the expression β̂_λ = (X′X + λI_p)^{−1}X′y (Hoerl and Kennard 1970). Here, y, β and ε are the vectors of the response variable, the parameter and the error, and X and I_p are the matrix of explanatory variables and the p-dimensional identity matrix, respectively. We apply this concept to the growth function; a detailed description of the mathematical formulation is given in the section that follows.
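For the linear model, both expressions are one-liners. The sketch below uses synthetic data with a deliberately near-collinear design (an assumed illustration, not data from this paper) to contrast the ordinary and ridge solutions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))
X[:, 2] = X[:, 1] + rng.normal(0, 0.01, n)   # make two columns nearly collinear
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(0, 0.5, n)

lam = 1.0
# ordinary least squares: (X'X)^{-1} X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
# ridge: (X'X + lam * I_p)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

The ridge solution always has ℓ2-norm no larger than the least squares solution, which is the sense in which the λ∥β∥² penalty stabilizes the estimate under collinearity.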

Growth Model and Parameter Estimate
Let us describe the growth model as

E[y(t_i)] = f(t_i | θ), i = 1, ..., n, [2]

where y(t_i) is the growth amount at a given time t_i, E[·] represents the expected value, and n is the number of observations over time. On the right-hand side of equation [2], f(·) is the growth function and θ is a p-dimensional unknown parameter vector included in the growth function. Since the residual at time t_i is expressed as e_i(θ) = y(t_i) − f(t_i | θ), the parameter θ can be estimated by minimizing the residual sum of squares (RSS),

RSS(θ) = Σ_{i=1}^n e_i(θ)^2, [3]

that is, θ̂ = arg min_θ RSS(θ). Hereafter, we refer to such an estimate as the RSS-estimate. The value θ̂ is known to have desirable asymptotic properties, namely consistency, asymptotic unbiasedness, and asymptotic normality, under suitable regularity conditions.

Ridge Estimation
In ridge estimation, we consider the following penalized residual sum of squares (PRSS):

PRSS(θ | λ) = RSS(θ) + λ∥θ∥^2, [4]

where λ > 0 is the ridge parameter and ∥·∥ is the ℓ2-norm. The ridge estimate of the parameter in the growth function is then expressed as θ̂_λ = arg min_θ PRSS(θ | λ). Hereafter, we refer to such an estimate as the Ridge-estimate. Let g(t | θ) and H(t | θ) be the p-dimensional gradient vector and the p × p Hessian matrix of f(t | θ), respectively, defined as

g(t | θ) = ∂f(t | θ)/∂θ, H(t | θ) = ∂^2 f(t | θ)/∂θ∂θ′. [5], [6]

Note that the Hessian matrices of [3] and [4] are n{I(θ) − J(θ)} and n{I(θ) − J(θ)} + λI_p, respectively. These Hessian matrices indicate that [3] does not guarantee convexity with respect to θ, while [4] does when λ is greater than a certain value. We can also see from this result that ridge estimation will be more stable than RSS-estimation, because a convex function has only one local minimum. Unfortunately, it is well known that θ̂_λ has a bias. However, under a fixed λ, the difference between θ̂ and θ̂_λ converges to 0 in probability as n → ∞, because the following approximation holds:

θ̂_λ ≈ θ̂ − λ[n{I(θ̂) − J(θ̂)} + λI_p]^{−1}θ̂. [7]

This indicates that θ̂_λ has the same desirable asymptotic properties as θ̂ under a suitable λ.
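Minimizing the PRSS requires only one extra term relative to the RSS fit. The sketch below shows a Ridge-estimate for the Bertalanffy case; the synthetic data, the starting value, and λ = 0.1 are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
t = np.arange(2, 11)  # ages 2, ..., 10
y = 100 * (1 - np.exp(-0.4 * t))**3 + rng.normal(0, np.sqrt(10), t.size)

def bertalanffy(t, theta):
    alpha, beta = theta
    return alpha * (1 - np.exp(-beta * t))**3

def prss(theta, lam):
    # penalized residual sum of squares: RSS(theta) + lam * ||theta||^2
    resid = y - bertalanffy(t, theta)
    return np.sum(resid**2) + lam * np.sum(np.square(theta))

lam = 0.1                        # illustrative ridge parameter
start = np.array([20.0, 5.0])    # the improper start from the Introduction
fit = minimize(prss, x0=start, args=(lam,), method="Nelder-Mead")
```

For small λ the penalty barely biases a good fit, while the added λI_p term in the Hessian flattens out the worst of the non-convexity.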
Equation [7] is derived as follows. Since θ̂_λ is the minimizer of [4], the following equation is satisfied:

0_p = −Σ_{j=1}^n e_j(θ̂_λ)g(t_j | θ̂_λ) + λθ̂_λ, [8]

where 0_p is the p-dimensional vector of zeros. By applying a Taylor expansion around θ̂ to equation [8], and using the result 0_p = −Σ_{j=1}^n e_j(θ̂)g(t_j | θ̂) derived from equation [8] with λ = 0, we obtain equation [7].
Suppose y(t_1), ..., y(t_n) are independent random variables and y(t_i) is normally distributed with mean f(t_i | θ) and variance σ^2. In order to determine the ridge parameter λ, we minimize the following generalized cross-validation (GCV) criterion with the generalized degrees of freedom (GDF) proposed by Ye (1998):

GCV(λ) = RSS(θ̂_λ) / [n{1 − d_λ/n}^2], [9]

where d_λ is the generalized degrees of freedom, defined as d_λ = Σ_{i=1}^n ∂f(t_i | θ̂_λ)/∂y_i. The original GCV criterion was proposed by Craven and Wahba (1979). The GDF for non-linear regression was derived by Kamo and Yoshimoto (2013). Using the GDF, Fukui, Yamamura and Yanagihara (2015) proposed the GCV criterion with GDF for selecting growth functions. The GDF in ridge estimation is obtained by the same derivation as in Kamo and Yoshimoto (2013); we derive the specific form of d_λ in the following. Differentiating both sides of [8] with respect to y_i yields

0_p = −g(t_i | θ̂_λ) + [n{I(θ̂_λ) − J(θ̂_λ)} + λI_p] ∂θ̂_λ/∂y_i. [10]

Equation [10] implies that

∂θ̂_λ/∂y_i = [n{I(θ̂_λ) − J(θ̂_λ)} + λI_p]^{−1} g(t_i | θ̂_λ). [11]

Hence, we have

∂f(t_i | θ̂_λ)/∂y_i = g(t_i | θ̂_λ)′ [n{I(θ̂_λ) − J(θ̂_λ)} + λI_p]^{−1} g(t_i | θ̂_λ).

Consequently, the GDF in the ridge estimation is given by

d_λ = tr([n{I(θ̂_λ) − J(θ̂_λ)} + λI_p]^{−1} Σ_{i=1}^n g(t_i | θ̂_λ)g(t_i | θ̂_λ)′),

where tr is the trace operator.
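Since d_λ = Σ ∂f(t_i | θ̂_λ)/∂y_i, the GDF can also be checked numerically by perturbing each observation and refitting; the sketch below does this with central differences (the synthetic data, λ, and step size are illustrative assumptions). For small λ the value should be close to p, the number of parameters:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
t = np.arange(2, 11)
y = 100 * (1 - np.exp(-0.4 * t))**3 + rng.normal(0, np.sqrt(10), t.size)

def f(t, theta):
    alpha, beta = theta
    return alpha * (1 - np.exp(-beta * t))**3

def ridge_fit(y_obs, lam, x0):
    prss = lambda th: np.sum((y_obs - f(t, th))**2) + lam * np.sum(np.square(th))
    opts = {"xatol": 1e-12, "fatol": 1e-12, "maxiter": 5000}
    return minimize(prss, x0=x0, method="Nelder-Mead", options=opts).x

lam = 0.1
base = ridge_fit(y, lam, x0=np.array([90.0, 0.3]))

def gdf(lam, eps=1e-3):
    # d_lambda = sum_i d f(t_i | theta_hat_lambda) / d y_i, by central differences
    total = 0.0
    for i in range(t.size):
        y_plus, y_minus = y.copy(), y.copy()
        y_plus[i] += eps
        y_minus[i] -= eps
        th_p = ridge_fit(y_plus, lam, x0=base)   # warm-start at the base fit
        th_m = ridge_fit(y_minus, lam, x0=base)
        total += (f(t[i], th_p) - f(t[i], th_m)) / (2 * eps)
    return total

d = gdf(lam)
```

With p = 2 parameters and a small λ, d should land near 2, matching the trace formula above.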

Numerical Examinations
In this section, we present the results of applying the above procedures to the artificially generated observations in Table 1. These observations were created by adding errors, generated from N(0, 10), to the two growth functions, Bertalanffy and Gompertz. The Bertalanffy function is f(t | θ) = α(1 − e^{−βt})^3 with θ = (α, β)′, and the Gompertz function is f(t | θ) = α exp(−βe^{−γt}) with θ = (α, β, γ)′. We set up two scenarios: in the Bertalanffy case (Case-1), the true parameter is (α, β) = (100, 0.4), i.e., y = 100(1 − e^{−0.4t})^3; in the Gompertz case (Case-2), the true parameter is (α, β, γ) = (100, 5, 0.5), i.e., y = 100 exp(−5e^{−0.5t}). From these two settings, artificial observations at ages 2, 3, ..., 10 were generated, as presented in Table 1. Using the data in Table 1, we compared the RSS- and Ridge-estimates in numerical experiments. In Case-1, under a proper initial-value setting, we obtain the proper estimate, shown as the thicker dotted line "Proper estimate" in Figure 1; an improper initial value converges to the improper estimate, shown as the thinner dotted line "Improper estimate" in Figure 1. When we apply ridge estimation with this improper initial value, the estimate is adjusted to be close to the true parameters. Likewise, in Case-2, a proper initial-value setting yields the proper estimate (thicker dotted line "Proper estimate" in Figure 2), while an improper initial value converges to an improper estimate (thinner dotted line "Improper estimate" in Figure 2); applying ridge estimation with the improper initial value again adjusts the estimate to be close to the true parameters. Table 2 summarizes the parameter estimates and the estimated ridge parameter (λ in [4]). In Case-1, the range of initial values is set as α ∈ [60, 140] and β ∈ [0.25, 1], and we obtain the RSS- and Ridge-estimates for combinations of α and β.
Figure 3 shows that while there are a few outlier-type estimates among the RSS-estimates, the Ridge-estimates contain no such estimate within this range. Regarding the outliers in the RSS-estimates, α̂ tends to deviate downward, while β̂ tends to deviate upward. In Case-2, the range of initial values is set as α ∈ [70, 130], β ∈ [2.2, 10] and γ ∈ [0.22, 1]. No extreme outliers (as in Case-1) were observed, but differences in variability were observed between the RSS- and Ridge-estimates; the Ridge-estimates showed less variability. For all three parameters α, β and γ, the RSS-estimates included a certain number of estimates that deviated from the appropriate values.

Conclusion
Let us consider the source of the error when an improper initial-value setting in nonlinear regression leads to an improper estimate, using the Case-1 scenario. The characteristics of the Bertalanffy growth function are controlled by the two parameters α and β. Figure 5 is a heatmap showing how the RSS behaves depending on α and β: the horizontal axis represents α, the vertical axis represents β, and the RSS is expressed by shading and contours. The RSS surface in the heatmap is not a simple bowl of concentric contours centered on the true setting (α, β) = (100, 0.4). Taking α as an example, there is a valley-shaped bottom in the RSS around α = 70 regardless of the value of β. As observed before, the estimate from the improper initial value (α, β) = (20, 5) converges to α̂ = 67.152. This is because the iterative process cannot reach the best estimate after the RSS decreases to a value around α = 70 along the horizontal direction. Such apparent convergence to a local minimum of the RSS, depending on the initial values, is the cause of the instability of the estimate. (Figure 5 caption: This heatmap shows the behavior of the RSS for Case-1 (Bertalanffy function). The horizontal axis denotes α, the vertical axis denotes β, and the RSS corresponding to (α, β) is expressed by shading and contours. The location of the true parameter (α, β) = (100, 0.4) is marked by a black rhombus.)
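The shape of the RSS surface can be examined without plotting by evaluating the RSS on a parameter grid; the synthetic data and grid ranges below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(2, 11)
y = 100 * (1 - np.exp(-0.4 * t))**3 + rng.normal(0, np.sqrt(10), t.size)

alphas = np.linspace(20, 140, 121)
betas = np.linspace(0.1, 5.0, 99)
A, B = np.meshgrid(alphas, betas)                   # beta varies along rows, alpha along columns
pred = A * (1 - np.exp(-B * t[:, None, None]))**3   # shape (9, 99, 121)
RSS = ((y[:, None, None] - pred)**2).sum(axis=0)    # RSS over the (beta, alpha) grid

i, j = np.unravel_index(RSS.argmin(), RSS.shape)
print("grid minimum near (alpha, beta) =", alphas[j], betas[i])
```

The grid minimum sits near the true (100, 0.4), but inspecting `RSS` row by row also exposes the shallow valley around α ≈ 70 that traps the iteration started at (20, 5).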
In practice, when numerical parameter estimation for a nonlinear function such as a growth function does not work well, the first countermeasure is to reselect the initial value based on previous experience. In such a scenario, a trial-and-error process is repeated while considering the role or meaning of the parameters. This is very effective when the number of samples is small enough that attention can be paid to the characteristics of each sample. However, when there are many samples and everything is estimated at once (for example, applying the same estimation procedure to many samples inside a "for" loop), there is a risk that the entire calculation stops as a result of a single error, or that improper convergence goes unnoticed. For such situations, this paper applied the ridge estimation approach to the growth function to stabilize the estimation procedure. Throughout the numerical examination in Section 3, it was evident that the Ridge-estimate produces more stable results than the usual residual-based estimate. In order to quantify the stability of the ridge estimate, Table 3 shows the mean squared error (MSE) for the RSS- and Ridge-estimates in Cases 1 and 2 (Bertalanffy and Gompertz). Here, the MSE is defined as the expected value of the squared difference between observation and prediction,

MSE = E[{y(t) − f(t | θ̂)}^2].
This measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. The MSE is a risk function corresponding to the expected value of the squared error loss, and is a measure of the quality of an estimator: the closer the value is to zero, the better. In both cases, the Ridge-estimate results in a smaller MSE, which suggests that the Ridge-estimate is better than the RSS-estimate.
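The empirical MSE computation itself is a one-liner; a minimal helper (the function name is hypothetical, not from the paper's script):

```python
import numpy as np

def mse(y_obs, y_pred):
    """Mean squared error: average squared difference between observation and prediction."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    return float(np.mean((y_obs - y_pred)**2))
```

For example, `mse([1, 2, 3], [1, 2, 5])` gives 4/3, the average of the squared errors 0, 0 and 4.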
(Table 3 caption: MSE obtained by the RSS- and Ridge-estimates under the two situations (Bertalanffy and Gompertz).)
From the preceding sections we learned that, in the estimation of a growth function, there is a risk that an improper initial-value setting results in an improper estimate; in other words, the estimation of a growth function has an instability that depends on the initial value. Until now the setting of the initial value has been left to trial and error, depending very much on the experience and intuition of the researcher. Finding an appropriate initial value is a difficult task, and extremely so for a function with a large number of parameters. One way to address this difficulty is outlined in this paper: a new application of the ridge estimation approach. The ridge estimation approach did not necessarily converge to a proper estimate for every initial-value setting, but at least more stable convergence was observed than for the RSS-estimate. Although the Ridge-estimate has a bias, we know from equation [7] that it is close to the RSS-estimate if the estimated λ is not large. It can be seen from Figure 6 that the estimated ridge parameters were very small, so the Ridge-estimates are comparable to the RSS-estimates. Although ridge estimation is not universal, it appears to extend the range of initial-value settings that converge to a proper estimate. (Figure 6 caption: The vertical axis denotes the estimated value of the ridge parameter λ.)