A New Recurrence Formula for Efficient Computation of Spherical Harmonic Transform

A new recurrence formula to calculate the associated Legendre functions is proposed for efficient computation of the spherical harmonic transform. This new recurrence formula makes the best use of the fused multiply–add (FMA) operations implemented in modern computers. The computational speeds in calculating the spherical harmonic transform are compared between a numerical code in which the new recurrence formula is implemented and another code using the traditional recurrence formula. This comparison shows that implementation of the new recurrence formula contributes to a faster transform. Furthermore, a scheme to maintain the accuracy of the transform, even when the truncation wavenumber is huge, is also explained.


Introduction
In many global atmospheric general circulation models, e.g., OpenIFS (ECMWF 2017) and DCPAM (Takahashi et al. 2016), the spectral method with spherical harmonics is used in horizontal discretization. To evaluate the nonlinear terms in these models, the transform method, which was introduced by Orszag (1970), is used for computational efficiency. The transform method requires two kinds of transforms: Grid point data are transformed to coefficients for the spherical harmonics in the forward transform and the reverse is done in the backward transform. We call the two kinds of transforms the spherical harmonic transform hereinafter. Because the spherical harmonic transform must be computed many times per one time step in time-integrations of the general circulation models, it is a fundamental tool, but it is desirable to reduce the computational time for the transform. Note that the spherical harmonic transform is also used in other research areas, e.g., cosmic microwave background studies (Reinecke et al. 2006) and planetary dynamos (Wicht and Tilgner 2010).
The spherical harmonic transform consists of the discrete Fourier transform for the longitudinal direction and a transform from coefficients of associated Legendre functions to function values at specified nodes, which we call the associated Legendre function transform hereinafter, for the latitudinal direction. The discrete Fourier transform can be computed with the fast Fourier transform (FFT) algorithm, and its computational complexity is O (M 2 log 2 M ), where M is the truncation wavenumber of the spherical harmonic transform. The associated Legendre function transform has a computational complexity of O (M 3 ), and so, it is one of the most time-consuming parts in the computation of dynamics in a spherical spectral model when the horizontal resolution of the model becomes fine. For the associated Legendre function transform, no exact fast algorithm has been discovered. However, several fast algorithms have been proposed to compute the associated Legendre function transform approximately, e.g., Healy et al. (2003) and Seljebotn (2012). In particular, in Seljebotn (2012), a fast algorithm, the computational complexity of O (M 2 log 2 M ), was proposed, and the implementation detail was described. Although such approximate fast algorithms may be promising when M is large because of the asymptotic behavior of the computational complexity, they have disadvantages, for example, that the implementations are very complicated, that they conduct the transform approximately within a given accuracy, and that they require a huge memory area when M is large.
As a strategy to deal with the computational complexity for the associated Legendre function transform other than exploring a fast algorithm, optimizing corresponding numerical codes is a natural choice. Schaeffer (2013) showed that his highly optimizing numerical code for the spherical harmonic transform spent much less computational time than that spent by other competitive codes, which included numerical codes based on fast algorithms. While several tuning techniques have been incorporated into his numerical code, one of the most essential parts is that the associated Legendre functions are computed "on-the-fly" during the transform rather than computed and stored before the transform. The reason why this on-the-fly computation contributes to speed tuning is that it requires much less memory (O (M 2 )) than that required when computed in advance (O (M 3 )). Therefore, this takes advantage of the cache memory on modern computers. However, computing the associated Legendre functions on-the-fly has computational complexity of the same order as that for the transform itself (O (M 3 )). Because the associated Legendre functions are computed using a recurrence formula, a more efficient recurrence formula requiring less computation would be beneficial.
In this paper, we propose a new efficient recurrence formula for the associated Legendre functions. We show that the recurrence formula is effective especially on recent computers that have fused multiplyadd (FMA) operations. The recurrence formula also has the advantage of maintaining accuracy compared to the commonly used recurrence formula. To demonstrate that the adoption of the new recurrence formula contributes to the transform speed, we develop a numerical code that adopts the proposed recurrence formula and compare its speed with the speed of the code proposed by Schaeffer (2013). In developing the code, care must be taken to maintain the accuracy of the transform when the truncation wavenumber is huge. Accordingly, we also introduce a scheme to maintain the transform accuracy. Furthermore, we explain several tuning techniques incorporated into the developed numerical code.
The remainder of the present paper is organized as follows. After we describe the fundamentals of the spherical harmonic transform in Section 2, we propose a new recurrence formula for the associated Legendre functions and explain why the new formula is efficient compared to the commonly used recurrence formula in Section 3. In Section 4, a scheme for the transform accuracy to be maintained even when the truncation wavenumber is huge is introduced, and several tuning techniques implemented in developing a numerical code for the transform are described in Section 5. In Section 6, the speed and accuracy performance of the developed numerical code are compared with those of another numerical code based on Schaeffer (2013). A summary and a discussion are presented in Section 7.

Fundamentals of the spherical harmonic transform
In the spherical spectral method, dependent variables in the governing equations of the model are expanded using spherical harmonics as follows: (1) where f is a dependent variable, such as temperature, s n m is the expansion coefficient, λ is the longitude, μ = sin φ, φ is the latitude, and M is the truncation wavenumber. The spherical harmonics, Y n m , is defined as Here, P n m ( μ) is the associated Legendre function, which is defined as Note that P n m ( μ) is normalized to satisfy the following orthogonality relation: Here, δ nn¢ is the Kronecker delta. By this orthogonality, the following "inverse" of (1) holds: Because Y n -m = (Y n m )*, where ( )* indicates the complex conjugate, s n -m = (s n m )* must be satisfied if f (λ, μ) is a real function. In the following, it is assumed that this constraint is satisfied.
In most spherical spectral models, the expansion (1) is evaluated on grid points (λ k , μ j ) (k = 0,1, ¼, K-1; j = 1, 2, ¼, J ). That is, Here, μ j ( j = 1, 2, ¼, J ) are called Gaussian nodes, which are defined as zero points (sorted in ascending order) of P J 0 ( μ), and λ k = 2πk /K (k = 0,1, ¼, K -1). When both K > 2M and J > M hold, the discrete counterpart of (5), that is, the "inverse" of (6), is given as Here, w j ( j = 1, 2, ¼, J ) is called the Gaussian weight and is defined as In this paper, we call the transform from s n m to f (λ k , μ j ) as described by (6) the backward transform and the transform from f (λ k , μ j ) to s n m as described by (7) the forward transform.
The backward transform (6) and the forward transform (7) can be divided into two stages, respectively, as follows: The operation Re( ) indicates that the real part of the argument is taken. Although the computations of (10) and (11) can be done with a relatively low cost using FFT, for which highly optimized libraries are available, e.g., FFTW (Frigo and Johnson 2005), no exact fast algorithm has been found, and hence, optimization becomes important for the computations of (9) and (12).

A new recurrence formula
In the computation of the associated Legendre function transforms (9) and (12), the simplest implementation is to compute all of the P n m ( μ j ) needed in advance and store them in the computer memory to be used repeatedly. With this implementation, however, the amount of data that should be loaded is of the same order as the number of required operations; such an implementation fails to take advantage of the cache memory on modern computers. To overcome this problem, it is natural to try to compute all of the associated Legendre functions "on-the-fly", that is, to compute them simultaneously every time with the transform. This is one of the most essential points proposed in Schaeffer (2013), although the idea of computing the associated Legendre function on-thefly itself has been adopted previously in other studies, such as Ishioka et al. (2000), Rivier et al. (2002), and Reinecke (2011). In these on-the-fly computations, the following recurrence formula is adopted: 1 . This recurrence formula requires three multiplications and one addition per one step even if (- n m ) and 1/ n m +1 are computed and stored in advance. On the other hand, in computing (9) and (12), two multiplications and two additions are required per one n for m ¹ 0, because both real and imaginary parts of s n m and (w j g j m ) should be treated. That is, the traditional recurrence formula (13) requires the same number of operations as the transform itself per one n. Note that, a multiplication and an addition have the same computational cost on modern computers. Furthermore, modern computers have FMA operations, and on such computing platforms, a pair of multiplication and addition has the same computational cost as a single multiplication or a single addition. Therefore, on computers having FMA, the traditional recurrence formula (13) requires more computational cost than the transform itself because the recurrence formula requires operations in which the computation is equal to that of three FMAs while the transform itself requires two FMAs. In the following, we derive a new recurrence formula to reduce the number of required operations.
The traditional recurrence formula (13) can be rewritten as follows: By multiplying μ on both sides of (14) and applying (14) on the right-hand side recursively, we obtain, µ µ µ µι In this form of recurrence formula, the associated Legendre functions, which have even numbers in the lower suffix, are treated independently of those with odd suffixes. First, let us consider the Legendre functions P n m ( μ) for which (nm) is an odd number, which means that they are odd functions of μ. If we write n = m + 2l + 1 (l = 0,1, ¼), the recurrence formula (15) can be rewritten as That is, p l m ( μ) is an even function of μ. Substituting To simplify (18), we impose the following condition on α l m : This means that the coefficients multiplied on p m l +1 ( μ) and p m l -1 ( μ) on the right-hand side of (18) have the same absolute value but opposite signs. From (19), the following equation is derived.
Applying (20) The reason for this choice is explained later. Substituting (22) into (21) By multiplying (-1) l α l m on both sides of (18) and then using (23) we can rewrite (24) Using the new recurrence formula (26), the lower suffix l of p l m ( μ j ) can be increased by one with two FMA operations if μ j 2 , a l m , and b l m are computed and stored in advance; to store them, only O (M 2 + J ) memory area is required. That is, one FMA operation is required per one increase in the corresponding suffix n of P n m , and this FMA operation cost is onethird the cost of the traditional recurrence formula (13). One might think that the efficiency of the new recurrence formula (26) would be lost if it must also be used in the case that nm is even. However, this problem can be circumvented as follows. The backward transform (9) Note that, in the derivation of (28), we used the fact that (14) Therefore, the backward associated Legendre function transform can be computed using (29) As described above, the computational procedure of the backward and the forward associated Legendre function transforms with the new recurrence formula is completed. The coefficients α l m (l = 0,1, ¼) can be computed recursively using (21) from the starting point given by (22). To compute p l m ( μ) (l = 0,1, ¼) using (26), the starting points p 0 m ( μ) and p 1 m ( μ) must be given. From definition (3), the following holds: Therefore, by considering (17) and (22)

Maintaining accuracy
As described in Enomoto (2015), using the traditional recurrence formula (13) naively leads to an accuracy problem when the truncation wavenumber is large, such as M  1000. This is because the absolute values of the starting points for (13), P m m ( μ) and P m m+1 ( μ) given in (38), become too small to be expressed in finite digit precision arithmetic and this causes an underflow when m is large and | μ | » 1. This problem also appears even when the new recurrence formula (26) is used. To overcome this problem, we adopt the following strategy, while there exist other strategies, e.g. "enhanced exponent" approach (Reinecke 2011). First, we design the transform algorithms to require the on-the-fly computation of p l m ( μ j ) only when underflow does not occur. In (31), (32), (35), and (36), the computations for l = 2γ and l = 2γ + 1 (γ = 0,1, ¼) are paired with each other so that p m 2γ+2 ( μ j ) and p m 2γ+3 ( μ j ) for the next step can be computed from p m 2γ ( μ j ) and p m 2γ+1 ( μ j ) using (26). In this procedure, we introduce j γ m (γ = 0,1, ¼) so that j γ m £ J/2 is the maximum integer that satisfies the following: Here, ξ > 0 is a prescribed threshold value, which we set as ξ = 10 -20 with a margin in IEEE754 double precision arithmetic to determine which operations can be omitted. That is, for a γ, computations that include p m 2γ ( μ j ) or p m 2γ+1 ( μ j ) ( j < j m γ ) are omitted. For example, in (36), the computation for a γ is done practically as For the case of γ = 0, we define j m -1 = J/2 + 1 for convenience. The memory area required to store these values is only of O (J ) for each m. Furthermore, the absolute values of these starting points for the onthe-fly computation are large enough to not cause the underflow because of the design described above. Note that the omission of the computations near the pole described above leads to reduction of the computations required for the spherical harmonic transform, and such implementations were also proposed in previous works, such as Juang (2004) and Schaeffer (2013). This kind of idea to reduce the computations corresponding to the polar regions dates back to Hortal and Simmons (1991), where the reduced Gaussian grid proposed by Kurihara (1965) was used to reduce the computational cost for time-integration of spherical spectral models.
Second, to avoid the underflow in the computation of the starting points defined above, we use the following procedure. From (38) When m is large and | μ j | » 1, the absolute value of p 0 m ( μ j ) becomes very small and so leads to the underflow in the finite digit precision arithmetic. Instead, we introduce q 0 m ( μ j ) and q 1 m ( μ j ) as To avoid the overflow in the computation of (45), β γ m (γ = 1,2, ¼) is a scaling factor introduced as follows: In addition, we define β 0 m = 1 for convenience. We set η = 10 270 with a margin in IEEE754 double precision arithmetic. Because log ( p 0 m ( μ j )) can be computed without the risk of underflow as Although the use of the logarithmic and the exponential functions might seem costly in the computations (48), their use is not problematic because these computations must be done only once in advance to provide the starting points for the on-the-fly recurrence.
Third, although it is not directly related to avoiding the underflow, the computations that should be done in advance but are not required to be done on-the-fly can be done in a higher precision arithmetic. For example, computations of α l m are done with IEEE754 quadruple precision arithmetic to keep the accuracy as high as possible. The Gaussian nodes and the Gaussian weights are also computed with IEEE754 quadruple precision arithmetic using a simple Newtonian iteration in μ and using (8), respectively, which provides sufficient precision for the spherical harmonic transform within IEEE754 double precision arithmetic.

Implementation and optimization
On the basis of the new recurrence formula proposed in the previous sections, we provide an implementation of the spherical harmonic transform, (6) and (7), as Fortran77 subroutines in open-source software ISPACK (ispack-2.1.3; Ishioka 2017). In the implementation, several subroutines are written in an assembly language to fully utilize not only FMA operations but also single instruction multiple data (SIMD) operations in Intel CPUs. That is, in the computations of (31), (32), (35), and (36), the loops for suffix j are implemented so that SIMD operations are used. FFT is done using an originally developed compact FFT library, which shows a comparable performance to FFTW. Furthermore, to utilize recent many-core CPUs, the backward and the forward associated Legendre function transforms for m = 0,1, ¼, M are computed in parallel using OpenMP and by specifying the dynamic scheduling option.

Speed and accuracy comparison
To evaluate the performance of the numerical library ISPACK, we compare its speed and accuracy with those of SHTns (SHTns-2.8; Schaeffer 2017), which was developed on the basis of Schaeffer (2013). The comparison methodology, proposed in Schaeffer (2013), is as follows. In the spherical harmonic transforms (6) and (7), we first set the value of each of the real and the imaginary parts of each s n m (n = 0,1, ¼, M ; m = 0,1, ¼, n) to a uniform random number in [-1,1]. Next, we compute the backward transform (6). Then, the forward transform (7) is computed from f (λ k , μ j ), and we write the obtained result as (s n m )¢ because the input of the backward transform is not completely equal to the output of the forward transform in finite digit precision arithmetic. The speed of each of the libraries is evaluated by measuring the elapsed time required for each transform. The accuracy of each of the libraries is also evaluated by calculating the L ¥error and the L 2 -error, which are defined, respectively, as First, we compare the speed. The computational platform is a Linux (Debian 9.1) server that has two Xeon E5-2699v4 CPUs. Each of the CPUs has 22 cores and the total number of cores of the server is 44. The Fortran compiler and the compiling option for ISPACK are "gfortran 6.3.0" and "-O3 -march=native -fopenmp", respectively. The C compiler and the compiling option for SHTns are "gcc 6.3.0" and "-O3 -march=native -ffast-math", respectively. The tested truncation wavenumbers are M = 1023, 2047, 4095, 8191, and 16383. The numbers of the longitudinal and the latitudinal nodes are set as I = 2(M + 1) and J = M + 1, respectively. Table 1 shows the comparison of the elapsed times by the number of OpenMP threads set as 44. Although the elapsed times increase as M increases approximately in proportion to M 3 for both SHTns and ISPACK, the elapsed times for ISPACK are shorter than those for SHTns in each case of M for both the forward and the backward transforms. In particular, the superiority of ISPACK to SHTns in elapsed times is more significant in larger M cases. The superiority is thought to originate from the reduction of FMA operations for the recurrence computations in ISPACK, because SHTns adopts a similar on-the-fly computation strategy of the associated Legendre functions as is done in ISPACK, but the recurrence formula used is the standard one shown as (13). Furthermore, the elapsed time for ISPACK is shorter for the forward transform than that for the backward transform for each M, although the compu-tational complexity is the same between the forward and the backward transforms. The difference is due to the implementation of ISPACK, in which the memory access for the forward transform is less than that for the backward transform.
Second, we compare the accuracy. Table 2 compares the L ¥ -error and the L 2 -error for SHTns with those for ISPACK for M = 1023, 2047, 4095, 8191, and 16383. Although these error estimates depend on the chosen random numbers and they become large as M increases, the L ¥ -error and the L 2 -error for ISPACK are smaller than those for SHTns for each M. This is thought to be due to the reduction of the number of computations required for the recurrence formula adopted in ISPACK.

Summary and discussion
In the present paper, we presented a new recurrence formula to calculate the associated Legendre functions to efficiently compute the spherical harmonic transform. The new recurrence formula was derived to take advantage of the fused multiply-add operation implemented on modern computers. We compared the computational speed and the accuracy of a numerical code, ISPACK, which is based on the new recurrence formula for the spherical harmonic transform, with those of another numerical code, SHTns, which was developed based on Schaeffer (2013) and showed that ISPACK was superior to SHTns in both speed and accuracy. We did not compare ISPACK with any fast algorithms. However, in Schaeffer (2013), SHTns was compared with a fast algorithm (Healy et al. 2003), and SHTns was shown to be several times faster than the fast algorithm. Furthermore, Schaeffer (2013)  3.4 ´ 10 -12 1.2 ´ 10 -12 9.4 ´ 10 -12 5.5 ´ 10 -12 5.5 ´ 10 -11 1.6 ´ 10 -11 6.4 ´ 10 -11 3.9 ´ 10 -11  rms SHTns ISPACK 5.4 ´ 10 -14 4.6 ´ 10 -14 1.3 ´ 10 -13 9.4 ´ 10 -14 2.6 ´ 10 -13 2.0 ´ 10 -13 7.0 ´ 10 -13 4.5 ´ 10 -13 1.0 ´ 10 -12 8.3 ´ 10 -13 showed that SHTns was about four times faster than libpsht (Reinecke 2011) at M = 4095 andSeljebotn (2012) showed that his newer fast algorithm, Wavemoth, was about six times faster than libpsht in the case corresponding to M = 4095. On the other hand, as shown in Table 1, ISPACK is about 1.5 times faster than SHTns at M = 4095, considering the average elapsed time for the backward and forward transforms. Therefore, ISPACK is estimated to have comparable speed as Wavemoth at M = 4095, although this estimate is a rough one because the computing platforms used by these papers are all different. A faster numerical code for the spherical harmonic transform is desired not only for constructing global atmospheric general circulation models using a spherical spectral method, as mentioned in Section 1, but also for analysis of global data even when the data are generated by a finite-difference model or by observation. This is because the spherical harmonics are eigenfunctions of the horizontal Laplacian on a sphere and the spherical harmonic expansions are useful for analyzing global atmospheric dynamics. Hence, we believe that the new recurrence formula proposed in the present study will contribute to not only constructing a global spectral model but also a global atmospheric data analysis. In particular, when the horizontal resolution of global data is very fine, an onthe-fly computation strategy of associated Legendre functions is inevitable because of the limitations of computer memory. Thus, the new recurrence formula has the potential to become a fundamental technique. Furthermore, owing to the less memory usage, the on-the-fly computation strategy combined with the proposed new recurrence formula could enable use of a higher resolution global atmospheric general circulation model or a larger ensemble size on a wide range of machines from workstations to supercomputer systems.