Parallel computing for high-resolution/large-scale flood simulation using the K supercomputer

This paper reports the implementation of high-performance computing for large-scale/high-resolution flood simulation using the K supercomputer in Kobe, Japan. The K supercomputer was developed by RIKEN and Fujitsu and completed in 2012; it was ranked first in the list of the Top 500 supercomputer sites in 2011, during its development stage. A two-dimensional inundation simulation model based on the shallow water equations, using an existing numerical scheme, was parallelized for the K supercomputer. Osaka and other cities along the Yodo River were chosen as the application site, and the area was discretized into 12824442 (= 3453 × 3714) nodes at a resolution of 10 m. The computational time for a five-hour flood simulation was measured while changing the number of 8-core CPUs of the K supercomputer. As a result, the computational time decreased to 9.3 min when 128 × 64 = 8192 8-core CPUs were used, compared with 1423.7 min on one 8-core CPU. Thus, the simulation speed increased by a factor of 153.2 through the use of the K supercomputer.


INTRODUCTION
Parallelization is a method to enhance the computational speed of flood modelling. In fact, there are several approaches to shortening the CPU time required for flood modelling, such as changing the algorithm of the numerical model, enhancing the capability of the computer hardware, and/or parallel computing. A recent trend in computer hardware development is to increase the number of cores in a CPU rather than the clock frequency of a CPU. Therefore, at present, developing parallel computing methodology offers the greater advantage with respect to reducing computational time. This study is one such attempt, i.e. a parallel computing application to a particularly large-scale/high-resolution flood simulation using the K supercomputer in Kobe, Japan. The K supercomputer is a next-generation high-performance computer developed by the Institute of Physical and Chemical Research (RIKEN) and Fujitsu as a part of the High-Performance Computing Infrastructure (HPCI) Initiative led by the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT). The shared use of the K supercomputer began in November 2012. It was ranked first in the list of the Top 500 supercomputer sites in June and November 2011, during its development stage. Since then, the computer has been considered one of the fastest in Japan in terms of the LINPACK (LINear algebra PACKage) benchmark. The CPU is a SPARC64 VIIIfx (2.0 GHz) with 8 cores. The computer has a total of 82944 CPUs and thus 663552 (= 82944 × 8) cores.
This paper reports the application of the K supercomputer to a flood inundation simulation of Osaka and other cities in its vicinity. The simulation area was approximately 34.5 km × 37.1 km, discretized at a resolution of 10 m, resulting in 12824442 (= 3453 × 3714) computational nodes. The distributed rainfall-runoff/flood-inundation simulation model (DRR/FI; Kobayashi et al., 2011, 2012, 2014a, b; Kobayashi and Takara, 2012), written in the C environment with a parallel computing capability, was applied. The proposed parallelization method is basically universal and can thus be applied to any standard parallel computer. However, script code was written for the K supercomputer to stage in the input data, run the model, and stage out the simulation results. Moreover, hybrid parallelization was used: the 8 cores in each CPU were used exclusively for thread parallelization with Open Multi-Processing (OpenMP), while the CPUs of the K supercomputer were used for process parallelization (domain decomposition) with the Message Passing Interface (MPI). This strategy for core utilization differs from the previous works outlined below. Hybrid parallelization is recommended on the K supercomputer because it reduces memory use and data transfer. If the cores were also used for process parallelization (domain decomposition), the number of MPI processes in each CPU would increase, approaching the limits of both memory and data transfer. Instructions related to the hybrid parallelization are included in the script code.
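As a minimal sketch of this hybrid layout (the code below is illustrative, not the published model; all names are ours), one MPI process is launched per CPU and OpenMP spreads the work over that CPU's 8 cores:

/* Hypothetical sketch of the hybrid parallelization described above:
   one MPI process per CPU and 8 OpenMP threads within each process. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;

    /* FUNNELED: only the master thread of each process calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    omp_set_num_threads(8);          /* 8 cores per SPARC64 VIIIfx CPU */

    #pragma omp parallel
    {
        /* threaded computational kernels run here */
    }

    if (rank == 0)
        printf("%d MPI processes x 8 OpenMP threads\n", nprocs);
    MPI_Finalize();
    return 0;
}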
Several previous studies have investigated the application of parallel computing to flood-inundation modelling. In general, some inundation models attempt to solve the full two-dimensional (2D) shallow water equations (Hunter et al., 2008), while others use the diffusive wave simplification, neglecting the inertial terms of the momentum equation (Leandro et al., 2014; Yu, 2010), or apply the so-called storage cell model based on Manning's equation for the flux calculation between cells (Bates and De Roo, 2000). Neal et al. (2010) presented the application of different parallelization techniques (i.e. OpenMP, MPI, and an accelerator card) to the 2D inundation model LISMIN, developed from LISFLOOD-FP by Bates and De Roo (2000). LISMIN applies an adaptive time step. The LISMIN-MPI code was run on 1 to 100 Intel E5462 cores (2.8 GHz) with a computational domain of 200 × 384 cells (at a resolution of 2 m). As a result, the code realized an efficiency (Speed-up = Serial run time/Parallel run time; Efficiency = Speed-up/Number of cores) of 70% using 100 cores (2 rows per core) when the domain was fully wet. However, the results deteriorated in the point-source case of Neal et al.'s (2010) simulation, where the computational domain was not large compared with the area inundated from the point source. The fully wet domain refers to the entire computational domain being initially filled with water to a depth of 0.1 m; the point-source case refers to a hydrograph provided at a point in an initially dry computational domain. Although the example presented below is also a point-source case, the inundation was limited to a small part of a much larger computational domain; thus, the computational speed for the point-source case in the present paper could be faster than that for a fully wet case in the same computational domain. If this is true, the result is the opposite of Neal et al. (2010). Leandro et al. (2014) presented a parallelized diffusive model (called P-DWave). The diffusive wave simplification neglects the inertial terms of the momentum equation and thus generally leads to faster CPU times. P-DWave can handle an adaptive time step. The model was developed in both Matlab and Fortran environments. A speed-up of 1.7-5.2 times and 1.1-1.7 times for Matlab and Fortran, respectively, was realized depending on the domain size and the number of processors. The authors used a mesh consisting of a maximum of 900 × 900 cells (at a resolution of 178.8 m) on an AMD Opteron 6276 processor with 12 cores (2.3 GHz). When using 12 cores, their model achieved a speed-up of 5.1 and 1.5 times for Matlab and Fortran, respectively.
However, because of the simulation stability criterion, some previous studies have reported that diffusive wave models are computationally less effective than dynamic wave models on high-resolution meshes (Neal et al., 2012). Neal et al. (2012) reported that the time step in their diffusion wave model was proportional to Δx², whereas that of the inertial/full shallow water model was proportional to Δx (note that Δx is the spatial grid size). Moreover, the time step in their diffusive model was driven toward zero when the water surface slope approached zero, because the time step is related to the surface slope; thus, a linearization of the time-step equation was implemented at low slopes. For this reason, the diffusion wave model becomes less computationally efficient when the cell size is small. Bates et al. (2010) reported that the computational cost of the original LISFLOOD-FP was higher than that of the full 2D shallow water equation for relatively fine-resolution (1-10 m) grids; therefore, the model was improved by introducing the inertial terms into the governing equation. Bates et al. (2010) concluded that the new inertial code would most likely be faster than either diffusive or full shallow water models at any given spatial resolution, although they also commented that at very low surface friction a full shallow water model may give more accurate results.
Since the flood models in these papers were developed by different researchers in different countries with different computational settings, it is not possible to describe the relative merits of the studies accurately. Nevertheless, the points for consideration basically consist of (1) the type of governing equation of the model, (2) the computational cost at a fine resolution, (3) whether the model applies an adaptive time step, (4) the effectiveness when the number of nodes is significantly increased, and (5) whether the model is run on a workstation or a supercomputer. Table I compares these points across the literature. With respect to (1), a full shallow water equation was used in this study; the models in the literature used either the diffusion wave simplification or a cell storage model. With respect to (2), we set the model resolution to 10 m, which is not a superfine resolution such as the 2 m argued for in Neal et al. (2010). Nevertheless, the computational domain, at approximately 34.5 km × 37.1 km, is considerably larger than that considered by Bates and De Roo (2000), who targeted a 35-km stretch of the Meuse River in the Netherlands. We do not address the necessity of a superfine-resolution model in practice. With respect to (3), the model proposed in this paper does not handle an adaptive time step, considering the coupling with a dynamic wave river model and a slot model for the sewage network. Table I also identifies the studies in which an adaptive time step was used. With respect to (4), the number of nodes was increased considerably, to 12824442 (= 3453 × 3714), in the present study because, with respect to (5), a supercomputer was used. The number of nodes and the use of a supercomputer in previous studies are also summarized in Table I.

K SUPERCOMPUTER
The K supercomputer was developed jointly by RIKEN and Fujitsu, and its shared use began in 2012. The name K comes from the Japanese word "kei", which means 10 quadrillion. The computer can execute 10 quadrillion calculations per second (i.e. 10 PFLOPS) with LINPACK. This supercomputer was ranked first in the list of the Top 500 supercomputer sites in June and November 2011. The CPU of the K supercomputer is a SPARC64 VIIIfx (2.0 GHz) with 8 cores. Its theoretical performance is 128 GFLOPS with 8 cores. The computer has a total of 82944 CPUs and thus 663552 cores. The main memory consists of DDR3-SDRAM dual inline memory modules (DIMMs), and the main storage capacity is 16 GB per node. The theoretical memory bandwidth is 64 GB/s. Torus fusion (Tofu) was developed to interconnect the 82944 nodes of the computer. The network topology is a six-dimensional (6D) mesh/torus. The theoretical link throughput is 5 GB/s bidirectionally. The specifications of the K supercomputer are listed in Table II.

TWO-DIMENSIONAL INUNDATION SIMULATION
A 2D inundation simulation module of the DRR/FI model was used to study exclusively the effect of parallel computing on 2D inundation modelling using the K supercomputer. The model was developed in the C environment. The following 2D shallow water equations were used with a structured grid:

$$\frac{\partial h}{\partial t} + \frac{\partial M}{\partial x} + \frac{\partial N}{\partial y} = \frac{q_{in}}{A_{Cell}} \qquad (1)$$

$$\frac{\partial M}{\partial t} + \frac{\partial (uM)}{\partial x} + \frac{\partial (vM)}{\partial y} = -gh\frac{\partial H}{\partial x} - \frac{gn^2 u\sqrt{u^2+v^2}}{h^{1/3}} \qquad (2)$$

where h denotes the inland water depth; q_in the inflow from the dike break; A_Cell (10 m × 10 m) the 2D grid cell area; M = uh and N = vh the discharge fluxes in the x- and y-directions, respectively; u and v the velocities in the x- and y-directions, respectively; g the gravitational acceleration; n Manning's roughness coefficient; and H the water depth plus the land elevation. The momentum equation in the y-direction takes the same form as Equation (2) with M, u, and x exchanged for N, v, and y. A staggered grid was used for spatial discretization, and an explicit numerical scheme called the leapfrog method was used for temporal discretization (e.g. Inoue et al., 2000; Japan Society of Civil Engineers, 2001). Mass conservation in Equation (1) was discretized as follows:

$$h_{i,j}^{n+1} = h_{i,j}^{n} - \frac{\Delta t}{\Delta x}\left(M_{i+1/2,j}^{n+1/2} - M_{i-1/2,j}^{n+1/2}\right) - \frac{\Delta t}{\Delta y}\left(N_{i,j+1/2}^{n+1/2} - N_{i,j-1/2}^{n+1/2}\right) + \frac{q_{in}\,\Delta t}{A_{Cell}} \qquad (3)$$

where i, j indicate the location of a cell in the x- and y-directions, respectively, and the superscript n denotes the time step. The momentum equation in the x-direction was discretized as follows:

$$M_{i+1/2,j}^{n+3/2} = M_{i+1/2,j}^{n+1/2} - \Delta t\,\mathrm{ADV}_{i+1/2,j}^{n+1/2} - g\,\bar{h}_{i+1/2,j}^{\,n+1}\frac{\Delta t}{\Delta x}\left(H_{i+1,j}^{n+1} - H_{i,j}^{n+1}\right) - \frac{gn^2\sqrt{u^2+v^2}}{\left(\bar{h}_{i+1/2,j}^{\,n+1}\right)^{4/3}}\,\Delta t\,M_{i+1/2,j}^{n+1/2} \qquad (4)$$

where $\bar{h}$ is the depth interpolated to the cell face and ADV denotes the discretized advection terms ∂(uM)/∂x + ∂(vM)/∂y. In the simulation, no-flux boundary conditions were imposed on the surrounding boundaries. The land elevation of the entire simulation area is shown in Figure 1. The Yodo River, indicated in blue in Figure 1, was excluded from the simulation area. The red cross in Figure 1 marks the location where the dike break occurs. A simulation result with q_in = 1000 m³ s⁻¹ from the cross after five hours is shown in Figure 2 for reference. A one-dimensional dynamic wave river model was also parallelized; however, that method and its results are not described here, as the exclusive purpose of this paper is to discuss the effect of parallelizing a 2D inundation model using the K supercomputer. In practice, it is difficult to predict the exact location of a dike break, which is a hurdle for real-time prediction unless the break location is known immediately after the break occurs.
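To make the scheme concrete, the following is a minimal serial sketch of the mass-conservation update in Equation (3); the array names and face-centred layout are our assumptions, not the published code, and the q_in source term at the break cell is omitted:

/* Serial sketch of the leapfrog mass-conservation update, Eq. (3).
   h[i][j] is the depth at cell (i, j); M (size (nx+1) x ny) and
   N (size nx x (ny+1)) hold the fluxes at the cell faces, so that
   M[i][j] is the flux on the west face of cell (i, j). */
void update_depth(int nx, int ny, double dt, double dx, double dy,
                  double **h, double **M, double **N)
{
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++)
            h[i][j] -= dt * ((M[i + 1][j] - M[i][j]) / dx
                           + (N[i][j + 1] - N[i][j]) / dy);
}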

PARALLELIZATION METHOD
The computational domain shown in Figure 1 was decomposed for parallel computing as shown in Figure 3. In this figure, the thin lines denote the original computational domain composed of 18 × 18 small cells. The original domain was decomposed along the bold lines, as shown in the figure, into domains of 3 × 3 cells as an example. The light blue colour shows an enlarged domain including the so-called sleeves, i.e. two adjacent peripheral rows/columns of nodes (orange and grey). The data transfer between the domains was carried out through these sleeves. The data transfer itself was realized by calling the MPI functions MPI_Isend and MPI_Irecv four times.
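The paper does not list the code used to set up the process grid; a minimal sketch of one standard way to do it (function and variable names are ours) is:

/* Hypothetical setup of the 2D process grid for the domain
   decomposition. MPI_Cart_shift returns the neighbour ranks in each
   direction (MPI_PROC_NULL on the outer boundary, so communication
   there is skipped automatically). */
#include <mpi.h>

void make_process_grid(MPI_Comm *cart,
                       int *west, int *east, int *south, int *north)
{
    int nprocs, dims[2] = {0, 0}, periods[2] = {0, 0};

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);          /* e.g. 128 x 64 CPUs */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, cart);
    MPI_Cart_shift(*cart, 0, 1, west, east);   /* x-direction pair   */
    MPI_Cart_shift(*cart, 1, 1, south, north); /* y-direction pair   */
}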
First, (i) the data in the vertical orange sleeve of domain A were transferred to the vertical orange sleeve of the east domain B, as shown in Figure 4. At the same time, (ii) the vertical grey sleeve of domain A received the data of the vertical grey sleeve of the east domain B. The same procedures were carried out in the north-south direction: (iii) the horizontal grey sleeve of domain A received the data from the horizontal grey sleeve of the south domain C, and then (iv) the data of the horizontal orange sleeve of domain A were sent to the horizontal orange sleeve of the south domain C. In the north-south data transfer, the data at the four corners were renewed again, despite the fact that they had already been renewed once in the east-west data transfer.
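A minimal sketch of the east-west part of this exchange, steps (i) and (ii), might look as follows (buffer packing and the actual array layout are omitted; all names are ours):

/* Sketch of the east-west sleeve exchange, steps (i) and (ii) above.
   send_e/send_w hold the local boundary columns (orange sleeves);
   recv_e/recv_w receive the neighbours' columns into the grey
   sleeves. east and west come from MPI_Cart_shift. */
#include <mpi.h>

void exchange_ew(double *send_e, double *recv_e,
                 double *send_w, double *recv_w,
                 int n, int east, int west, MPI_Comm cart)
{
    MPI_Request req[4];

    MPI_Isend(send_e, n, MPI_DOUBLE, east, 0, cart, &req[0]); /* (i)  */
    MPI_Irecv(recv_e, n, MPI_DOUBLE, east, 1, cart, &req[1]); /* (ii) */
    MPI_Isend(send_w, n, MPI_DOUBLE, west, 1, cart, &req[2]);
    MPI_Irecv(recv_w, n, MPI_DOUBLE, west, 0, cart, &req[3]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    /* The north-south exchange, steps (iii) and (iv), repeats this
       pattern and also refreshes the four corner nodes. */
}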
In addition, the file output procedure was parallelized, since the conversion of the data to text incurs a high computational cost. Previously, all simulation results were gathered in the master process, and the master process alone carried out the conversion of the data to text. Consequently, the computational burden was concentrated on the master process while the other processes waited for the task to be completed; i.e. the master process was the simulation bottleneck. To solve this problem, the code was modified such that each process carries out its own text conversion and then sends the converted text to the master process. As the file size is large when the number of computational nodes is 12824442, this parallelization considerably reduces the computational time.
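A sketch of this output scheme follows (a simplification under our own naming; the fixed-size buffer and the per-rank ordering are assumptions, and the real code's buffering will differ):

/* Sketch of the parallelized output: every rank formats its own
   subdomain as a text block, and rank 0 only receives and writes
   the blocks in rank order. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

void write_step(const char *text, int rank, int nprocs, FILE *fp)
{
    if (rank != 0) {
        MPI_Send(text, (int)strlen(text) + 1, MPI_CHAR,
                 0, 99, MPI_COMM_WORLD);
    } else {
        static char buf[1 << 20];        /* 1 MB per incoming block */
        fputs(text, fp);                 /* master's own block      */
        for (int src = 1; src < nprocs; src++) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, src, 99,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            fputs(buf, fp);
        }
    }
}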
Moreover, thread parallelization was implemented throughout the code when possible using OpenMP.
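For example, the depth-update loop sketched earlier can be threaded over the 8 cores of a CPU with a single directive (again a sketch, not the published code):

/* Threaded version of the mass-conservation update sketched above;
   the directive spreads the (i, j) iterations across the 8 cores. */
#include <omp.h>

void update_depth_threaded(int nx, int ny, double dt,
                           double dx, double dy,
                           double **h, double **M, double **N)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < nx; i++)
        for (int j = 0; j < ny; j++)
            h[i][j] -= dt * ((M[i + 1][j] - M[i][j]) / dx
                           + (N[i][j + 1] - N[i][j]) / dy);
}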

RESULTS
The results of the parallel computing application are shown in Figures 5 and 6 and Tables III-V. Figure 5 shows the elapsed time for the five-hour simulation shown in Figure 2 against the number of CPUs of the K supercomputer. Figure 6 shows the speed-up and efficiency of the simulation. As already described, Speed-up = Serial run time/Parallel run time and Efficiency = Speed-up/Number of cores. However, in this study, the number of CPUs was used in place of the number of cores in this calculation, because the 8 cores in each CPU were used exclusively for thread parallelization; thus, these numbers cannot necessarily be compared with those in the past research discussed above. Vertical decomposition indicates that the computational domain was decomposed only in the x-direction. Likewise, horizontal decomposition indicates that the computational domain was decomposed only in the y-direction. Average decomposition means that the domain was decomposed in both the x- and y-directions. The effects of the decomposition of the domain are presented in Tables III-V. Table III shows the results for the case when the domain was decomposed only in the x-direction. Likewise, Table IV shows the results when the domain was decomposed only in the y-direction. Table V shows the results obtained when the domain was decomposed in both the x- and y-directions.
From these figures and tables, we can summarize the following:
(1) The fastest computational time among all simulations was realized by a 128 × 64 = 8192 domain decomposition (CPUs) with 8 threads (cores) in each CPU. The minimum computational time was 9.3 min.
(2) The maximum computational time among all simulations was 1423.6 min for a 1 × 1 domain (CPU) with 8 threads (cores) in each CPU. Note that the computational time with 1 CPU and 1 core exceeds 24 h, which is the maximum time limit for the use of the K supercomputer; thus, that exact time could not be measured.
(3) The ratio of the maximum to the minimum computational time was 153.2; in other words, the simulation speed increased by a factor of 153.2 (a quick check follows this list).
(4) However, the efficiency obtained by using the 8192 8-core CPUs was only 1.9%.
(5) An efficiency of around 50% was realized when the total number of CPUs was 64 (i.e. 512 cores). The speed-ups in these cases were around 30-35.
(6) In general, the computational time decreased with an increase in the number of CPUs. Nevertheless, a linear speed-up was not realized. The main reason is most likely a load-imbalance problem: decomposed domains containing dry cells must wait for the calculation of the wet cells, i.e. for the slower domains. Because data are transferred at every time step, the load imbalance accumulated at each step prevents an ideal speed-up.
(7) Table VI shows the time required for communication (i.e. data transfer) in the various decompositions. The communication time changed little among the one-way decompositions, because the amount of data communicated through the sleeves is almost the same however finely the domain is cut in one direction. In contrast, there was a significant reduction in the communication time when the domain was decomposed in both the x- and y-directions, because the sleeve length per domain decreases roughly with the square root of the number of domains. In any case, the time required for communication is negligible compared to that required for the main loops (i.e. the calculations), which are mostly parallelized; thus, an ideal speed-up was probably not realized because of the load imbalance.
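Applying the definitions of speed-up and efficiency given above to the measured times reproduces points (3) and (4); the reported factor of 153.2 presumably comes from unrounded run times:

$$S = \frac{1423.6\ \mathrm{min}}{9.3\ \mathrm{min}} \approx 153, \qquad E = \frac{153.2}{8192} \approx 0.019 \approx 1.9\%$$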

CONCLUDING REMARKS AND FUTURE RESEARCH DIRECTIONS
In this paper, we presented an example of the application of the K supercomputer to a large-scale/high-resolution flood-inundation simulation. The application site was the Osaka region along the Yodo River. The simulation area was approximately 34.5 km × 37.1 km and was discretized at a mesh resolution of 10 m, resulting in 12824442 (= 3453 × 3714) computational nodes. As a result, the computational speed was considerably enhanced. The fastest computational time among all the simulations was realized by a 128 × 64 = 8192 domain decomposition (CPUs) with 8 threads (cores) in each CPU: the computational time was 9.3 min for a five-hour simulation. Compared to the simulation with a 1 × 1 domain (CPU) with 8 threads (cores), the speed was 153.2 times faster. This speed is sufficiently fast for real-time simulation as well.
In the future, the simulation speed could be increased even further if the original algorithm of the inundation model were improved; however, we believe that the results reported in this paper would not be rendered inappropriate by such an improvement. In any case, the capability of the K supercomputer has been demonstrated. The power of the K supercomputer will become more prominent as the number of computational nodes increases; a total of 12824442 (= 3453 × 3714) computational nodes is not considered extremely high in the field of computational fluid mechanics. On the other hand, attention should be paid to the use of the model in practice. One way is to reduce the number of CPUs to a practical level so that the model can be run on a generic parallel computer. The computational efficiency calculated in this study suggests that the efficiency does not necessarily improve as more CPUs are used, probably because of the load imbalance of the simulation between the decomposed domains; this is the next problem to be solved.
In any case, an approach that enhances the simulation speed on a commercial desktop computer is worth pursuing. Nevertheless, the supercomputer approach should also be developed further, since the evolution of computers often exceeds our expectations.

Figure 3. Schematic representation of the domain decomposition. The thin lines represent the original domain, which is decomposed into larger domains along the bold lines. The orange and grey colours indicate the sleeves for the data transfer

Figure 4. Schematic representation of the data transfer using sleeves (grey and orange). (i), (ii), (iii), and (iv) correspond to the procedure described in the main text of the paper

Figure 5. Elapsed time for the five-hour simulation against the number of nodes (CPUs) of the K supercomputer

Figure 6. Speed-up and efficiency of the simulation

Table I. Simulation conditions in the past literature

Table II. Specifications of the K supercomputer (Fujitsu Journal, 2012)

Table III. Computational time when the domain was decomposed only in the x-direction

Table IV. Same as in Table III, but the domain was decomposed only in the y-direction

Table V. Same as in Table III, but the domain was decomposed in both the x- and y-directions

Table VI. Time for communication (i.e. data transfer) in the simulation