Abstract
Recent GPUs (Graphics Processing Units) offer great advantages in performance and memory bandwidth for general-purpose computing. The CUDA programming environment makes it easy to use the GPU as a SIMT (single-instruction, multiple-thread) accelerator. High-order Finite Difference Methods (FDMs) have been applied to CFD (Computational Fluid Dynamics), and the advection equation has been examined as a typical benchmark. We study how computational performance depends on arithmetic intensity for several high-accuracy FDMs. A detailed description of the GPU implementation of the 5th-order WENO scheme is given, with respect to the usage of shared memory and registers. Multiple-GPU computing is required for further speedups and for large-scale computations that exceed the memory size of a single graphics card. The computational domain is decomposed three-dimensionally, and the overall performance depends not only on the computation but also on the GPU-to-GPU communication. Computation and communication are overlapped by changing the execution order of the GPU kernels. Strong scalability is shown on the TSUBAME grid cluster, and a performance of 7.8 TFlops is achieved using 60 GPUs when computing the advection equation with the 5th-order WENO scheme.