Abstract
We ported the CADMAS-SURF/3D program to a CUDA-MPI hybrid parallel application to further improve its execution speed, and studied its characteristics. By eliminating the data rearrangement overhead that remained in the previous paper, we achieved a relative speed of 800% in single-process execution. By using an MPI library with CUDA runtime support, we found the porting to be quite simple and straightforward. Even though we verified all variable contents at runtime to confirm the algorithmic equivalence between the CUDA and FORTRAN versions, we encountered instability in the computation results. Nevertheless, some cases indicated that further speed improvement is quite possible with hybrid parallelism.