Fast Computation with Efficient Object Data Distribution for Large-Scale Hologram Generation on a Multi-GPU Cluster

Takanobu BABA; Shinpei WATANABE; Boaz JESSIE JACKIN; Kanemitsu OOTSU; Takeshi OHKAWA; Takashi YOKOTA; Yoshio HAYASAKI; Toyohiko YATAGAI

doi:10.1587/transinf.2018EDP7346

Regular Section

Keywords: computer generated holography, large-scale CGH, GPU cluster

Journal, free access

2019, Volume E102.D, Issue 7, pp. 1310-1320

DOI https://doi.org/10.1587/transinf.2018EDP7346

Abstract

The 3D holographic display has long been expected as a future human interface because it does not require users to wear special devices. However, its heavy computational requirements have prevented such displays from being realized. A recent study indicates that objects and holograms of several gigapixels must be processed in real time to achieve high resolution and a wide viewing angle. To address this problem, we first adapted a conventional FFT algorithm to a GPU cluster environment in order to avoid heavy inter-node communication. We then applied several single-node and multi-node optimization and parallelization techniques. The single-node optimizations include a change in the method of object decomposition, reduction of data transfer between the CPU and GPU, kernel integration, stream processing, and utilization of multiple GPUs within a node. The multi-node optimizations include methods for distributing object data from the host node to the other nodes. Experimental results show that the intra-node optimizations attain an 11.52-fold speed-up over the original single-node code. Further, the multi-node optimizations, using 8 nodes with 2 GPUs per node, attain an execution time of 4.28 s for generating a 1.6-gigapixel hologram from a 3.2-gigapixel object. This corresponds to a 237.92-fold speed-up over sequential processing on a CPU and a 41.78-fold speed-up over multi-threaded execution on a multicore CPU, using a conventional FFT-based algorithm.
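
The speed-up figures above can be turned back into approximate baseline run times. The short Python sketch below does that arithmetic under two assumptions that the abstract does not state explicitly: that speed-up means baseline time divided by accelerated time, and that both CPU speed-ups are measured against the 4.28 s multi-node run. The derived values are therefore estimates, not results reported by the authors, and all names in the code are illustrative.

```python
# Figures quoted in the abstract.
MULTI_NODE_TIME_S = 4.28         # 8 nodes, 2 GPUs per node
SPEEDUP_VS_SEQUENTIAL = 237.92   # versus sequential CPU processing
SPEEDUP_VS_MULTITHREAD = 41.78   # versus multi-threaded multicore CPU
HOLOGRAM_GIGAPIXELS = 1.6        # size of the generated hologram


def implied_baseline_seconds(accelerated_s: float, speedup: float) -> float:
    """Baseline time implied by the assumed definition
    speedup = baseline_time / accelerated_time."""
    return accelerated_s * speedup


# Assumption: both speed-ups are relative to the 4.28 s multi-node run.
sequential_s = implied_baseline_seconds(MULTI_NODE_TIME_S, SPEEDUP_VS_SEQUENTIAL)
multithread_s = implied_baseline_seconds(MULTI_NODE_TIME_S, SPEEDUP_VS_MULTITHREAD)

# Hologram pixels produced per second by the multi-node configuration.
throughput_gpix_per_s = HOLOGRAM_GIGAPIXELS / MULTI_NODE_TIME_S

print(f"implied sequential CPU time:     {sequential_s:.1f} s")
print(f"implied multi-threaded CPU time: {multithread_s:.1f} s")
print(f"multi-node throughput:           {throughput_gpix_per_s:.3f} gigapixel/s")
```

Under these assumptions the sequential CPU run would take roughly 1018 s and the multi-threaded CPU run roughly 179 s, with the cluster producing about 0.37 gigapixel of hologram per second.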
