Nowadays, the individual nodes of a distributed parallel computer consist of multi- or many-core processors that can execute more than one process per node. The large difference in communication speed within a node, through shared memory, versus across nodes, through the network interconnect, requires locality-aware communication schemes for any efficient distributed application. However, writing efficient locality-aware MPI code is complex and error-prone, because the developer has to use very different APIs for communication within and across nodes and must manage inter-process synchronization. In this paper, we analyze and enhance a recent one-sided communication model, namely DART-MPI, which is implemented on top of MPI-3. In this runtime system, the complexity of handling the locality of MPI memory access operations, whether remote or local, and the related synchronization calls is hidden inside the DART-MPI interfaces, resulting in concise code and improved application and developer productivity. We have carried out an in-depth evaluation of our DART-MPI system. First, a micro-benchmark quantifies the principal performance overhead of the DART-MPI APIs; this overhead is small and becomes negligible as message sizes grow. We then compare the performance of DART-MPI and flat MPI without locality awareness, in particular for blocking and non-blocking memory operations, using a realistic scientific application on a large-scale supercomputer. The comparison demonstrates that in most cases the DART-MPI version of this application performs better than the flat MPI version. Further, we compare the DART-MPI version to a functionally equivalent MPI version, which thus includes code to deal with data locality, and show that DART-MPI realizes almost the full potential of highly optimized MPI while maintaining high productivity for non-expert programmers.
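To illustrate the branching logic that DART-MPI hides behind a single one-sided interface, the following C sketch shows the kind of hand-written locality-aware put a flat MPI-3 programmer would otherwise need. It is a minimal illustration, not the paper's actual implementation: the function name and parameters are hypothetical, and it assumes a per-node shared-memory window created with MPI_Win_allocate_shared and a global window held in a passive-target epoch opened earlier with MPI_Win_lock_all.

    #include <mpi.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical helper: copy 'nbytes' from 'src' into the memory of
     * 'world_rank' at displacement 'disp'.  Assumes:
     *  - node_win was created with MPI_Win_allocate_shared over node_comm
     *    (node_comm obtained from MPI_Comm_split_type with
     *    MPI_COMM_TYPE_SHARED);
     *  - global_win spans MPI_COMM_WORLD and is in a passive-target
     *    epoch opened earlier with MPI_Win_lock_all. */
    void locality_aware_put(const void *src, size_t nbytes,
                            int world_rank, MPI_Aint disp,
                            MPI_Win node_win, MPI_Win global_win,
                            MPI_Comm node_comm)
    {
        MPI_Group world_group, node_group;
        int node_rank;

        MPI_Comm_group(MPI_COMM_WORLD, &world_group);
        MPI_Comm_group(node_comm, &node_group);
        /* Map the global rank into the node-local communicator;
         * MPI_UNDEFINED means the target lives on another node. */
        MPI_Group_translate_ranks(world_group, 1, &world_rank,
                                  node_group, &node_rank);

        if (node_rank != MPI_UNDEFINED) {
            /* On-node target: resolve a direct pointer into the shared
             * segment and copy through load/store (real code would add
             * the appropriate synchronization around this copy). */
            MPI_Aint seg_size;
            int disp_unit;
            void *base;
            MPI_Win_shared_query(node_win, node_rank,
                                 &seg_size, &disp_unit, &base);
            memcpy((char *)base + disp, src, nbytes);
        } else {
            /* Off-node target: one-sided RMA over the interconnect,
             * completed with a flush on the target rank. */
            MPI_Put(src, (int)nbytes, MPI_BYTE, world_rank, disp,
                    (int)nbytes, MPI_BYTE, global_win);
            MPI_Win_flush(world_rank, global_win);
        }

        MPI_Group_free(&node_group);
        MPI_Group_free(&world_group);
    }

In DART-MPI, a single blocking put interface subsumes this entire rank translation, window selection, and synchronization, which is the source of the conciseness and productivity gains the evaluation measures.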