Journal of Computer Chemistry, Japan
Online ISSN : 1347-3824
Print ISSN : 1347-1767
ISSN-L : 1347-1767
Technical Paper
Performance Tuning of Parallel Fragment Molecular Orbital Program (OpenFMO) for Effective Execution on K-computer
Yuichi INADOMI, Jun MAKI, Hiroaki HONDA, Toshiya TAKAMI, Taizo KOBAYASHI, Mutsumi AOYAGI, Kazuo MINAMI

2013 Volume 12 Issue 2 Pages 145-155

Abstract

The performance of the parallel fragment molecular orbital (FMO) program OpenFMO was tuned so that massively parallel FMO calculations can be carried out effectively on the K computer, one of the fastest supercomputers in the world. The tuning focused on load balancing within each small-scale molecular orbital calculation for monomers and dimers. To keep the load of each process balanced, we used dynamic load balancing with a global counter, and we implemented the global counter using de facto standard parallelization libraries, such as MPI and OpenMP, to keep our code portable. In our implementation, one thread in each group serves as the master thread of the global counter and does not calculate molecular integrals; this requires an MPI library supporting the MPI_THREAD_SERIALIZED thread level, and three kinds of code must be provided depending on the kind of thread, as shown in Figure 3, Figure 4 and Figure 5. With dynamic load balancing based on our global counter, the molecular integral calculation was well balanced across processes in each small-scale calculation (see Figure 7, lower), and the parallelization efficiency of the molecular integral part became very high (94% in 256-way parallel execution; see Figure 8, "molecular integral part"). On the other hand, the parallelization efficiency of the SCF part was observed to be poor, which lowered the efficiency of the monomer electronic-structure calculations (see Figure 8). A large-scale performance evaluation showed that a high coarse-grained parallelization efficiency (93%) was achieved in 20,480-way parallel execution on the Intel Xeon PC cluster (see Figure 8 and Figure 9), and that the elapsed time of the FMO calculation for a large molecule (16,764 atoms) was only 30 min.
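The global-counter idea described above can be illustrated with a minimal, hypothetical sketch. It is not OpenFMO's code: the paper builds the counter on MPI and OpenMP (with the MPI_THREAD_SERIALIZED thread level), whereas here, as a simplifying assumption, the counter is a dedicated Python thread and requests travel through `queue.Queue` objects inside a single process. The function names `counter_server`, `worker` and `run`, and the task costs, are illustrative only. As in the abstract, the counter thread hands out task indices and does no computation of its own.

```python
import collections
import queue
import threading


def counter_server(requests, n_tasks, n_workers):
    """Hypothetical global counter: a dedicated thread that only hands out
    task indices and does no computation itself."""
    next_task = 0
    finished = 0
    while finished < n_workers:
        reply = requests.get()        # each request carries the worker's reply queue
        if next_task < n_tasks:
            reply.put(next_task)      # give out the next unassigned task
            next_task += 1
        else:
            reply.put(None)           # no work left: tell this worker to stop
            finished += 1


def worker(requests, costs, done, lock, wid):
    """Repeatedly ask the counter for a task until it answers None."""
    reply = queue.Queue(maxsize=1)
    while True:
        requests.put(reply)
        task = reply.get()
        if task is None:
            return
        # Stand-in for a molecular-integral batch whose cost varies per task.
        _ = sum(i * i for i in range(costs[task]))
        with lock:
            done.append((wid, task))


def run(costs, n_workers):
    """Process all tasks dynamically; return (worker_id, task) pairs."""
    requests = queue.Queue()
    done = []
    lock = threading.Lock()
    server = threading.Thread(target=counter_server,
                              args=(requests, len(costs), n_workers))
    server.start()
    workers = [threading.Thread(target=worker,
                                args=(requests, costs, done, lock, w))
               for w in range(n_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    server.join()
    return done


if __name__ == "__main__":
    costs = [50_000, 10, 200_000, 10, 10, 120_000, 30, 80_000]
    done = run(costs, n_workers=3)
    # Workers that drew expensive tasks end up taking fewer tasks overall.
    print(collections.Counter(w for w, _ in done))
```

Because a worker fetches a new index only after finishing its current task, a worker that draws an expensive task simply takes fewer tasks, which is the load-balancing effect a static round-robin assignment cannot provide. The exact per-worker counts depend on thread scheduling and will vary between runs.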

© 2013 Society of Computer Chemistry, Japan