2013 Volume 12 Issue 2 Pages 145-155
We tuned the performance of the parallel fragment molecular orbital (FMO) program OpenFMO so that massively parallel FMO calculations run efficiently on the K computer, one of the fastest supercomputers in the world. The tuning focused on balancing the load of the small-scale molecular orbital calculations for each monomer and dimer. To keep the load balanced across processes, we used dynamic load balancing with a global counter. To keep our code portable, the global counter was implemented using the de facto standard parallelization libraries MPI and OpenMP.

In our implementation of the global counter, one thread in each group serves as the master thread of the global counter and does not calculate molecular integrals. The implementation requires MPI thread support at the MPI_THREAD_SERIALIZED level, and three kinds of code are provided depending on the kind of thread, as shown in Figure 3, Figure 4 and Figure 5.

With dynamic load balancing using our global counter, the load of the molecular integral calculation was well balanced across processes in each small-scale calculation (see Figure 7, lower panel), and the parallelization efficiency of the molecular integral part became very high (94% in a 256-way parallel execution; see Figure 8, "molecular integral part"). On the other hand, the parallelization efficiency of the SCF part was so poor that it lowered the efficiency of the monomer electronic structure calculations (see Figure 8). The large-scale performance evaluation showed that a high efficiency (93%) of coarse-grained parallelization was achieved in a 20,480-way parallel execution using the Intel Xeon PC cluster (see Figure 8 and Figure 9), and the elapsed time of the FMO calculation for a large molecule (16,764 atoms) was only 30 min.
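The idea behind dynamic load balancing with a global counter can be illustrated with a minimal single-process sketch. This is not the OpenFMO code: the implementation described above uses MPI with a dedicated master thread per group, whereas this sketch assumes a lock-protected counter shared by Python threads. The names `GlobalCounter`, `fetch_and_add` and `run_dynamic` are hypothetical, and the `costs` list is a stand-in for molecular integral batches of unequal size.

```python
import threading


class GlobalCounter:
    """Shared counter that hands out task indices one at a time (hypothetical name)."""

    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, step=1):
        # Atomically return the current value and advance the counter.
        with self._lock:
            value = self._next
            self._next += step
            return value


def run_dynamic(costs, n_workers):
    """Process tasks with unequal costs; each worker pulls its next task from the counter."""
    counter = GlobalCounter()
    done = [[] for _ in range(n_workers)]  # task indices processed by each worker

    def worker(rank):
        while True:
            i = counter.fetch_and_add()
            if i >= len(costs):
                return  # no tasks left
            sum(range(costs[i]))  # stand-in for one batch of integral work
            done[rank].append(i)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    # Assumed workload: 50 tasks whose cost varies by a factor of up to 7.
    costs = [1000 * (i % 7 + 1) for i in range(50)]
    for rank, tasks in enumerate(run_dynamic(costs, 4)):
        print(rank, len(tasks), sum(costs[i] for i in tasks))
```

Because a worker asks for a new index only when it has finished its current task, workers that draw cheap tasks simply take more of them, and no static partition of the task list is needed. Which worker processes which task varies from run to run; only the fact that every task is processed exactly once is guaranteed.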