Special purpose computers for ab initio molecular orbital calculation and first-principles density functional theory have been developed as an attempt to build a cost-effective supercomputer for personal use. They use special hardware, including special purpose LSIs, for the heavy-load processes (hot spots). The system was realized using an embedded cell architecture of the kind recently adopted in game computers and electrical appliances. The project is nicknamed the Embedded High Performance Computing system (EHPC project). A software/hardware co-design approach was taken so that development could be completed in a short period, 2000-2004. Three types of hardware (Figure 3) were developed for 5 chemical calculation programs. First, we designed the platform architecture (EHPC platform: Figure 3-a) as the base system; it has a large number of general purpose CPUs instead of special purpose LSIs and can emulate the functions of the LSI with firmware on the CPUs. Next, we developed a system (Figure 3-b) with a special purpose LSI for ab initio molecular orbital calculation and a system (Figure 3-c) with FPGAs for three-dimensional FFT calculation in the Car-Parrinello method. The special purpose LSI for molecular integral evaluation is named Eric (Electron repulsion integral calculator). Eric measures 5 mm × 10 mm and contains 2.1 MGates. Its memory size is 704 Kbytes and its clock speed is 200 MHz. Its performance/power ratio is almost the same as that of an Intel Pentium 4 (3.2GHz, 1Mb L2, 2GB Memory). The FPGA was almost four times faster than the fastest program, FFTW, on the Intel Xeon (2.4GHz).
When special purpose computers for scientific calculation are built with special hardware, including special purpose LSIs, for the heavy-load processes (hot spots), hardware design and system debugging often become serious bottlenecks in the development process. We took a software/hardware co-design approach to improve turnaround time. First, we designed the platform architecture (EHPC platform) as the base system and developed a prototype system based on it. The prototype has a large number of general purpose CPUs instead of special purpose LSIs and can emulate the functions of the LSI with firmware on the CPUs. Next, we developed the system with the special purpose LSI. We then developed several special purpose computers for Molecular Orbital Calculation (GAMESS, Fragment MO), Density Functional Theory (Car-Parrinello, DV-Xalpha), and Computational Fluid Dynamics (GSMAC-FEM). In this paper, we introduce the EHPC architecture and show implementations of some application programs on the architecture.
The execution time of Car-Parrinello (CP) based first-principles calculations is dominated by 3D FFTs of electronic-state vectors. To accelerate these parts, the authors developed 3D-FFT logic and implemented it on an FPGA board with four FPGA devices. A single board performs FFTs 10 times faster than a Xeon 2.4GHz processor. At equal power consumption, the speed-up is about 50 times. With two of these FPGA boards, the CP calculations ran 10 times faster than on a Xeon 2.4GHz processor. For further acceleration of the CP codes, we propose a dynamically reconfigurable FPGA on which both the 3D FFT and the Gram-Schmidt orthogonalization are performed.
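As an illustration of why a 3D FFT maps well onto dedicated logic, it can be decomposed into independent 1D FFTs applied along each axis in turn. The following is a minimal pure-Python sketch of that decomposition, not the authors' FPGA design; the function names `fft1d` and `fft3d` are hypothetical, and every grid dimension is assumed to be a power of two.

```python
import cmath

def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft3d(grid):
    """3D FFT of grid[x][y][z], done as three passes of 1D FFTs (z, then y, then x)."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    g = [[[complex(v) for v in row] for row in plane] for plane in grid]
    # Pass 1: every z-line is an independent 1D transform.
    for i in range(nx):
        for j in range(ny):
            g[i][j] = fft1d(g[i][j])
    # Pass 2: every y-line is an independent 1D transform.
    for i in range(nx):
        for k in range(nz):
            line = fft1d([g[i][j][k] for j in range(ny)])
            for j in range(ny):
                g[i][j][k] = line[j]
    # Pass 3: every x-line is an independent 1D transform.
    for j in range(ny):
        for k in range(nz):
            line = fft1d([g[i][j][k] for i in range(nx)])
            for i in range(nx):
                g[i][j][k] = line[i]
    return g
```

Within each pass the 1D transforms do not depend on one another, which is the property that allows many of them to run concurrently in hardware.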
Ab initio molecular orbital (MO) calculation is useful for solving many challenging problems in the development of new drugs, chemicals, polymers, materials, and so on. However, large-scale MO calculations face at least two difficulties. The first is the very large number of Electron Repulsion Integral (ERI) computations. The second is the computational complexity of a single ERI. We have developed a high-performance computing system with embedded special purpose LSIs for large-scale MO calculation, called the EHPC (Embedded High Performance Computing) system. The hierarchical parallel architecture handles the large number of ERI calculations efficiently, and the special purpose processors accelerate the computation of each single ERI. Moreover, since these processors are designed for embedded use, many of them can be arranged in a compact system. Benchmark calculations using this EHPC system show that the speed of ERI computation is directly proportional to the number of processors. The estimated power consumption is under 1 kW and the volume is 1200 × 650 × 1010 mm for a 56-processor system. This EHPC system meets the requirements of the many chemists who have to perform large-scale MO calculations on personal high-performance computers.
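To give a sense of scale for the first difficulty, the number of symmetry-unique ERIs can be counted directly. The sketch below assumes real basis functions with the usual 8-fold permutational symmetry and no integral screening; the function name `unique_eri_count` is hypothetical and the count is an upper bound, not a figure from the benchmark.

```python
def unique_eri_count(n_basis):
    """Number of unique (ij|kl) integrals for n_basis real basis functions.

    Index pairs (i, j) with i >= j give n(n+1)/2 pairs; unordered pairs of
    those pairs give the total count, which grows roughly as n**4 / 8.
    """
    pairs = n_basis * (n_basis + 1) // 2
    return pairs * (pairs + 1) // 2

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, unique_eri_count(n))
```

Because the count grows with the fourth power of the basis set size, a tenfold larger basis requires on the order of ten thousand times as many integrals, which is why both the number of ERIs and the cost of each one matter.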
First, we review the general recursive expressions for molecular integrals over contracted Gaussian functions reported by Honda, Yamaki, and Obara. Next, we propose an efficient algorithm for computing the electron repulsion integrals and present it in detail for the evaluation of <ppss> integrals. The present method is 1.5 to 2.0 times faster than GAMESS, a widely used quantum-mechanical calculation program. The algorithm is applicable to various kinds of molecular integrals, so it should be very useful for further work in theoretical chemistry.
The performance of two programs for electron repulsion integral calculation was evaluated using the performance counters of an Intel Pentium 4 processor (3.6GHz, EM64T, 1GB L2 cache). One program uses the new Obara method [6, 7], and the other uses the hybrid of the Vertical Recurrence Relation (VRR) method and the Horizontal Recurrence Relation (HRR) method. Although the new Obara method requires almost 20% fewer floating point operations than the hybrid method, its total clock cycle count, which corresponds to the wall clock time, was more than 30% larger. The performance decrease is mainly due to memory access: the new Obara method executes almost 3 times as many load/store instructions, and the level 1 data cache difference is 25 times larger than that of the hybrid. It is clear that reducing memory accesses is as important for the performance of integral calculation as reducing the number of floating point operations.
A parallel Fock matrix construction algorithm has been developed for a molecular orbital (MO) calculation specific computer (Figure 1). The MO-specific computer consists of a large number of custom processors that calculate two-electron integrals on the EHPC platform system (Figure 2), which has a layered architecture (Figure 3). To obtain high parallelization efficiency for Fock matrix construction (Figure 5) on the layered architecture, a multi-level dynamic load balancing scheme (Figure 6) is adopted, both to balance the load better and to localize communications within the tree structure of the Tree-Comm network API. The parallelized Fock matrix construction routine was implemented in the GAMESS ab initio program on the EHPC platform system, which uses SH4 processors instead of the special purpose LSI. Benchmark results on a system with 63 worker SH4 processors show extremely high parallelization efficiency (Table 1, Figure 8) on the platform system.
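The idea of multi-level dynamic load balancing can be sketched with two levels of on-demand task queues. This is an illustrative sketch only: Python threads stand in for processors, plain numbers stand in for Fock matrix contributions, and it does not use or reproduce the Tree-Comm API. The function `run_two_level` and all of its parameters are hypothetical.

```python
import queue
import threading

def run_two_level(tasks, n_groups, workers_per_group, block_size, work):
    """Upper level: groups fetch blocks of tasks from a shared queue on demand.
    Lower level: workers inside a group fetch single tasks from that block."""
    blocks = queue.Queue()
    for start in range(0, len(tasks), block_size):
        blocks.put(tasks[start:start + block_size])

    total = [0]
    total_lock = threading.Lock()

    def group():
        while True:
            try:
                block = blocks.get_nowait()  # dynamic assignment between groups
            except queue.Empty:
                return
            local = queue.Queue()
            for t in block:
                local.put(t)
            partial = [0] * workers_per_group  # one slot per worker, no sharing

            def worker(wid):
                while True:
                    try:
                        t = local.get_nowait()  # dynamic assignment inside the group
                    except queue.Empty:
                        return
                    partial[wid] += work(t)

            threads = [threading.Thread(target=worker, args=(w,))
                       for w in range(workers_per_group)]
            for th in threads:
                th.start()
            for th in threads:
                th.join()
            # Only the reduced per-group result crosses the group boundary.
            with total_lock:
                total[0] += sum(partial)

    groups = [threading.Thread(target=group) for _ in range(n_groups)]
    for th in groups:
        th.start()
    for th in groups:
        th.join()
    return total[0]

if __name__ == "__main__":
    print(run_two_level(list(range(100)), 3, 4, 16, lambda t: t * t))
```

In this arrangement most task requests are served inside a group, and only block requests and reduced results travel to the upper level, which mirrors the stated aim of keeping communication local within the tree.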
The partially direct SCF-MO method was developed to improve the parallelization efficiency of Fock matrix generation on a PC cluster without secondary storage on each processor. Some of the electron repulsion integrals are stored, together with their four indices, in a buffer (unused memory) during the first SCF cycle and are reused in later SCF cycles. This simple method achieved super-linear scalability: for example, the parallelization efficiency reached ca. 1.13 in the Fock matrix generation of the Crambin molecule (1974 basis functions) on 128 Xeon processors (2.8GHz) with a 16GB buffer area (see Figure 4). The algorithm is suitable for special purpose computers for fast evaluation of electron repulsion integrals, because recent special purpose processors usually have no secondary storage but have a relatively large main memory.
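The buffering scheme can be sketched as a fixed-capacity store keyed by the four integral indices. This is a minimal sketch under stated assumptions, not the published implementation: the class `PartiallyDirectIntegrals`, the stand-in integral routine `toy_eri`, and the capacity measured in number of integrals are all hypothetical.

```python
class PartiallyDirectIntegrals:
    """Keep as many integrals as fit in the buffer; recompute the rest."""

    def __init__(self, compute_eri, buffer_capacity):
        self.compute_eri = compute_eri
        self.capacity = buffer_capacity
        self.buffer = {}    # (i, j, k, l) -> integral value
        self.computed = 0   # number of integral evaluations performed

    def get(self, idx):
        if idx in self.buffer:
            return self.buffer[idx]          # reused from an earlier cycle
        value = self.compute_eri(*idx)
        self.computed += 1
        if len(self.buffer) < self.capacity:
            self.buffer[idx] = value         # stored only while space remains
        return value

def toy_eri(i, j, k, l):
    # Placeholder for a real ERI routine (assumption, for illustration only).
    return 1.0 / (1 + i + j + k + l)

if __name__ == "__main__":
    integrals = PartiallyDirectIntegrals(toy_eri, buffer_capacity=4)
    indices = [(i, 0, 0, 0) for i in range(10)]
    for cycle in range(2):
        total = sum(integrals.get(idx) for idx in indices)
    print(integrals.computed)  # 16: 10 in the first cycle, 6 recomputed in the second
```

If every SCF cycle requests the same integrals, the buffer fills during the first cycle and later cycles only recompute what did not fit. One plausible reading of the super-linear result is that the aggregate buffer grows with the number of processors, so a larger fraction of integrals is reused.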
We developed Xsi, an integrated environment for virtual screening based on the force field method. Two methodologies, Ligand Based Drug Design (LBDD) and Structure Based Drug Design (SBDD), are implemented on the same platform, and users can build various analysis flows with the specially designed Xsi-Script. The conformation search for small molecules is parallelized on EHPC-SH4 with an efficiency of 96.6% on 21 CPUs. As a test calculation for SBDD, re-docking was performed for a complex of c-ABL, and a sufficiently small RMSD (0.13 Å) was obtained.
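For readers comparing the efficiency figures quoted in these abstracts, the implied speed-up can be worked out as follows. The sketch assumes the common definition, efficiency = speed-up / number of processors; the abstracts do not state which definition they use, and the function name `speedup_from_efficiency` is hypothetical.

```python
def speedup_from_efficiency(efficiency, n_procs):
    """Speed-up implied by a parallel efficiency, assuming efficiency = speed-up / n_procs."""
    return efficiency * n_procs

if __name__ == "__main__":
    # 96.6% efficiency on 21 CPUs (conformation search on EHPC-SH4)
    print(round(speedup_from_efficiency(0.966, 21), 1))
    # ca. 1.13 efficiency on 128 processors (partially direct SCF)
    print(round(speedup_from_efficiency(1.13, 128), 1))
```

Under that definition, 96.6% on 21 CPUs corresponds to a speed-up of about 20.3, and an efficiency above 1 corresponds to a speed-up larger than the processor count.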