Abstract
The advent of various multi-core scalar CPUs has changed the rules of the HPC world once again. To improve the total performance of a PC cluster or an MPP, not only parallel efficiency but also the performance of each computational node is becoming more important than ever. Here, we propose a new design and programming style that optimizes the performance of a finite element code on a multi-core CPU node by taking into account its multimedia extension instruction set and cache hierarchy. Using element-by-element (EBE) operations on second-order solid tetrahedral elements for linear structural problems, we demonstrate that, on a multi-core PC, our EBE matrix-storage-free iterative solver is not only more memory-efficient but also faster than standard solvers that store the non-zero matrix components.