Abstract
The LUJ2D algorithm is a recently proposed numerical method for non-orthogonal joint diagonalization problems arising in signal processing. The original LUJ2D algorithm performs poorly on modern microprocessors because it is dominated by cache-inefficient operations. In this study, we propose a cache-efficient implementation of the LUJ2D algorithm. Experimental results show that the proposed implementation is about 1.8 times faster than the original one and achieves 21\% of the peak performance of the Opteron 1210 processor using one core.