Article ID: 2024EAP1180
Cache tiling and recursive data layouts for two-dimensional (2-D) data access have been proposed to ameliorate the poor data locality caused by conventional layouts such as row-major and column-major. However, both require non-conventional address computation involving bit-level manipulations that are not supported in current processors, and software-based tiling address calculation adds a significant execution-time overhead. In this paper, we design a cache memory with hardware-based tile/line accessibility support for 2-D data access, together with a tile-set-based tag comparison (TSTC) scheme that reduces the overall hardware-scale overhead. Our technique captures the locality benefits of the sophisticated data layouts while avoiding the cost of software-based address computation. Simulation results show that the proposed method improves the performance of matrix multiplication (MM) over both the conventional data layout and the Z-Morton order layout by reducing L1 cache, L2 cache, and Translation Lookaside Buffer (TLB) misses, especially at larger matrix sizes. We implement the proposed cache with a SIMD-based data path in 40 nm Complementary Metal-Oxide-Semiconductor (CMOS) technology. With the proposed TSTC method, the entire hardware overhead was reduced to only 10% of that required for a conventional cache, without performance degradation.
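
To make the software address-computation cost concrete, the following is a minimal sketch of a generic Z-Morton index calculation next to a row-major one. It is an illustration only and not the address mapping used by the proposed cache. The function names (`spread_bits`, `morton_index`, `row_major_index`), the 16-bit coordinate limit, and the choice to place column bits in even positions and row bits in odd positions are all assumptions made for this example.

```python
def spread_bits(v: int) -> int:
    """Spread the low 16 bits of v so that bit i moves to bit 2*i.

    Assumption: coordinates fit in 16 bits, so the result fits in 32 bits.
    """
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v


def morton_index(row: int, col: int) -> int:
    """Z-Morton element index: interleave row bits (odd) with column bits (even).

    The bit order is an assumed convention; the opposite order is equally valid.
    """
    return (spread_bits(row) << 1) | spread_bits(col)


def row_major_index(row: int, col: int, n_cols: int) -> int:
    """Conventional row-major element index, shown for comparison."""
    return row * n_cols + col


if __name__ == "__main__":
    # Print both indices for every element of a 4x4 matrix.
    for r in range(4):
        print([(row_major_index(r, c, 4), morton_index(r, c)) for c in range(4)])
```

The row-major index needs one multiply and one add, whereas the Morton index needs a sequence of shifts, ORs, and masks per coordinate on every access. This is the kind of per-access bit manipulation that a software-based layout has to pay for.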