Paper ID: 2025ECP5020
Many memory-bound AI applications, including natural language processing, transformer-based visual recognition, and multi-task online inference, rely heavily on large-scale general matrix-vector multiplication (GEMV), which exhibits strong data locality. However, existing hardware architectures for AI model inference incur significant data transfer overheads and fail to fully exploit the data locality inherent in these algorithms. We propose a scalable one-logic-two-DRAM (1L2D) multi-core near-DRAM computing accelerator for AI models based on 3D hybrid bonding. Our 3D integration of RISC-V processors with vector accelerators and DRAM significantly boosts bandwidth while reducing energy consumption. A memory access circuit supporting a page-hit mechanism and a prefetching strategy is designed to maximize utilization of the data locality exposed by the algorithm's partitioning and rearrangement of data. An interleaved memory address mapping scheme is designed to effectively enhance the bank-level parallelism of data accesses. Compared with a high-performance Intel Xeon-6230 CPU and the state-of-the-art commercially available UPMEM-PIM, the proposed architecture improves computational efficiency for large-scale GEMV by 3.4× and 2.2×, respectively. It also achieves a 3.07× improvement in bandwidth and a 76% reduction in energy consumption over HBM2-PIM.
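To make the bank-level parallelism claim concrete, the following is a minimal C sketch of a bank-interleaved address mapping. It is not the paper's actual scheme: the field widths (COL_BITS, BANK_BITS), the function name map_interleaved, and the DRAM geometry are all assumptions chosen for illustration. The idea it demonstrates is the standard one: placing the bank bits just above the column bits makes consecutive row-sized bursts rotate across banks, so sequential GEMV tiles can be serviced by different banks in parallel.

```c
/*
 * Illustrative bank-interleaved DRAM address mapping.
 * Field widths and names are hypothetical, not taken from the paper.
 */
#include <stdint.h>
#include <stdio.h>

#define COL_BITS   6   /* assumed: 64 column positions per row segment */
#define BANK_BITS  4   /* assumed: 16 banks per DRAM die               */

typedef struct {
    uint32_t row;
    uint32_t bank;
    uint32_t col;
} dram_addr_t;

/* Bank bits sit between the column and row fields, so addresses that
 * differ by one row segment map to different banks. */
static dram_addr_t map_interleaved(uint64_t linear)
{
    dram_addr_t a;
    a.col  = (uint32_t)(linear & ((1u << COL_BITS) - 1));
    a.bank = (uint32_t)((linear >> COL_BITS) & ((1u << BANK_BITS) - 1));
    a.row  = (uint32_t)(linear >> (COL_BITS + BANK_BITS));
    return a;
}

int main(void)
{
    /* Consecutive row segments land in distinct banks, enabling
     * bank-parallel fetches of sequential GEMV tiles. */
    for (uint64_t i = 0; i < 4; i++) {
        uint64_t linear = i << COL_BITS;
        dram_addr_t a = map_interleaved(linear);
        printf("addr %llu -> row %u bank %u col %u\n",
               (unsigned long long)linear, a.row, a.bank, a.col);
    }
    return 0;
}
```

Under this assumed layout, a streaming access pattern such as a partitioned GEMV touches all banks in round-robin order before reopening a row in any one of them, which is the access behavior the page-hit and prefetching circuitry described above is designed to exploit.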