Article ID: 22.20250246
For neural network accelerators that use General Matrix Multiplication (GEMM) as the computational core, the input feature maps of a convolution must first be converted into 2D matrices through the Im2col operation. Conventional approaches rely on the CPU to manage Im2col and the associated data transfers. Because overlapping convolutional windows contain redundant data, these approaches expand the memory footprint and incur non-negligible memory access energy and transmission latency overheads, which severely limits the feasibility of efficient GEMM acceleration on resource-constrained edge devices. This paper proposes a novel Low Memory Access Im2col Method (LMAI2C) and presents its dedicated hardware implementation. By restructuring data from overlapping convolutional windows, LMAI2C significantly reduces DRAM access volume while improving feature map transfer efficiency. Experimental results on the convolutional layers of the YOLOv4-tiny network show that LMAI2C reduces DDR memory access by approximately 79.8% compared with traditional methods. Furthermore, LMAI2C achieves an average speedup of 69 times over CPU-based implementations and 43 times over DMA-accelerated CPU implementations.
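
To make the memory expansion of conventional Im2col concrete, the following is a minimal sketch of the standard (software) Im2col transform, not of LMAI2C itself. It assumes a single-channel feature map stored as a list of rows, no padding, and a square kernel; the names `im2col` and `expansion_ratio` are illustrative and do not come from the paper.

```python
def im2col(fmap, k, stride=1):
    """Unfold a 2D feature map into a matrix with one flattened k*k window per row.

    Assumptions (illustrative only): single channel, no padding, square kernel.
    """
    h, w = len(fmap), len(fmap[0])
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    rows = []
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Neighbouring windows overlap, so the same input element
            # is copied into several output rows.
            window = [fmap[top + di][left + dj]
                      for di in range(k) for dj in range(k)]
            rows.append(window)
    return rows


def expansion_ratio(h, w, k, stride=1):
    """Ratio of Im2col matrix elements to original feature map elements."""
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    return (out_h * out_w * k * k) / (h * w)


# 4x4 feature map holding the values 0..15, unfolded with a 3x3 kernel.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
cols = im2col(fmap, k=3)
ratio = expansion_ratio(4, 4, 3)
```

In this toy case the 16 input elements become 4 rows of 9 elements each, so the unfolded matrix is 2.25 times larger than the feature map. For larger feature maps with a 3x3 kernel and stride 1, the ratio approaches 9, which is the redundancy that a low-memory-access Im2col scheme aims to avoid transferring.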