2025 Volume 16 Issue 1 Pages 43-63
Developing hardware for artificial intelligence (AI) training is vital. A hardware-oriented optimizer named Holmes enables faster training with a smaller memory footprint. This study developed a hardware architecture that incorporates Holmes and exploits parallelization and pipelining to achieve a significant throughput improvement. We determined the bit width required for training and used it in the architecture evaluation. We also investigated the scalability of the architecture and the effectiveness of both Holmes and pipelining. The results demonstrated that the memory footprint scales linearly with model size, that Holmes reduces the memory footprint, and that pipelining drastically increases throughput, enabling faster computation.
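The abstract does not spell out Holmes's update rule, so the following is only a rough sketch of the general technique it alludes to: storing optimizer state at a reduced bit width so that optimizer memory shrinks with the chosen precision. The functions `quantize`, `dequantize`, and `momentum_step`, and all parameter values, are hypothetical illustrations, not the paper's actual algorithm.

```python
import numpy as np

def quantize(x, bits=8):
    """Uniform symmetric quantization of a float32 array to `bits` bits.

    Hypothetical helper: returns integer codes plus a per-tensor scale,
    so the stored state occupies `bits` bits per element instead of 32.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    """Recover an approximate float32 array from codes and scale."""
    return codes.astype(np.float32) * scale

def momentum_step(params, grads, codes, scale, lr=0.01, beta=0.9):
    """One momentum-style update with the state held in low-bit form
    between steps (illustrative values for lr and beta)."""
    m = beta * dequantize(codes, scale) + (1.0 - beta) * grads
    params -= lr * m
    return params, *quantize(m)  # re-quantize the state before storing it

# Usage: a single update step on random data.
rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
g = rng.normal(size=1024).astype(np.float32)
codes, scale = quantize(np.zeros_like(w))   # initial (zero) momentum state
w, codes, scale = momentum_step(w, g, codes, scale)
```

Under this scheme the per-element optimizer state drops from 32 bits to the quantized width, which is consistent with the abstract's claim that the memory footprint scales linearly with model size at the chosen bit width.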