Paper ID: 2025VLP0007
In this paper, we propose applying bit-serial arithmetic units to reduce the circuit area of neural network inference engines. Additionally, we propose applying datapath pipelining and zero skipping to significantly reduce the required clock cycles. In recent years, studies have demonstrated the efficacy of neural networks in voice and image recognition applications; however, achieving high accuracy requires an extremely large number of multiply-and-accumulate operations. We therefore explored applying bit-serial arithmetic units to these operations to reduce circuit area. Bit-serial arithmetic computes multi-bit data sequentially, inputting and outputting one bit at a time, which reduces both circuit area and wiring. Its disadvantage is that it requires a large number of clock cycles; for example, a bit-serial multiplier with an N-bit input requires 2N cycles. In this study, pipeline processing and zero skipping were applied to reduce the required clock cycles. Zero skipping reduces the clock-cycle count by skipping the calculation for an input activation whose value is zero. We propose two zero-skipping methods: reactive zero skipping, which checks whether an activation is zero before the bit-serial operation starts, and proactive zero skipping, which reads ahead in subsequent memory locations during the bit-serial operation and skips all consecutive zeros in a single step. The effectiveness of zero skipping depends strongly on the ratio of zeros in the input activations. In a convolutional neural network (CNN) that uses a rectified linear unit (ReLU) as the activation function, the input activations of the second and subsequent convolution layers contain a high ratio of zeros. To further increase sparsity and improve the effectiveness of zero skipping, we propose setting the dropout rate during training as high as possible without degrading recognition accuracy. We implemented a CNN using the proposed bit-serial arithmetic units and a CNN using conventional parallel arithmetic units, and compared their performance. The former exhibited a 22.9% smaller circuit area than the latter. In addition, the increase in the number of required clock cycles was limited to a factor of 2.12, and the clock period was reduced by 47.4%, resulting in a 7.8% reduction in runtime.
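
To make the cycle-count argument concrete, the following Python sketch (ours, not part of the paper or its hardware design) models a bit-serial shift-and-add multiply that is charged 2N cycles per N-bit input, together with a multiply-accumulate loop that applies either reactive or proactive zero skipping. The per-skip cycle charges and the modeled serial shift-out cost are illustrative assumptions, not figures from the paper.

# Behavioral sketch: bit-serial multiply (~2N cycles per N-bit input) and
# reactive vs. proactive zero skipping. Cycle charges are assumptions.

def bit_serial_multiply(a, w, n_bits):
    """Shift-and-add multiply that consumes one bit of `a` per cycle."""
    acc = 0
    for i in range(n_bits):          # one input bit examined per cycle
        if (a >> i) & 1:
            acc += w << i            # add the shifted weight when the bit is 1
    cycles = 2 * n_bits              # assumed: N input cycles + N serial output cycles
    return acc, cycles


def mac_with_zero_skipping(activations, weights, n_bits, proactive=True):
    """Multiply-accumulate over a vector, skipping zero-valued activations.

    proactive=False: reactive skipping -- each zero is detected just before its
    bit-serial multiply would start and is skipped individually.
    proactive=True: proactive skipping -- subsequent locations are read ahead,
    so a whole run of consecutive zeros is skipped in one step.
    """
    total, cycles, i = 0, 0, 0
    while i < len(activations):
        if activations[i] == 0:
            if proactive:
                while i < len(activations) and activations[i] == 0:
                    i += 1           # jump over the entire run of zeros
                cycles += 1          # assumed: one cycle for the whole run
            else:
                i += 1               # skip this single zero element
                cycles += 1          # assumed: one cycle per skipped element
            continue
        p, c = bit_serial_multiply(activations[i], weights[i], n_bits)
        total += p
        cycles += c
        i += 1
    return total, cycles


if __name__ == "__main__":
    # Sparse activations, e.g. after ReLU in a convolution layer.
    acts = [3, 0, 0, 0, 5, 0, 7]
    wts = [2, 4, 1, 6, 3, 8, 2]
    print(mac_with_zero_skipping(acts, wts, n_bits=8, proactive=False))  # (35, 52)
    print(mac_with_zero_skipping(acts, wts, n_bits=8, proactive=True))   # (35, 50)

In this toy run, reactive skipping pays one cycle for each of the four zero activations, whereas proactive skipping collapses the run of three consecutive zeros into a single cycle; the gap widens as the zero ratio of the input activations increases.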