Article ID: 2025EAP1034
Quantization is an effective way to reduce the memory and computational costs of convolutional neural network inference. However, it remains unclear which model achieves higher recognition accuracy at a given memory or computational budget: a large model (with many parameters) quantized to an extremely low bit width (1 or 2 bits), or a small model (with few parameters) quantized to a moderately low bit width (3, 4, or 5 bits). In this paper, we define a metric that combines the number of parameters and the number of computations with the bit widths of the quantized weight parameters. Using this metric, we demonstrate that Pareto-optimal performance, where the best accuracy is attained at a given memory or computational cost, is achieved when a small model is moderately quantized, not when a large model is extremely quantized. Based on this finding, we empirically show that the Pareto frontier is improved by 4.3× in a post-training quantization scenario for a quantized ResNet-50 model on the ImageNet dataset.
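The abstract does not spell out the metric's exact form, but one plausible instantiation, consistent with "combining the numbers of parameters and computations with the bit widths of quantized weights," is to weight each layer's parameter count (memory) and MAC count (compute) by its weight bit width. The sketch below illustrates this assumed form; the function name, layer fields, and all numbers are hypothetical (the parameter/MAC figures merely resemble ResNet-50- and ResNet-18-class models), and the paper's actual definition may differ.

```python
# Minimal sketch of a bit-width-aware cost metric (assumed form, not the
# paper's verbatim definition): each layer's parameter count and MAC count
# are weighted by its quantized weight bit width.

def weighted_costs(layers):
    """Return (memory_cost_bits, compute_cost_bitops) for a quantized model.

    `layers` is a list of dicts with:
      params: number of weight parameters in the layer
      macs:   multiply-accumulate operations per inference
      bits:   bit width of the quantized weights
    """
    memory_cost = sum(l["params"] * l["bits"] for l in layers)   # total weight storage in bits
    compute_cost = sum(l["macs"] * l["bits"] for l in layers)    # bit-weighted operation count
    return memory_cost, compute_cost

# Hypothetical comparison: a large model at 2 bits vs. a small model at
# 4 bits, so the two regimes can be placed on the same cost axes.
large_2bit = [{"params": 25_000_000, "macs": 4_000_000_000, "bits": 2}]
small_4bit = [{"params": 11_000_000, "macs": 1_800_000_000, "bits": 4}]

print(weighted_costs(large_2bit))  # (50_000_000 bits, 8_000_000_000 bit-ops)
print(weighted_costs(small_4bit))  # (44_000_000 bits, 7_200_000_000 bit-ops)
```

Under this assumed metric, accuracy can be plotted against either cost for many (model size, bit width) pairs, and the Pareto frontier is the set of points no other configuration dominates, which is how the moderately quantized small model and the extremely quantized large model are compared in the paper's claim.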