2022, Volume 142, Issue 10, pp. 1156-1165
Adam is one of the most widely used optimization algorithms for neural networks, and it accelerates convergence during training. It has, however, two problems. The first is that, when applied to large-scale networks, the final performance of a network trained with Adam, such as its generalization ability, is worse than that of a network trained with SGD. The second is that the learning rate tends to be large at the early stage of training; as a result, network parameters such as weights and biases become too large within the first few iterations. In recent years, research has been conducted to address these problems. AdaBound, a method that dynamically switches from Adam to SGD, has been proposed to solve the first problem. RAdam has been proposed to solve the second problem; it applies to Adam a technique called WarmUp, which sets a small learning rate at the early stage of training and gradually increases it. In this study, we propose applying WarmUp to the upper bound of AdaBound's learning rate. The proposed algorithm prevents parameter updates with extremely large learning rates in the early stage of training, so more efficient learning can be expected than with the conventional methods. The proposed method has been applied to the training of several types of networks, including CNN, ResNet, DenseNet, and BERT. The results show that our method improves performance compared with the conventional methods, and in an image classification task it tends to be more effective for larger networks.
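The sketch below illustrates the general idea described above: an Adam-style update whose per-element step size is clipped into AdaBound-style dynamic bounds, with a linear WarmUp factor applied to the upper bound so that very early iterations cannot take extremely large steps. This is a minimal, hypothetical illustration only; the function name, the linear warmup schedule, the bound formulas, and all hyperparameter values are assumptions for exposition and not the authors' exact formulation.

```python
import numpy as np

def adabound_warmup_step(param, grad, state, t,
                         base_lr=1e-3, final_lr=0.1,
                         beta1=0.9, beta2=0.999, eps=1e-8,
                         warmup_steps=1000):
    """One illustrative parameter update: AdaBound-style step-size clipping
    with a linear WarmUp factor on the upper bound (not the exact method)."""
    # Adam first/second moment estimates with bias correction
    state['m'] = beta1 * state['m'] + (1 - beta1) * grad
    state['v'] = beta2 * state['v'] + (1 - beta2) * grad ** 2
    m_hat = state['m'] / (1 - beta1 ** t)
    v_hat = state['v'] / (1 - beta2 ** t)

    # Raw Adam step size and AdaBound-style dynamic bounds
    step_size = base_lr / (np.sqrt(v_hat) + eps)
    lower = final_lr * (1 - 1 / ((1 - beta2) * t + 1))
    upper = final_lr * (1 + 1 / ((1 - beta2) * t))

    # Assumed WarmUp: scale the upper bound up linearly so that early
    # iterations cannot use an extremely large effective learning rate.
    upper *= min(1.0, t / warmup_steps)
    upper = max(upper, lower)  # keep the clipping interval valid

    clipped = np.clip(step_size, lower, upper)
    return param - clipped * m_hat

# Usage sketch: state holds the running moments per parameter tensor.
# state = {'m': 0.0, 'v': 0.0}
# for t in range(1, num_steps + 1):
#     w = adabound_warmup_step(w, grad_fn(w), state, t)
```

Without the warmup factor, the upper bound is largest at small t, which is exactly when the second-moment estimate is least reliable; scaling it up gradually keeps early updates conservative while leaving later behavior essentially unchanged.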