Article ID: 2024EDP7235
Mean Variance Estimation networks are models capable of predicting not only the mean but also the variance of a distribution. A recent study demonstrated that using separate subnetworks for predicting the mean and the variance, and training the subnetwork for predicting the mean first (a process called warm-up) before training the subnetwork for predicting the variance, is more effective than using a single network. However, that study only used the Adam optimizer for training and neither explored quasi-Newton methods nor varied the subnetwork structures. In this study, we introduce the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method for training a Mean Variance Estimation network and examine how the selection of subnetwork structures affects performance. We conducted one experiment using a synthetic dataset and 11 experiments using real-world datasets to compare the performance of Adam, BFGS, and three other learning methods, including AdaHessian. In seven of the 11 experiments on real-world datasets, BFGS outperformed Adam and AdaHessian. The results also reveal that BFGS tended to perform better on datasets with a larger number of data points. While underfitting was a problem for learning methods other than BFGS, overfitting was the main concern for BFGS in the cases where it did not achieve the best performance; this overfitting can be mitigated with techniques such as early stopping and regularization. Additionally, BFGS required more hidden units for the subnetwork predicting the mean than for the subnetwork predicting the variance, and in some cases zero hidden units were selected as optimal for the variance subnetwork. We also observed that, for the subnetwork predicting the variance, BFGS tended to select more compact models than the other methods.
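To make the architecture and training procedure described above concrete, the following is a minimal sketch (not the authors' implementation) of a Mean Variance Estimation network with separate mean and variance subnetworks, a warm-up phase that trains only the mean subnetwork, and a subsequent quasi-Newton phase. The hidden-unit counts, learning rates, and epoch numbers are illustrative assumptions, and PyTorch's built-in optimizer is the limited-memory L-BFGS variant rather than full BFGS.

```python
import torch
import torch.nn as nn


class MVENetwork(nn.Module):
    """Mean Variance Estimation network with separate subnetworks."""

    def __init__(self, n_in, mean_hidden=40, var_hidden=10):
        super().__init__()
        # Subnetwork that predicts the mean
        self.mean_net = nn.Sequential(
            nn.Linear(n_in, mean_hidden), nn.Tanh(), nn.Linear(mean_hidden, 1)
        )
        # Subnetwork that predicts the variance (softplus keeps it positive)
        self.var_net = nn.Sequential(
            nn.Linear(n_in, var_hidden), nn.Tanh(), nn.Linear(var_hidden, 1)
        )

    def forward(self, x):
        mean = self.mean_net(x)
        var = nn.functional.softplus(self.var_net(x)) + 1e-6
        return mean, var


def gaussian_nll(mean, var, y):
    # Negative log-likelihood of y under N(mean, var), up to a constant
    return 0.5 * (torch.log(var) + (y - mean) ** 2 / var).mean()


def train(model, x, y, warmup_epochs=100, epochs=200):
    # Warm-up: fit only the mean subnetwork with a squared-error loss,
    # which corresponds to assuming a fixed unit variance.
    warm_opt = torch.optim.Adam(model.mean_net.parameters(), lr=1e-3)
    for _ in range(warmup_epochs):
        warm_opt.zero_grad()
        mean, _ = model(x)
        loss = ((y - mean) ** 2).mean()
        loss.backward()
        warm_opt.step()

    # Main phase: train all parameters on the Gaussian NLL with a
    # quasi-Newton optimizer (L-BFGS here; full BFGS would be analogous).
    opt = torch.optim.LBFGS(model.parameters(), lr=0.1, max_iter=20)
    for _ in range(epochs):
        def closure():
            opt.zero_grad()
            mean, var = model(x)
            loss = gaussian_nll(mean, var, y)
            loss.backward()
            return loss
        opt.step(closure)
```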