This study investigates the mean-variance (MV) trade-off in reinforcement learning (RL), an instance of sequential decision-making under uncertainty. Our objective is to obtain MV-efficient policies, whose means and variances lie on the Pareto-efficient frontier with respect to the MV trade-off; on this frontier, any increase in the expected reward necessitates a corresponding increase in variance, and vice versa. To this end, we propose a method that trains the policy to maximize the expected quadratic utility, defined as a weighted sum of the first and second moments of the rewards obtained under the policy. We then demonstrate that the maximizer of this objective is indeed an MV-efficient policy. Previous studies that address the MV trade-off via constrained optimization have encountered computational challenges. Our approach is more computationally efficient because it eliminates the gradient estimation of the variance, a contributing factor to the double-sampling issue in existing methods. Through experiments, we validate the efficacy of our approach.
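The abstract does not spell out the objective or the estimator, so the following is only a rough illustration of why a quadratic-utility objective sidesteps double sampling: a minimal, hypothetical REINFORCE-style sketch assuming the utility u(R) = R - (lam/2) R^2 and made-up function and variable names (quadratic_utility, policy_gradient_estimate, lam). Because u(R) depends on a single sampled return, each trajectory yields a valid single-sample gradient term, whereas the gradient of Var(R) couples E[R] with its own gradient and would need two independent samples.

```python
import numpy as np

def quadratic_utility(ret, lam):
    """u(R) = R - (lam / 2) * R^2; its expectation is a weighted sum of the
    first and second moments of the return (illustrative form, not the paper's exact definition)."""
    return ret - 0.5 * lam * ret ** 2

def policy_gradient_estimate(returns, score_grads, lam):
    """REINFORCE-style single-sample estimate: mean over trajectories of u(R_i) * grad log pi(tau_i).

    returns     : array of shape (n,)   -- sampled episodic returns R_i
    score_grads : array of shape (n, d) -- grad_theta log pi(tau_i) for each trajectory
    lam         : risk-aversion weight on the second moment
    """
    weights = quadratic_utility(np.asarray(returns), lam)  # u(R_i) per trajectory
    return (weights[:, None] * np.asarray(score_grads)).mean(axis=0)

# Toy usage with made-up numbers: 4 trajectories, 3 policy parameters.
returns = [1.0, 0.5, 2.0, 1.5]
score_grads = np.random.default_rng(0).normal(size=(4, 3))
print(policy_gradient_estimate(returns, score_grads, lam=0.1))
```

Note that no second, independent batch of trajectories is needed in this sketch; that is the computational advantage the abstract attributes to replacing a variance term with the second moment.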