Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
36th (2022)
Session ID : 2C5-GS-2-03
Max-Min Off-Policy Actor-Critic with Robustness to Model Misspecification
*Takumi TANABE, Rei SATO, Kazuto FUKUCHI, Jun SAKUMA, Youhei AKIMOTO
Abstract

In reinforcement learning, since it is costly and risky to train policies in the real world, policies trained in a simulation environment are often transferred to the real world. However, because the simulation environment does not perfectly mimic the real-world environment, modeling errors may occur. We focus on scenarios in which a simulation environment with an uncertainty parameter and the set of its possible values are available. The objective is to optimize the worst-case performance over the uncertainty parameter set in order to guarantee performance in the corresponding real-world environment, provided that it is covered by the uncertainty parameter set. We propose the Max-Min Twin Delayed Deep Deterministic Policy Gradient algorithm (M2TD3) and its soft variant (SoftM2TD3) to solve this max-min optimization problem and obtain a policy that optimizes the worst-case performance. Experiments in MuJoCo environments show that the proposed method exhibits better worst-case performance than several baseline approaches.
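
As a rough sketch of the objective described above (the notation here is assumed for illustration and does not appear in the abstract): let \pi denote a policy, \omega an uncertainty parameter taking values in a set \Omega, and J(\pi, \omega) the expected discounted return obtained by running \pi in the simulation environment configured by \omega. The worst-case optimization problem can then be written as

\max_{\pi} \min_{\omega \in \Omega} J(\pi, \omega),
\qquad
J(\pi, \omega) = \mathbb{E}_{\pi, \omega}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right],

where \gamma is the discount factor and r_t the reward at step t. M2TD3 and SoftM2TD3 are actor-critic methods for this max-min problem; a policy attaining a high value of the inner minimum is guaranteed to perform at least that well in any environment within \Omega, including the real-world environment if it is covered by the set.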

© 2022 The Japanese Society for Artificial Intelligence