This paper is concerned with developing policy gradient methods that gracefully scale up to challenging problems with high-dimensional state and action spaces. Towards this end, we develop a scheme that uses value functions to substantially reduce the variance of policy gradient estimates, while introducing a tolerable amount of bias. This scheme, which we call generalized advantage estimation (GAE), involves using a discounted sum of temporal difference residuals as an estimate of the advantage function, and can be interpreted as a type of automated cost shaping. It is simple to implement and can be used with a variety of policy gradient methods and value function approximators. Along with this variance-reduction scheme, we use trust region algorithms to optimize the policy and value function, both represented as neural networks. We present experimental results on a number of highly challenging 3D locomotion tasks, where our approach learns complex gaits for bipedal and quadrupedal simulated robots. We also learn controllers for the biped getting up off the ground. In contrast to prior work that uses hand-crafted low-dimensional policy representations, our neural network policies map directly from raw kinematics to joint torques.
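As a rough illustration of the estimator described above, the following minimal sketch computes a discounted sum of temporal difference residuals over a single trajectory. The function name `gae_advantages`, the argument layout (a `values` list one element longer than `rewards`, so the last entry bootstraps the final state), and the default values of `gamma` and `lam` are assumptions made for this example, not details taken from the abstract.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Estimate advantages as a discounted sum of TD residuals.

    Assumed layout (hypothetical, for illustration only):
      rewards: list of T rewards r_0 .. r_{T-1}
      values:  list of T + 1 value estimates V(s_0) .. V(s_T);
               the final entry is the bootstrap value (0.0 if terminal)
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum accumulated so far.
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Residuals further in the future are discounted by (gamma * lam).
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

In this sketch, setting `lam` to 0 reduces each estimate to the one-step TD residual (lower variance, more bias from the value function), while setting `lam` to 1 sums all discounted residuals along the trajectory (less bias, higher variance).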