Deep Reinforcement Learning for Model-Free Continuous Control with an Emphasis on Trust Region Policy Optimization
Abstract
Reinforcement learning is a general framework for optimizing the behavioural policy of an agent in an environment that issues a scalar reward indicating how well the agent is performing. Reinforcement learning algorithms can be coarsely divided into two groups based on whether they incorporate models of the state transition dynamics or not. While models enable a designer to embed prior domain knowledge and thus reduce the sample complexity of the resulting algorithm, the model-free regime provides a conceptually elegant formulation for solving tasks in problem domains where such models are not as easily expressed.
The goal of this thesis was to investigate the merits of model-free deep reinforcement learning in continuous action spaces. Behavioural policies were represented by artificial neural networks, a popular class of flexible function approximators. A literature study was performed, and references to state-of-the-art algorithms were provided. The advantages and disadvantages of the approach were discussed on the basis of experiments conducted in the MuJoCo simulator with the Trust Region Policy Optimization algorithm.
The results showed that efficient utilization of computational resources was important. A novel method for computing Gauss-Newton vector products with reverse-mode automatic differentiation engines was derived. In addition, an efficient batched action sampling scheme was proposed, which resulted in a 3-fold reduction in total training time. In particular, variance reduction techniques and sufficiently long time horizons were found to be important for the performance of the policy.
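To illustrate what a Gauss-Newton (Fisher) vector product computed purely with a reverse-mode automatic differentiation engine can look like, the following minimal PyTorch sketch shows the widely used double-backward trick found in typical TRPO implementations. It is not the specific method derived in the thesis; the Gaussian policy, problem dimensions, and damping constant are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sketch: Fisher/Gauss-Newton vector products via two reverse-mode
# passes ("double backward"), as commonly used in TRPO-style optimizers.
# The policy, dimensions, and damping below are hypothetical.

obs_dim, act_dim, batch = 8, 2, 64
policy = nn.Linear(obs_dim, act_dim)          # mean of a Gaussian policy
log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log std
params = list(policy.parameters()) + [log_std]

obs = torch.randn(batch, obs_dim)

def mean_kl():
    # KL divergence between the current policy and a frozen copy of itself;
    # its Hessian at this point equals the Fisher / Gauss-Newton matrix.
    mu, std = policy(obs), log_std.exp()
    mu0, std0 = mu.detach(), std.detach()
    kl = torch.log(std / std0) + (std0**2 + (mu0 - mu)**2) / (2 * std**2) - 0.5
    return kl.sum(dim=-1).mean()

def gnvp(v, damping=1e-2):
    # First reverse pass: gradient of the mean KL w.r.t. the parameters,
    # keeping the graph so it can be differentiated again.
    grads = torch.autograd.grad(mean_kl(), params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    # Second reverse pass on (grad . v): yields F v without materializing F.
    gv = (flat_grad * v).sum()
    hv = torch.autograd.grad(gv, params)
    return torch.cat([h.reshape(-1) for h in hv]) + damping * v

n = sum(p.numel() for p in params)
print(gnvp(torch.randn(n)).shape)  # matrix-vector product of size n
```

In TRPO, such a matrix-vector product is typically passed to a conjugate gradient solver to approximate the natural gradient step without ever forming the Fisher matrix explicitly.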