Data Efficient Deep Reinforcement Learning through Model-Based Intrinsic Motivation

Nylend, Mikkel Sannes

Master thesis

Permanent link

http://hdl.handle.net/11250/2454351

Publication date

2017

Collections

Institutt for datateknologi og informatikk

Abstract

In the last few years, the field of reinforcement learning (RL) has advanced greatly, largely thanks to deep learning. By introducing deep neural networks into RL, agents can learn complex behaviors just by observing a game screen, much as humans learn to play games. Despite this progress, one limitation makes the transition to real-world problems difficult: data efficiency.

One way to improve the data efficiency of RL is to approximate a model of the environment, an approach called model-based RL. Although model-based agents can be more data efficient, they are usually computationally heavy and often end up being too inaccurate. In this thesis, we explore the use of deep dynamics models (DDM) trained dynamically in environments with high-dimensional state representations. Furthermore, we study four different ways of calculating curiosity-based intrinsic motivation extracted from the DDM to achieve more efficient exploration.

The DDM consists of an autoencoder (AE) and a transition prediction model that operates in the latent space generated by the AE. We introduce the first intrinsic bonus as the AE reconstruction error. The second is based on the prediction error of the DDM. The third bonus introduces a novel idea: using MC dropout, presented in (Gal & Ghahramani 2015), to extract the uncertainty of the DDM. The last type of intrinsic bonus extracts the uncertainty by MC dropout from a bootstrapped DDM. Interestingly, the proposed bonus based on MC dropout outperforms the more commonly used bonus based on dynamics prediction errors in both data efficiency and final performance in the Atari 2600 domain. Additionally, we manage to have agents learn from intrinsic reward alone, without any extrinsic rewards from the environment.
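
To make the first three bonuses concrete, the following is a minimal NumPy sketch under stated assumptions, not the thesis implementation. The layer sizes, the random untrained weights, the single hidden layer, and all function names (`encode`, `decode`, `predict_next_latent`, and the three `*_bonus` functions) are hypothetical and chosen only for illustration; the thesis itself works with deep networks on Atari 2600 frames. The MC dropout bonus is approximated here as the variance across several stochastic forward passes with dropout left on. The fourth bonus (MC dropout on a bootstrapped DDM) is not shown.

```python
import numpy as np

_rng = np.random.default_rng(0)

# Hypothetical sizes and untrained random weights, for illustration only.
OBS_DIM, LATENT_DIM, N_ACTIONS, HIDDEN = 16, 4, 3, 32
W_enc = _rng.normal(0.0, 0.1, (OBS_DIM, LATENT_DIM))
W_dec = _rng.normal(0.0, 0.1, (LATENT_DIM, OBS_DIM))
W_h = _rng.normal(0.0, 0.1, (LATENT_DIM + N_ACTIONS, HIDDEN))
W_out = _rng.normal(0.0, 0.1, (HIDDEN, LATENT_DIM))


def encode(obs):
    """AE encoder: observation -> latent code."""
    return np.tanh(obs @ W_enc)


def decode(z):
    """AE decoder: latent code -> reconstructed observation."""
    return z @ W_dec


def predict_next_latent(z, action, drop_rate=0.0, rng=_rng):
    """Transition model in latent space, with optional dropout on the hidden layer."""
    one_hot = np.zeros(N_ACTIONS)
    one_hot[action] = 1.0
    h = np.maximum(0.0, np.concatenate([z, one_hot]) @ W_h)  # ReLU
    if drop_rate > 0.0:
        mask = rng.random(HIDDEN) >= drop_rate
        h = h * mask / (1.0 - drop_rate)  # inverted dropout scaling
    return h @ W_out


def reconstruction_bonus(obs):
    """Bonus 1: mean squared AE reconstruction error."""
    return float(np.mean((decode(encode(obs)) - obs) ** 2))


def prediction_error_bonus(obs, action, next_obs):
    """Bonus 2: mean squared error between predicted and encoded next latent."""
    predicted = predict_next_latent(encode(obs), action)
    return float(np.mean((predicted - encode(next_obs)) ** 2))


def mc_dropout_bonus(obs, action, n_samples=20, drop_rate=0.5, rng=_rng):
    """Bonus 3: variance of predictions over stochastic passes (MC dropout)."""
    z = encode(obs)
    samples = np.stack(
        [predict_next_latent(z, action, drop_rate, rng) for _ in range(n_samples)]
    )
    return float(np.mean(np.var(samples, axis=0)))


obs = _rng.normal(size=OBS_DIM)
next_obs = _rng.normal(size=OBS_DIM)
print(reconstruction_bonus(obs))
print(prediction_error_bonus(obs, 1, next_obs))
print(mc_dropout_bonus(obs, 1))
```

In an agent, one of these values would typically be scaled and added to the extrinsic reward, or used alone when learning from intrinsic reward only. Note that the MC dropout bonus needs only the current state and action, whereas the prediction error bonus also needs the observed next state.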

Publisher

NTNU