
dc.contributor.advisor  Ruocco, Massimiliano
dc.contributor.advisor  Chandra, Arjun
dc.contributor.advisor  Castejon, Humberto
dc.contributor.author  Nylend, Mikkel Sannes
dc.date.accessioned  2017-09-12T14:00:24Z
dc.date.available  2017-09-12T14:00:24Z
dc.date.created  2017-06-08
dc.date.issued  2017
dc.identifier  ntnudaim:16201
dc.identifier.uri  http://hdl.handle.net/11250/2454351
dc.description.abstract  In the last few years, the field of reinforcement learning (RL) has seen great advances, largely thanks to deep learning. By introducing deep neural networks into RL, agents can learn complex behaviors just by observing a game screen, much like humans learn to play games. One limitation, however, makes the transition to real-world problems difficult: data efficiency. One way to improve the data efficiency of RL is to approximate a model of the environment, an approach known as model-based RL. Although model-based agents can be more data efficient, they are usually computationally heavy and often end up being too inaccurate. In this thesis, we explore the use of deep dynamics models (DDM) trained dynamically in environments with high-dimensional state representations. Furthermore, we study four ways of calculating curiosity-based intrinsic motivation from the DDM to achieve more efficient exploration. The DDM consists of an autoencoder (AE) and a transition prediction model that operates in the latent space generated by the AE. The first intrinsic bonus is the AE reconstruction error. The second is based on the prediction error of the DDM. The third introduces the novel idea of using MC dropout, presented by Gal & Ghahramani (2015), to extract the uncertainty of the DDM. The last extracts the uncertainty via MC dropout from a bootstrapped DDM. Interestingly, the proposed bonus based on MC dropout outperforms the more commonly used bonus based on dynamics prediction errors in both data efficiency and final performance in the Atari 2600 domain. Additionally, agents are able to learn while receiving only intrinsic rewards and no extrinsic rewards from the environment. (An illustrative sketch of these bonuses follows the record below.)
dc.language  eng
dc.publisher  NTNU
dc.subject  Computer Science (2-year), Artificial Intelligence
dc.title  Data Efficient Deep Reinforcement Learning through Model-Based Intrinsic Motivation
dc.type  Master thesis
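
The abstract describes a DDM built from an autoencoder plus a latent-space transition model, and four intrinsic bonuses derived from it. Below is a minimal, illustrative sketch in PyTorch of how such bonuses could be computed; it is not the thesis implementation. The names (ToyDDM, intrinsic_bonuses), layer sizes, one-hot action encoding, and number of MC samples are assumptions made for illustration, and the bootstrapped variant is only indicated in a comment.

```python
import torch
import torch.nn as nn


class ToyDDM(nn.Module):
    """Toy deep dynamics model (DDM): autoencoder + latent transition predictor.

    Dropout layers are included so repeated stochastic forward passes can serve
    as MC-dropout samples for estimating model uncertainty.
    """

    def __init__(self, obs_dim=64, act_dim=4, latent_dim=8, p_drop=0.1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 32), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(),
            nn.Linear(32, obs_dim))
        # Predicts the next latent state from the current latent state and a one-hot action.
        self.transition = nn.Sequential(
            nn.Linear(latent_dim + act_dim, 32), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(32, latent_dim))


def intrinsic_bonuses(ddm, obs, action_onehot, next_obs, n_mc=10):
    """Compute three of the intrinsic bonus variants for a batch of transitions."""
    ddm.train()  # keep dropout active so repeated passes behave as MC-dropout samples
    with torch.no_grad():
        z = ddm.encoder(obs)
        z_next = ddm.encoder(next_obs)

        # 1) Autoencoder reconstruction error.
        recon_bonus = (ddm.decoder(z) - obs).pow(2).mean(dim=-1)

        # 2) Latent transition prediction error.
        z_next_pred = ddm.transition(torch.cat([z, action_onehot], dim=-1))
        pred_bonus = (z_next_pred - z_next).pow(2).mean(dim=-1)

        # 3) MC-dropout uncertainty: variance over repeated stochastic forward passes.
        samples = torch.stack([
            ddm.transition(torch.cat([ddm.encoder(obs), action_onehot], dim=-1))
            for _ in range(n_mc)])
        mc_bonus = samples.var(dim=0).mean(dim=-1)

        # 4) The bootstrapped variant would repeat (3) across an ensemble of DDMs
        #    trained on bootstrapped data and combine their uncertainties (omitted here).
    return recon_bonus, pred_bonus, mc_bonus


# Example usage on random data (in practice, observations would be Atari screen features).
ddm = ToyDDM()
obs, next_obs = torch.rand(32, 64), torch.rand(32, 64)
actions = nn.functional.one_hot(torch.randint(0, 4, (32,)), num_classes=4).float()
r_recon, r_pred, r_mc = intrinsic_bonuses(ddm, obs, actions, next_obs)
```

In a training loop, one of these bonuses would typically be scaled and added to (or used in place of) the extrinsic reward before the RL update.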

