Intrinsic Motivation from Distributional Reinforcement Learning
Abstract
Reinforcement learning is learning to behave optimally, with respect to an external observer, through interactions with an environment. An agent repeatedly tries to accomplish a goal, each trial yielding some more information about the environment. Recent work by Bellemare et al. (2017) introduces a technique, C51, that extends the point estimate of future reward to a probability distribution. This opens the door for new action-selection schemes and exploration strategies. It is also a possible source of intrinsic motivation, using uncertainty to generate directed exploration. Recent work by Moerland et al. (2018) presents promising results when using distributions to explore in a deterministic MDP setting by way of Thompson sampling. Their results also show that this way of representing returns is a valid option for guiding exploration. This thesis introduces a novel way of computing intrinsic reward based on the distributions from the C51 algorithm. The resulting intrinsic reward enables the agent to quickly explore a new environment, resulting in performance on par with Moerland et al. (2018) in the randomized Chain environment.