Intrinsic Motivation from Distributional Reinforcement Learning
Abstract
Reinforcement learning is learning to behave optimally, with respect to an external observer, through interactions with an environment. An agent repeatedly tries to accomplish a goal, each trial yielding some more information about the environment. Recent work by Bellemare et al. (2017) introduces a technique, C51, that extends the point estimate of future reward to a probability distribution. This opens the door for new action-selection schemes and exploration strategies. It is also a possible source of intrinsic motivation, using uncertainty to generate directed exploration. Recent work by Moerland et al. (2018) presents promising results when using distributions to explore in a deterministic MDP setting by way of Thompson sampling. Their results also show that this way of representing returns is a valid option for guiding exploration. This thesis introduces a novel way of computing intrinsic reward based on the distributions from the C51 algorithm. The resulting intrinsic reward enables the agent to quickly explore a new environment, resulting in performance on par with Moerland et al. (2018) in the randomized Chain environment.