Accelerating Reinforcement Learning with Suboptimal Guidance

Bøhn, Eivind Eigil; Moe, Signe; Johansen, Tor Arne

dc.contributor.author	Bøhn, Eivind Eigil
dc.contributor.author	Moe, Signe
dc.contributor.author	Johansen, Tor Arne
dc.date.accessioned	2020-11-09T08:58:37Z
dc.date.available	2020-11-09T08:58:37Z
dc.date.created	2020-11-06T15:17:44Z
dc.date.issued	2020
dc.identifier.issn	2405-8963
dc.identifier.uri	https://hdl.handle.net/11250/2686871
dc.description.abstract	Reinforcement learning in domains with sparse rewards is a difficult problem, and a large part of the training process is often spent searching the state space in a more or less random fashion for learning signals. For control problems, we often have some controller readily available which might be suboptimal but nevertheless solves the problem to some degree. This controller can be used to guide the initial exploration phase of the learning controller towards reward-yielding states, reducing the time before refinement of a viable policy can be initiated. To achieve such exploration guidance while also allowing the learning controller to outperform the demonstrations provided to it, Nair et al. (2017) propose to use a "Q-filter" to select states where the agent should clone the behaviour of the demonstrations. The Q-filter selects states where the critic deems the demonstrations to be superior to the agent, providing a natural way to adjust the guidance in a manner that is adaptive to the proficiency of the demonstrator. The contribution of this paper lies in adapting the Q-filter concept from pre-recorded demonstrations to an online guiding controller, and further in identifying shortcomings in the formulation of the Q-filter and suggesting some ways these issues can be mitigated (notably by replacing the value comparison baseline with the guiding controller's own value function), reducing the effects of stochasticity in the neural network value estimator. These modifications are tested on the OpenAI Gym Fetch environments, showing clear improvements in adaptivity and yielding increased performance in all robotics environments tested.	en_US
dc.language.iso	eng	en_US
dc.publisher	Elsevier	en_US
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/deed.no	*
dc.title	Accelerating Reinforcement Learning with Suboptimal Guidance	en_US
dc.type	Peer reviewed	en_US
dc.type	Journal article	en_US
dc.description.version	acceptedVersion	en_US
dc.source.journal	IFAC-PapersOnLine	en_US
dc.identifier.cristin	1845718
dc.relation.project	Norges forskningsråd: 223254	en_US
dc.relation.project	Norges forskningsråd: 272402	en_US
dc.description.localcode	© 2020. This is the authors’ accepted and refereed manuscript to the article. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/	en_US
cristin.ispublished	true
cristin.fulltext	postprint
cristin.qualitycode	1
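
The Q-filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`q_filter_mask`, `filtered_bc_loss`), the squared-error cloning loss, and the choice to average over the selected states are assumptions made for illustration. The critic values are passed in as plain arrays, so no particular network or library API is implied. The sketch shows the baseline formulation attributed to Nair et al. (2017), in which the guide's action is cloned only in states where the critic rates it above the agent's own action; the paper's proposed change of comparison baseline is not reproduced here.

```python
import numpy as np


def q_filter_mask(q_guide, q_agent):
    """Return True for states where the critic values the guide's action
    above the agent's own action (hypothetical helper)."""
    return np.asarray(q_guide, dtype=float) > np.asarray(q_agent, dtype=float)


def filtered_bc_loss(agent_actions, guide_actions, q_guide, q_agent):
    """Behaviour-cloning loss restricted to the states the Q-filter selects.

    Assumption: squared error per state, averaged over the selected states.
    """
    mask = q_filter_mask(q_guide, q_agent)
    diff = np.asarray(agent_actions, dtype=float) - np.asarray(guide_actions, dtype=float)
    sq_err = np.sum(diff ** 2, axis=-1)  # one squared error per state
    if not mask.any():
        return 0.0  # no state selected, so no cloning signal
    return float(np.sum(sq_err * mask) / mask.sum())


# Toy batch of three states with two-dimensional actions.
agent_actions = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
guide_actions = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
q_guide = [1.0, 0.5, 0.2]  # critic estimate for the guide's action
q_agent = [0.5, 0.6, 0.3]  # critic estimate for the agent's action

mask = q_filter_mask(q_guide, q_agent)
loss = filtered_bc_loss(agent_actions, guide_actions, q_guide, q_agent)
```

In the toy batch only the first state passes the filter, so the cloning term ignores the third state even though the agent and the guide disagree most strongly there. As the agent improves and the critic rates its actions higher, fewer states are selected and the guidance fades out.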

Associated file(s)

Filename: eeb_ifac.pdf
Size: 2.492Mb
Format: PDF
Description: Moe

This item appears in the following collection(s)

Institutt for teknisk kybernetikk [3741]
Publikasjoner fra CRIStin - NTNU [38228]

Except where otherwise noted, this item is licensed as Attribution-NonCommercial-NoDerivatives 4.0 International