
dc.contributor.author	Bøhn, Eivind Eigil
dc.contributor.author	Moe, Signe
dc.contributor.author	Johansen, Tor Arne
dc.date.accessioned	2020-11-09T08:58:37Z
dc.date.available	2020-11-09T08:58:37Z
dc.date.created	2020-11-06T15:17:44Z
dc.date.issued	2020
dc.identifier.issn	2405-8963
dc.identifier.uri	https://hdl.handle.net/11250/2686871
dc.description.abstract	Reinforcement learning in domains with sparse rewards is a difficult problem, and a large part of the training process is often spent searching the state space in a more or less random fashion for learning signals. For control problems, we often have some controller readily available which might be suboptimal but nevertheless solves the problem to some degree. This controller can be used to guide the initial exploration phase of the learning controller towards reward-yielding states, reducing the time before refinement of a viable policy can be initiated. To achieve such exploration guidance while also allowing the learning controller to outperform the demonstrations provided to it, Nair et al. (2017) propose to use a "Q-filter" to select states where the agent should clone the behaviour of the demonstrations. The Q-filter selects states where the critic deems the demonstrations to be superior to the agent, providing a natural way to adjust the guidance in a manner that is adaptive to the proficiency of the demonstrator. The contribution of this paper lies in adapting the Q-filter concept from pre-recorded demonstrations to an online guiding controller, and further in identifying shortcomings in the formulation of the Q-filter and suggesting some ways these issues can be mitigated, notably by replacing the value-comparison baseline with the guiding controller's own value function, thereby reducing the effects of stochasticity in the neural network value estimator. These modifications are tested on the OpenAI Gym Fetch environments, showing clear improvements in adaptivity and yielding increased performance in all robotics environments tested.	en_US
dc.language.iso	eng	en_US
dc.publisher	Elsevier	en_US
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/deed.no
dc.title	Accelerating Reinforcement Learning with Suboptimal Guidance	en_US
dc.type	Peer reviewed	en_US
dc.type	Journal article	en_US
dc.description.version	acceptedVersion	en_US
dc.source.journal	IFAC-PapersOnLine	en_US
dc.identifier.cristin	1845718
dc.relation.project	Norges forskningsråd: 223254	en_US
dc.relation.project	Norges forskningsråd: 272402	en_US
dc.description.localcode	© 2020. This is the authors' accepted and refereed manuscript of the article. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/	en_US
cristin.ispublished	true
cristin.fulltext	postprint
cristin.qualitycode	1
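
For illustration only, below is a minimal sketch of the Q-filter masking described in the abstract above: behaviour cloning towards the guiding controller is applied only in states where the critic rates the guide's action above the agent's own. This is not the authors' code; the squared-error imitation term and all names (q_filtered_bc_loss, q_guide_actions, etc.) are assumptions made for the example.

    import numpy as np

    def q_filtered_bc_loss(q_agent_actions, q_guide_actions, agent_actions, guide_actions):
        """Behaviour-cloning loss masked by a Q-filter (hypothetical sketch).

        Assumed inputs, for illustration only:
          q_agent_actions -- critic values Q(s, pi(s)) for the agent's own actions, shape (batch,)
          q_guide_actions -- critic values Q(s, a_guide) for the guiding controller's actions, shape (batch,)
          agent_actions   -- actions proposed by the learning policy, shape (batch, act_dim)
          guide_actions   -- actions proposed by the guiding controller, shape (batch, act_dim)
        """
        # Q-filter: clone the guide only where the critic deems it better than the agent.
        mask = (q_guide_actions > q_agent_actions).astype(np.float64)       # (batch,)
        # Squared-error imitation term per sample, zeroed out where the filter is off.
        per_sample = np.sum((agent_actions - guide_actions) ** 2, axis=-1)  # (batch,)
        return np.mean(mask * per_sample)

    # Tiny usage example with random data.
    rng = np.random.default_rng(0)
    batch, act_dim = 4, 2
    loss = q_filtered_bc_loss(
        q_agent_actions=rng.normal(size=batch),
        q_guide_actions=rng.normal(size=batch),
        agent_actions=rng.normal(size=(batch, act_dim)),
        guide_actions=rng.normal(size=(batch, act_dim)),
    )
    print(loss)

The modification proposed in the paper replaces the agent-side comparison baseline (q_agent_actions above) with an estimate from the guiding controller's own value function; that variant is not sketched here.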



Except where otherwise noted, this item is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International.