Combining Q-learning and Deterministic Policy Gradient for Learning-Based MPC
Peer reviewed, Journal article
Published version
Permanent link: https://hdl.handle.net/11250/3132876
Publication date: 2023
Original version: IEEE Conference on Decision and Control. Proceedings. 2023, 62. 10.1109/CDC49753.2023.10383562

Abstract
This paper considers adjusting a fully parametrized model predictive control (MPC) scheme so that it approximates the optimal policy of a system as accurately as possible. By adopting MPC as a function approximator in reinforcement learning (RL), the MPC parameters can be adjusted via Q-learning or policy gradient methods. However, each method has specific shortcomings when used alone. Q-learning does not exploit information about the policy gradient and may therefore fail to capture the optimal policy, while policy gradient methods miss any cost-function corrections that do not affect the policy directly. The former is a general problem, whereas the latter arises specifically in economic problems. Moreover, second-order steps are notoriously difficult to perform in the context of policy gradient methods, while they are straightforward in the context of Q-learning. This calls for an organic combination of these learning algorithms, in order to fully exploit the MPC parameterization as well as speed up convergence in learning.
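To make the two update rules concrete, the sketch below combines a Q-learning (critic) step and a deterministic policy-gradient (actor) step on a toy scalar linear-quadratic problem. This is a generic actor-critic illustration of the combination, not the paper's MPC parameterization: the quadratic critic features, the linear policy gain `k`, and all constants (`A`, `B`, `R`, step sizes) are assumptions chosen for the example.

```python
import numpy as np

# Toy problem (assumed for illustration): dynamics s' = A s + B u,
# stage cost s^2 + R u^2, discount gamma.
A, B, R, gamma = 0.9, 0.5, 0.1, 0.95
rng = np.random.default_rng(0)

# Critic: quadratic Q_w(s, u) = w1 s^2 + w2 s u + w3 u^2, features phi.
w = np.array([1.0, 0.5, 0.2])
# Actor: linear policy pi_k(s) = -k s.
k = 0.3
alpha_q, alpha_pg = 0.01, 0.005

def phi(s, u):
    return np.array([s * s, s * u, u * u])

s = 1.0
for t in range(30000):
    if t % 20 == 0:                                 # occasional resets for excitation
        s = rng.normal(0.0, 1.0)
    u = -k * s + 0.5 * rng.standard_normal()        # exploratory action
    cost = s * s + R * u * u
    s_next = A * s + B * u
    # Q-learning (critic) step: TD error against the greedy (cost-minimizing)
    # action; the denominator floor and the clip guard against an
    # ill-conditioned critic early in training (robustness assumption).
    u_greedy = np.clip(-w[1] * s_next / (2.0 * max(w[2], 0.05)), -5.0, 5.0)
    delta = cost + gamma * phi(s_next, u_greedy) @ w - phi(s, u) @ w
    w += alpha_q * delta * phi(s, u)
    # Deterministic policy-gradient (actor) step: descend Q along
    # d(pi)/dk = -s, using dQ/du = w2 s + 2 w3 u at the policy action.
    dQ_du = w[1] * s + 2.0 * w[2] * (-k * s)
    k -= alpha_pg * (-s) * dQ_du
    s = s_next

print(f"learned feedback gain k = {k:.3f}")
```

The critic step is the Q-learning piece (it also fixes the value-function offset the policy gradient alone would miss), while the actor step moves the policy gain directly along the estimated policy gradient; for these assumed constants the learned gain should settle near the discounted LQR optimum and yield a stable closed loop.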