Deriving the HJB equation
23 Nov 2017

Russ Tedrake mentions the Hamilton-Jacobi-Bellman (HJB) equation in his course on Underactuated Robotics, referring the reader to Dynamic Programming and Optimal Control by Dimitri Bertsekas for a nice intuitive derivation that starts from a discrete version of Bellman's optimality principle and obtains the HJB equation in the limit. Here, instead, the HJB equation is derived directly from the additivity of the long-term cost function, since Bellman's optimality principle is itself a consequence of the additive cost structure.
Preliminaries
Consider a trajectory $x$ developing according to a differential equation
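$$
\dot{x}(t) = f\bigl(x(t), u(t)\bigr),
$$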
driven by a controller $u$. Performance of a controller $u$ on a trajectory starting from an initial state $x(0)=x_{0}$ is measured by a cumulative cost
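$$
J(x_0, u) = \int_0^{\infty} g\bigl(x(t), u(t)\bigr)\, dt,
$$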
defined as an integral of an immediate cost $g(x, u)$, here taken over an infinite horizon. A controller $u$ that minimizes the cumulative cost $J$ is called an optimal controller. It is denoted by
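$$
u^{*} = \arg\min_{u}\, J(x_0, u).
$$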
Note that the optimal controller $u^{*}$ is parameterized by $x_0$. The corresponding optimal cumulative cost
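$$
J^{*}(x_0) = J(x_0, u^{*}) = \min_{u}\, J(x_0, u),
$$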
viewed as a function of the initial state $x_0$ is known as the value function.
Derivation
Remarkably, $J^{*}$ considered as a function of $x_0$ turns out to satisfy a certain differential equation. Our goal in this section is to derive that equation.
For any admissible $u$, the integral in the definition of the cumulative cost can be split as follows
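$$
J(x_0, u) = \int_0^{\Delta t} g\bigl(x(t), u(t)\bigr)\, dt + \int_{\Delta t}^{\infty} g\bigl(x(t), u(t)\bigr)\, dt \approx g(x_0, u_0)\, \Delta t + J\bigl(x(\Delta t), u\bigr),
$$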
where $u_0 = u(0)$. Approximating the next state from the differential equation as
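$$
x(\Delta t) \approx x_0 + f(x_0, u_0)\, \Delta t,
$$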
we can expand $J$ to first order in $\Delta t$,
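$$
J\bigl(x(\Delta t), u\bigr) \approx J(x_0, u) + \frac{\partial J(x_0, u)}{\partial x_0}\, \bigl(x(\Delta t) - x_0\bigr) \approx J(x_0, u) + \frac{\partial J(x_0, u)}{\partial x_0}\, f(x_0, u_0)\, \Delta t,
$$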
provided that $J$ is sufficiently smooth. Substituting it into the expression above, we obtain
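$$
J(x_0, u) \approx g(x_0, u_0)\, \Delta t + J(x_0, u) + \frac{\partial J(x_0, u)}{\partial x_0}\, f(x_0, u_0)\, \Delta t,
$$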
which, after simplification and taking the limit $\Delta t \to 0$, gives
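$$
0 = g(x_0, u_0) + \frac{\partial J(x_0, u)}{\partial x_0}\, f(x_0, u_0).
$$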
This equality holds for any admissible controller $u$. In particular, for the optimal controller $u^*$, we obtain
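$$
0 = g(x_0, u_0^{*}) + \frac{\partial J^{*}(x_0)}{\partial x_0}\, f(x_0, u_0^{*}).
$$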
This equality is true for any initial state $x_0$ and any optimal initial action $u_0^{*} = u^{*}(0, x_0)$. Moreover, applying an arbitrary initial action $u_0$ for a short time and acting optimally afterwards can only increase the cost, so the same expression is nonnegative for every $u_0$ and vanishes precisely at an optimal action. We can therefore rewrite the equality as the HJB equation
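$$
0 = \min_{u_0}\left[ g(x_0, u_0) + \frac{\partial J^{*}(x_0)}{\partial x_0}\, f(x_0, u_0) \right].
$$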
Note that minimization here is performed over a single action $u_0$, in contrast to the problem formulation we started with, where minimization is performed over functions $u$. As a consequence, the optimal controller $u^{*}$ is time-invariant, $u^{*}(t, x_0) = u^{*}(x_0)$, satisfying
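$$
u^{*}(x_0) = \arg\min_{u_0}\left[ g(x_0, u_0) + \frac{\partial J^{*}(x_0)}{\partial x_0}\, f(x_0, u_0) \right].
$$

As a quick sanity check, consider a scalar linear-quadratic example with $f(x, u) = a x + b u$ and $g(x, u) = q x^2 + r u^2$, for which the value function is known to be quadratic, $J^{*}(x_0) = p\, x_0^2$, with $p$ solving a scalar algebraic Riccati equation. The minimal sketch below (the parameters $a$, $b$, $q$, $r$ are illustrative, not taken from the sources above) verifies numerically that the minimum in the HJB equation is indeed zero and is attained by the linear feedback $u^{*}(x_0) = -\tfrac{b p}{r}\, x_0$:

```python
import numpy as np

# Scalar linear-quadratic example (illustrative parameters):
#   dynamics        f(x, u) = a*x + b*u
#   immediate cost  g(x, u) = q*x**2 + r*u**2
a, b, q, r = -0.5, 1.0, 2.0, 1.0

# Quadratic value function J*(x0) = p*x0**2, where p > 0 solves the
# scalar algebraic Riccati equation  b**2 * p**2 / r - 2*a*p - q = 0.
p = (r / b**2) * (a + np.sqrt(a**2 + b**2 * q / r))

def hjb_bracket(x0, u0):
    """g(x0, u0) + dJ*/dx0 * f(x0, u0): the quantity minimized in the HJB equation."""
    return q * x0**2 + r * u0**2 + 2 * p * x0 * (a * x0 + b * u0)

for x0 in [-2.0, 0.5, 3.0]:
    u_grid = np.linspace(-10.0, 10.0, 200001)   # brute-force scan over initial actions
    vals = hjb_bracket(x0, u_grid)
    u_best = u_grid[np.argmin(vals)]
    print(f"x0 = {x0:+.2f}: min = {vals.min():.2e}, "
          f"argmin = {u_best:+.4f}, analytic feedback = {-b * p / r * x0:+.4f}")
```

The printed minimum is numerically zero at every tested state, and the minimizing action coincides with the analytic feedback $u^{*}(x_0) = -\tfrac{b p}{r}\, x_0$, as the HJB equation predicts.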
For further discussion of dynamic programming, refer to the sources mentioned above.