Deriving the HJB equation

23 Nov 2017

Russ Tedrake mentions the Hamilton-Jacobi-Bellman equation in the course on Underactuated Robotics, forwarding the reader to Dynamic Programming and Optimal Control by Dimitri Bertsekas for a nice intuitive derivation, that starts from a discrete version of Bellman’s optimality principle yielding the HJB equation in a limit. Here, instead, the HJB equation is derived directly from the additivity of the long-term cost function, for Bellman’s optimality principle itself is a consequence of the additive cost structure.

Preliminaries

Consider a trajectory $x$ developing according to a differential equation

$\dot{x}(t) = f(x(t), u(t))$

driven by a controller $u$. Performance of a controller $u$ on a trajectory starting from an initial state $x(0)=x_{0}$ is measured by a cumulative cost

$J(x_0, u) = \int_0^\infty g(x(t), u(t)) \, dt$

defined as an integral of an immediate cost $g(x, u)$. A controller $u$ that minimizes the cumulative cost $J$ is called an optimal controller. It is denoted by

$u^*(t,x_0) = \arg \min_u J(x_0, u).$

Note that the optimal controller $u^{*}$ is parameterized by $x_0$. The corresponding optimal cumulative cost

$J^*(x_0) = \min_u J(x_0, u)$

viewed as a function of the initial state $x_0$ is known as the value function.

Derivation

Remarkably, $J^{*}$ considered as a function of $x_0$ turns out to satisfy a certain differential equation. Our goal in this section is to derive that equation.

For any admissible $u$, the integral in the definition of the cumulative cost can be split as follows

$J(x_{0},u)=\int_{0}^{\Delta t}g(x,u)\,dt +\int_{\Delta t}^{\infty}g(x,u)\,dt\approx g(x_0,u_0)\Delta t+J(x(\Delta t),u)$

where $u_0 = u(0)$. Approximating the next state from the differential equation as

$x(\Delta t)\approx x_{0}+f(x_0,u_0)\Delta t,$

we are able to develop $J$ into a series

$J(x(\Delta t),u)\approx J(x_0,u)+J_{x}(x_{0},u)\cdot f(x_{0},u_0)\Delta t,$

provided that $J$ is sufficiently smooth. Substituting it into the expression above, we obtain

$J(x_{0},u)\approx g(x_0,u_0)\Delta t+J(x_0,u)+J_{x}(x_0,u)\cdot f(x_0,u_0)\Delta t,$

which after simplification, in the limit $\Delta t \to 0$ gives

$g(x_0,u_0)+J_{x}(x_0,u)\cdot f(x_0,u_0)=0.$

This equality holds for any admissible controller $u$. In particular, for the optimal controller $u^*$, we obtain

$g(x_0,u_0^*)+J_x^*(x_0)\cdot f(x_0,u_0^*)=0.$

Since this equality is true for any initial state $x_0$ and any optimal initial action $u_0^{*} = u^{*}(0, x_0)$, we can rewrite it as

Hamilton-Jacobi-Bellman equation $$ \min_{u_0} \left[ g(x_0, u_0) + J^*_x(x_0) \cdot f(x_0, u_0) \right] = 0, \quad\forall x_0. $$

Note that minimization here is performed over a single action $u_0$, in contrast to the problem formulation we started with, where minimization is performed over functions $u$. As a consequence, the optimal controller $u^{*}$ is time-invariant, $u^{*}(t, x_0) = u^{*}(x_0)$, satisfying

$$ u^*(x_0) = \arg \min_{u_0} \left[ g(x_0, u_0) + J^*_x(x_0) \cdot f(x_0, u_0) \right], \quad\forall x_0. $$

For further discussion of dynamic programming, turn to the sources.

Boris Belousov

Deriving the HJB equation

Preliminaries

Derivation