Entropic proximal policy search

We have seen that the policy search problem in ergodic MDPs can be formulated as a linear program. In this note, we extend that basic formulation with an entropic penalty in the primal objective, following the general procedure described by Marc Teboulle in Entropic Proximal Mappings with Applications to Nonlinear Programming. This problem formulation generalizes the natural policy gradient, relative entropy policy search, and trust region policy optimization formulations. Check out f-Divergence constrained policy improvement for a more detailed exposition of the ideas presented in this note.

Policy iteration

Let $q(s,a)$ be a state-action distribution induced by a policy at a previous iteration of a generalized policy iteration algorithm that we describe in the following and which we call entropic proximal policy search (EPPS). We consider a single policy iteration step consisting of a policy evaluation and a policy improvement substeps. Using the notation from the ergodic policy search post, one iteration of the EPPS algorithm involves solving the following problem \begin{equation} \begin{aligned} \max_x & \quad J_\eta(x) \triangleq r^T x - \eta q^T f\left( \frac{x}{q} \right) \\ \text{s.t.} & \quad C^T x = P^T x, \\ & \quad \mathbf{1}^T x = 1, \\ & \quad x \geq 0, \end{aligned} \label{primal_epps} \end{equation} where $\eta > 0$ is an open parameter fixed within one iteration, $f$ is a convex function on $(0, \infty)$ satisfying $f(1) = 0$, and division in the argument of $f$ is understood elementwise. The penalty that we use here is known as $f$-divergence \begin{equation*} D_f(x \| q) \triangleq q^T f\left( \frac{x}{q} \right). \end{equation*} Following Teboulle, entropic penalties also include Bregman divergences, but we do not consider them in this note.

Policy improvement

The Lagrangian of Problem \eqref{primal_epps} \begin{equation} L = r^T x - \eta q^T f\left( \frac{x}{q} \right) - v^T (C^T - P^T) x - \lambda(\mathbf{1}^T x - 1) + \kappa^T x \label{Lagrangian_epps} \end{equation} is invariant with respect to translations of $v$ \begin{equation} v = v_0 + \alpha \mathbf{1}, \; \forall \alpha \in \mathbb{R}, \label{translation} \end{equation} similar to the unpenalized case; this invariance is irrelevant for policy improvement, therefore we postpone the discussion of it till later. Equating the derivative of \eqref{Lagrangian_epps} with respect to $x$ to zero, \begin{equation*} \frac{\partial L}{\partial x} = r - \eta f^\prime \left( \frac{x}{q} \right) - (C - P)v - \lambda\mathbf{1} + \kappa = 0, \end{equation*} we obtain the optimal $x^\star$ as a function of the dual variables \begin{equation} x^\star = q f_*^\prime \left( \frac{r - (C-P)v - \lambda\mathbf{1} + \kappa}{\eta} \right), \label{policy_improvement} \end{equation} where $f_*$ is the Legendre transform of $f$ and $f_*^\prime$ is its derivative. Thus, once we have the optimal values of the dual variables, we can compute the new state-action distribution $x^\star$ using Formula \eqref{policy_improvement}.

Policy evaluation

Before performing the policy improvement substep \eqref{policy_improvement}, we need to perform the policy evaluation substep, which consists in finding a solution of the dual problem \begin{equation} \DeclareMathOperator{\range}{range} \begin{aligned} \min_{v,\lambda,\kappa} & \quad g_\eta(v, \lambda, \kappa) \triangleq q^T f_* \left( \frac{r - (C-P)v - \lambda\mathbf{1} + \kappa}{\eta} \right) + \lambda \\ \text{s.t.} & \quad \arg{f^*} \in \range_{x\,\geq\,0}{f^\prime(x)}, \\ & \quad \kappa \succeq 0, \end{aligned} \label{dual_epps} \end{equation} that we obtain upon substitution of $x^\star$ into the Lagrangian \eqref{Lagrangian_epps} and application of Fenchel’s inequality (at optimum holding with equality). The constraint on the argument of $f_*$ in \eqref{dual_epps} is linear because derivative of a convex function is monotone (see Boyd & Vandenberghe), and the whole problem \eqref{dual_epps} is convex.

As pointed out above, the dual variable $v$ is not uniquely defined, but it doesn’t matter because all solutions from the linear subspace \eqref{translation} are equally good. If we want to interpret $v^\star$ as a value function, we can impose a constraint \begin{equation*} (x^\star)^T r = (x^\star)^T C v \end{equation*} in the dual, to ensure that $J^\star \triangleq r^T x^\star = (\mu^\star)^T v^\star$.

Introducing discounting as discussed in the post on discounting in ergodic MDPs does not make $v^\star$ unique but may help in regularizing an MDP if there are absorbing states or other pathologies breaking ergodicity.