Ergodic policy search as a linear program
30 Dec 2017

It is well known that policy search in ergodic Markov decision processes can be formulated as a linear program (see Puterman, 1994, Section 8.8). The primal and dual formulations are summarized here for future reference.
Policy search problem
In an ergodic MDP governed by stochastic dynamics $p(s'|s,a)$, a policy $\pi(a|s)$ induces a stationary state distribution $\mu(s)$ defined recursively as \begin{equation} \label{stationarity} \mu(s') = \sum_{s,a} \mu(s) \pi(a|s) p(s'|s,a). \end{equation} The policy $\pi(a|s)$ and the state distribution $\mu(s)$ together define a stationary state-action distribution $\rho(s,a) = \mu(s) \pi(a|s)$.
The policy search problem consists in finding a policy $\pi(a|s)$ that maximizes the expectation of the reward $r(s,a)$ \begin{equation} \label{objective} J(\pi) = \sum_{s,a} \mu(s) \pi(a|s) r(s,a), \end{equation} provided that $\mu(s)$ and $\pi(a|s)$ are valid probability distributions satisfying the stationarity condition \eqref{stationarity}.
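As a concrete illustration, below is a minimal numerical sketch; the random MDP, its sizes, and all variable names are made up for illustration and are not part of the original text. It fixes a policy, computes the induced stationary distribution $\mu$ from \eqref{stationarity} by power iteration, and evaluates the objective \eqref{objective}:

```python
import numpy as np

# Illustrative random MDP: |S| = 3 states, |A| = 2 actions (all made up).
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# p[s, a, s'] = p(s'|s,a): for each (s, a), the row over s' sums to one.
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)

# r[s, a] = r(s, a), an arbitrary reward table.
r = rng.random((n_states, n_actions))

# A fixed stochastic policy pi[s, a] = pi(a|s).
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) p(s'|s,a).
P_pi = np.einsum("sa,sab->sb", pi, p)

# Stationary distribution mu from the fixed point mu(s') = sum_s mu(s) P_pi(s, s').
mu = np.full(n_states, 1.0 / n_states)
for _ in range(1000):
    mu = mu @ P_pi
mu /= mu.sum()

# Expected reward J(pi) = sum_{s,a} mu(s) pi(a|s) r(s,a).
J = np.einsum("s,sa,sa->", mu, pi, r)
print("stationary mu:", mu, " J(pi) =", J)
```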
Linear program
Since recovering $\mu(s)$ from $\rho(s,a)$ is a linear operation and both the objective \eqref{objective} and the constraint \eqref{stationarity} are linear in $\rho(s,a)$, the complete problem can be compactly written as a linear program in $\rho(s,a)$ \begin{equation} \begin{aligned} \max_\rho & \quad \sum_{s,a} \rho(s,a) r(s,a) \newline \text{s.t.} & \quad \sum_{a'} \rho(s',a') = \sum_{s,a} \rho(s,a) p(s'|s,a), \; \forall s', \newline & \quad \sum_{s,a} \rho(s,a) = 1, \newline & \quad \rho(s,a) \geq 0, \; \forall s, a. \end{aligned} \label{policy_search} \end{equation} Policy $\pi(a|s)$ can be recovered from $\rho(s,a)$ using Bayes' rule \begin{equation*} \pi(a|s) = \frac{\rho(s,a)}{\sum_b \rho(s,b)}. \end{equation*} In the following, the primal and the corresponding dual problems are considered in more detail.
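Problem \eqref{policy_search} can be handed to any off-the-shelf LP solver. The sketch below reuses the illustrative random MDP from above and solves the program with `scipy.optimize.linprog`; the solver choice and the flattening $(s,a) \mapsto s|A| + a$ are assumptions made here for illustration, not prescribed by the original text:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative random MDP (same construction as the previous sketch).
rng = np.random.default_rng(0)
nS, nA = 3, 2
p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)          # p[s, a, s'] = p(s'|s,a)
r = rng.random((nS, nA))                    # r[s, a]

# Decision variable x = rho flattened with the assumed index (s, a) -> s * nA + a.
# Equality constraints: one stationarity row per s', plus normalization.
A_eq = np.zeros((nS + 1, nS * nA))
b_eq = np.zeros(nS + 1)
for s_next in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[s_next, s * nA + a] -= p[s, a, s_next]   # - sum_{s,a} rho(s,a) p(s'|s,a)
    for a in range(nA):
        A_eq[s_next, s_next * nA + a] += 1.0              # + sum_{a'} rho(s',a')
A_eq[nS, :] = 1.0                                         # sum_{s,a} rho(s,a) = 1
b_eq[nS] = 1.0

# linprog minimizes, so negate the reward; default bounds already enforce rho >= 0.
res = linprog(c=-r.flatten(), A_eq=A_eq, b_eq=b_eq)
rho = res.x.reshape(nS, nA)

# Recover pi(a|s) = rho(s,a) / sum_b rho(s,b).
pi_star = rho / rho.sum(axis=1, keepdims=True)
print("optimal average reward:", -res.fun)
print("optimal policy:\n", pi_star)
```

Since an LP attains its optimum at a vertex of the feasible polytope, the recovered policy is typically deterministic.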
Primal
Treating $(s,a) \in S \times A$ as indices, we can view $r$ as a vector in $\mathbb{R}^{|S| \times |A|}$. Similarly, we introduce a vector $x$ in place of $\rho$ (to adhere to the convention of calling the unknown $x$), a matrix $P$ with indices $((s,a), s')$ in place of $p(s'|s,a)$, and a matrix $C$ of the same shape as $P$ that performs summation over $a$ on the left-hand side of the stationarity constraint (more precisely, its transpose performs the summation). With these definitions, we can rewrite Problem \eqref{policy_search} in the standard vector-matrix form \begin{equation} \begin{aligned} \max_x & \quad J(x) = r^T x \newline \text{s.t.} & \quad C^T x = P^T x, \newline & \quad \mathbf{1}^T x = 1, \newline & \quad x \geq 0. \end{aligned} \label{primal} \end{equation} Additional insight can be gained by considering the dual problem.
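For concreteness, here is a short sketch of how $P$ and $C$ can be assembled for the illustrative MDP used above, under the assumed flattening $(s,a) \mapsto s|A| + a$; both matrices have rows summing to one, which is the fact used in the Dual section below:

```python
import numpy as np

# Illustrative random MDP (same construction as the previous sketches).
rng = np.random.default_rng(0)
nS, nA = 3, 2
p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)

# P[(s, a), s'] = p(s'|s, a): flatten the (s, a) axes in C order.
P = p.reshape(nS * nA, nS)

# C[(s, a), s'] = 1 if s == s', so that (C^T x)[s'] = sum_a rho(s', a).
C = np.kron(np.eye(nS), np.ones((nA, 1)))

# Rows of both matrices sum to one: P @ 1 = 1 and C @ 1 = 1.
assert np.allclose(P @ np.ones(nS), 1.0)
assert np.allclose(C @ np.ones(nS), 1.0)

# Any feasible x = rho.flatten() then satisfies the constraint C^T x = P^T x.
```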
Dual
The Lagrangian of Problem \eqref{primal} is given by \begin{equation} \label{Lagrangian} L = r^T x - v^T (C^T - P^T) x - \lambda(\mathbf{1}^T x - 1) + \kappa^T x, \end{equation} and the corresponding dual problem reads \begin{equation} \begin{aligned} \min_{v,\lambda,\kappa} & \quad g(v, \lambda, \kappa) = \lambda \newline \text{s.t.} & \quad (C-P)v = (r - \lambda\mathbf{1}) + \kappa, \newline & \quad \kappa \succeq 0. \end{aligned} \label{dual} \end{equation} There are several noteworthy observations one can make regarding this problem:
- The optimal dual variable $\lambda^\star$ equals the optimal expected reward, $\lambda^\star = r^T x^\star$, which follows from strong duality, $J^\star = g^\star$ (a numerical check is sketched after this list).
- The dual variable $v \in \mathbb{R}^{|S|}$ is not uniquely defined. Indeed, the Lagrangian \eqref{Lagrangian} is invariant with respect to translations $v \mapsto v + \alpha \mathbf{1}$ for any $\alpha \in \mathbb{R}$, which follows from the fact that $P$ is a stochastic matrix (hence, $P\mathbf{1}=\mathbf{1}$) and $C$ is defined such that $C\mathbf{1}=\mathbf{1}$. One can use this degree of freedom to impose an additional condition on $v$; for example, by requiring $(\mu^\star)^T v^\star = J^\star$, one can endow $v^\star$ with the properties of the value function. Other choices are possible: for example, $(\mu^\star)^T v^\star = 0$ is also valid; however, in that case $v$ can no longer be interpreted as the value function corresponding to $\pi^\star$.
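Both observations can be checked numerically on the illustrative MDP from above, again with `scipy.optimize.linprog` (an assumption made here; any LP solver would do): the primal and dual optima coincide, and shifting $v$ by a constant leaves the dual constraints and the objective unchanged.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative random MDP (same construction as the previous sketches).
rng = np.random.default_rng(0)
nS, nA = 3, 2
p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

P = p.reshape(nS * nA, nS)                       # P[(s,a), s'] = p(s'|s,a)
C = np.kron(np.eye(nS), np.ones((nA, 1)))        # C[(s,a), s'] = [s == s']

# Primal: max r^T x  s.t.  (C - P)^T x = 0,  1^T x = 1,  x >= 0.
A_eq = np.vstack([(C - P).T, np.ones((1, nS * nA))])
b_eq = np.concatenate([np.zeros(nS), [1.0]])
primal = linprog(c=-r.flatten(), A_eq=A_eq, b_eq=b_eq)

# Dual: min lambda  s.t.  (C - P) v + lambda * 1 >= r, with v and lambda free
# (kappa >= 0 appears as the slack of the inequality).
c = np.zeros(nS + 1)
c[-1] = 1.0                                      # objective: lambda
A_ub = -np.hstack([C - P, np.ones((nS * nA, 1))])
b_ub = -r.flatten()
dual = linprog(c=c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (nS + 1))

v_star, lam_star = dual.x[:-1], dual.x[-1]
print("primal optimum r^T x*:", -primal.fun)
print("dual optimum lambda* :", lam_star)        # equal by strong duality

# Translation invariance: (C - P) @ 1 = 0, so v* + alpha * 1 stays feasible
# with the same lambda*.
assert np.allclose((C - P) @ np.ones(nS), 0.0)
alpha = 3.7
shifted = np.concatenate([v_star + alpha, [lam_star]])
assert np.all(A_ub @ shifted <= b_ub + 1e-8)
```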