Discounting in ergodic MDPs

A way to introduce discounting in a given ergodic MDP $p(s’|s,a)$, as suggested by Sutton & Barto and Levine, is to add a transient ‘dead’ state into which an agent transitions with a fixed small probability $1 - \gamma$ at every time step and then gets reset according to an initial state distribution $\mu_0$. Such a modified MDP is then governed by the dynamics \begin{equation} \tilde{p}(s’|s,a) = (1-\gamma)\mu_0(s’) + \gamma p(s’|s,a). \label{dyn_mod} \end{equation} There is a number of important questions to ask regarding the modified MDP:

  1. How is the modified MDP related to the original one? More precisely, having solved the modified MDP, what can we say about the original MDP?
  2. Does $\gamma$ makes the dual variable $v$ corresponding to the value function unique?
  3. What is the interpretation of $\lambda^\star$ in the modified MDP? Does it still represent the expected return?

We will try to answer these questions in the following.

LP formulation

Using the notation from the post on ergodic policy search, we can write the primal problem for the modified MDP as \begin{equation} \begin{aligned} \max_x & \quad J(x) = r^T x \\ \text{s.t.} & \quad C^T x = (1-\gamma)\mu_0 + \gamma P^T x, \\ & \quad \mathbf{1}^T x = 1, \\ & \quad x \geq 0, \end{aligned} \label{primal_mod} \end{equation} with the corresponding Lagrangian \begin{equation} \label{Lagrangian_mod} L = r^T x - v^T (C^T - \gamma P^T) x + (1-\gamma)v^T\mu_0 - \lambda(\mathbf{1}^T x - 1) + \kappa^T x \end{equation} and the dual \begin{equation*} \begin{aligned} \min_{v,\lambda,\kappa} & \quad g(v, \lambda, \kappa) = \lambda + (1-\gamma)v^T\mu_0 \\ \text{s.t.} & \quad (C - \gamma P)v = (r - \lambda\mathbf{1}) + \kappa, \\ & \quad \kappa \succeq 0. \end{aligned} \label{dual_mod} \end{equation*} Contrary to the discounted infinite horizon formulation, introducing $\gamma$ in the average reward problem does not make $v$ uniquely defined. Moreover, one looses the uniqueness of $\lambda$ in addition.

Non-uniqueness of $v$ and $\lambda$

To convince yourself that $v$ and $\lambda$ are not uniquely defined, observe that the Lagrangian \eqref{Lagrangian_mod} is invariant with respect to a simultaneous transformation of $v$ and $\lambda$ \begin{equation} \begin{aligned} v &= v_0 + \alpha \mathbf{1}, \\ \lambda &= \lambda_0 - \alpha(1-\gamma) \end{aligned} \label{translation_mod} \end{equation} for any $\alpha \in \mathbb{R}$. Indeed, substituting \eqref{translation_mod} into \eqref{Lagrangian_mod}, we obtain \begin{align*} L &= r^T x - (v_0 + \alpha \mathbf{1})^T (C^T - \gamma P^T) x + (1-\gamma)(v_0 + \alpha \mathbf{1})^T\mu_0 \\ & - (\lambda_0 - \alpha(1-\gamma))(\mathbf{1}^T x - 1) + \kappa^T x, \end{align*} which, after opening parentheses, confirms the invariance \begin{equation*} L = L_0 - \alpha(1-\gamma)\mathbf{1}^T x + \alpha(1-\gamma)\mathbf{1}^T \mu_0 + \alpha(1-\gamma)(\mathbf{1}^T x - 1) = L_0. \end{equation*} In contrast to the original MDP, where $v$ and $\lambda$ were independent, in the modified MDP they are coupled. In both cases, however, there is only one degree of freedom in the problem, which in the original MDP corresponds to translations of $v$ and in the modified MDP is shared between $v$ and $\lambda$ according to \eqref{translation_mod}.

In the original MDP, $\lambda^\star$ could be identified with the expected return and $v^\star$—upon imposing an additional constraint—with the value function. In the modified MDP, neither $\lambda^\star$ nor $v^\star$ automatically carry a special connotation; however, we can impose one scalar constraint, as before, although now we have to choose whether to endow $\lambda^\star$ or $v^\star$ with a convenient semantics.

Interpretation of $\lambda^\star$

The optimal dual variable $\lambda^\star$ always satisfies the condition \begin{equation} \lambda^\star = r^T x^\star - (1-\gamma)\mu_0^T v^\star \label{lambda} \end{equation} that follows from strong duality $L^\star = g^\star$. If we want to keep the interpretation of $\lambda^\star$ as the expected reward, we need to zero the second term in \eqref{lambda} \begin{equation*} \mu_0^T v^\star \triangleq 0. \end{equation*} Given a solution $(v^\star_0, \lambda^\star_0)$, check that the solution $(v^\star, \lambda^\star)$ obtained using the transformation \eqref{translation_mod} with $\alpha$ given by \begin{equation*} \alpha = -\mu_0^T v_0^\star \end{equation*} leads to the required condition on $\lambda^\star$ \begin{equation*} \lambda^\star = r^T x^\star. \end{equation*} The optimal dual variable $v^\star$ doesn’t have a nice interpretation of the value function in this case because $(\mu^\star)^T v^\star \neq r^T x^\star$.

Interpretation of $v^\star$

If we want to view $v^\star$ as the value function, its expectation should equal the expected return \begin{equation} r^T x^\star \triangleq (\mu^\star)^T v^\star. \label{value_function} \end{equation} Substituting $v^\star = v_0^\star + \alpha\mathbf{1}$ instead of $v^\star$, we obtain \begin{equation*} r^T x^\star = (\mu^\star)^T (v_0^\star + \alpha\mathbf{1}) = (\mu^\star)^T v_0^\star + \alpha. \end{equation*} Expressing $r^T x^\star$ through $(v^\star_0, \lambda^\star_0)$ from \eqref{lambda}, we arrive at the condition on $\alpha$ \begin{equation} \alpha = \lambda^\star_0 + \left( (1-\gamma)\mu_0 - \mu^\star \right)^T v^\star_0. \label{alpha} \end{equation} Thus, once a solution $(v^\star_0, \lambda^\star_0)$ has been found, we can compute $\alpha$ according to \eqref{alpha} and obtain a new solution $(v^\star, \lambda^\star)$ using \eqref{translation_mod} that satisfies condition \eqref{value_function}. The dual variable $v^\star$ can then be identified with the value function, whereas $\lambda^\star$, in contrast, looses its nice interpretation as the expected return.

To sum up, $v$ and $\lambda$ are defined up to a transformation \eqref{translation_mod}, therefore their optimal values do not carry any particular meaning unless we impose an additional constraint on them. Importantly, we cannot simultaneously interpret $v^\star$ as the value function and $\lambda^\star$ as the expected return, contrary to the original case $\gamma = 1$. Note also that $x^\star$ obtained for the modified MDP is different from $x^\star$ for the original MDP; therefore, the expected return of the policy that is optimal for the modified MDP is different from the expected return of the same policy in the original MDP.

Explicit vs implicit constraint

A careful reader should have noticed that the constraint we used in the primal problem \eqref{primal_mod} does not exactly correspond to the modified MDP dynamics \eqref{dyn_mod}. Namely, the dynamics \eqref{dyn_mod} in matrix form should be written as \begin{equation*} \tilde{P}^T = (1-\gamma) \mu_0 \mathbf{1}^T + \gamma P^T, \end{equation*} and, therefore, the constraint should take the form \begin{equation} C^T x = (1-\gamma) \mu_0 \mathbf{1}^T x + \gamma P^T x = \tilde{P} x. \label{dyn_con_impl} \end{equation} We call this constraint implicit (because $\gamma$ is implicit here) to discern it from the explicit constraint in \eqref{primal_mod}. Problem \eqref{primal_mod} with the implicit constraint is equivalent to the original problem with $P$ replaced by $\tilde{P}$. The corresponding dual variables $\tilde{v}$ and $\tilde{\lambda}$ have the same interpretations as for the original problem but relative to the MDP with a modified dynamics. It means, in particular, that $\tilde{\lambda}$ equals the expected reward in the modified MDP and $\tilde{v}$ is shift-invariant. It is straightforward to check by directly comparing Lagrangians of the implicit and explicit formulations that $\tilde{v}^\star = v^\star$ and $\tilde{\lambda}^\star = \lambda^\star + (1-\gamma) \mu_0^T v^\star$; hence, the dual variables are related by a shift of $\lambda$. Therefore, both formulations are indistinguishable as far as the value function and the primal variables are concerned; only if one cares about the interpretation of $\lambda$ are they mildly different.

Effect of $\gamma$ on optimum

Is the optimal policy for the modified MDP still optimal in the original MDP? It is easy to convince yourself that the answer is negative: the optimal policy for the modified MDP can be wildly different from the optimal policy for the original problem. For example, consider the classical chain problem: if the reset probability $1-\gamma$ is so high that the probability of traversing the whole chain is very small, the optimal policy will dictate to never leave the initial state, although in the original problem the optimal policy is to always try to reach the final state. Not all is lost, though. At least for $\gamma$ close to $1$, the solution of the modified problem should be close to the solution of the original problem. It is apparent that the stationary state distribution $\mu$ continuously depends on $\gamma$ and $\mu_0$, nevertheless it is hard to make any quantitative statement about how much $x^\star$ or $v^\star$ change when $\gamma$ or $\mu_0$ are varied.

Conclusion

Adding random resets with probability $1-\gamma$ has a regularizing effect and may turn non-ergodic MDPs into ergodic ones. However, it does not make the dual variable $v$ corresponding to the dynamics constraint, commonly interpreted as the value function, uniquely defined. Even with $\gamma$, there is still one degree of freedom available that can be used to impose a scalar linear condition on $v^\star$.

There are two ways to formulate the optimization problem corresponding to the modified MDP: with an explicit dynamics constraint (as in \eqref{primal_mod}) and with an implicit one \eqref{dyn_con_impl}. The explicit formulation has the advantage that it makes explicit the dependence of the dual variables on $\gamma$; however, it has the disadvantage that $\lambda^\star$ does not correspond to the expected return in this case. The implicit formulation makes analysis trivial because we simply replace original dynamics $P$ with a modified one $\tilde{P}$; on the other hand, it hides the dependence on $\gamma$ and $\mu_0$ inside $\tilde{P}$, making comparison to the original problem harder.

No matter whether implicit or explicit formulation is used, the primal optimal point in both cases is the same, and it is, generally speaking, different from $x^\star$ of the original problem. Nevertheless, for $\gamma \to 1$, we can expect that the solutions of the original and modified problems do not differ too much.