Discounted reward vs average reward reinforcement learning

29 Apr 2017

A slightly different form of the derivation presented here can be found in Sutton & Barto, Section 10.4.

Consider an ergodic MDP $\mathcal{M}=(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$ with state space $\mathcal{S}$, action space $\mathcal{A}$, stochastic transition dynamics $\mathcal{P}$, and reward function $r\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R}$. A policy $\pi$ is a conditional distribution over actions given states. The stationary state distribution $\mu_\pi$ under policy $\pi$ satisfies

$\int_{S'} d\mu_\pi(s') = \int_{S' \times \mathcal{S} \times \mathcal{A}} d\mathcal{P}(s' | s,a) \,d\mu_\pi(s) \,d\pi(a|s)$

for any $S' \subset \mathcal{S}$, which can be succinctly written as $\mathbb{E}_{\mu_\pi} = \mathbb{E}_{\pi, \mu_\pi, \mathcal{P}}$: averaging a function over $\mu_\pi$ gives the same result as drawing $s \sim \mu_\pi$, $a \sim \pi(\cdot|s)$, $s' \sim \mathcal{P}(\cdot|s,a)$ and evaluating the function at $s'$.

The $v$-function for policy $\pi$ is defined as $v_\pi(s) \triangleq \mathbb{E}_{\pi, \mathcal{P}} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s \right]$ for $\gamma \in (0, 1)$, where $r_t = r(s_t, a_t)$, and it satisfies the recursion $v_\pi(s) = \mathbb{E}_{\pi, \mathcal{P}} \left[ r(s,a) + \gamma v_\pi(s') \right]$ with $a \sim \pi(\cdot|s)$ and $s' \sim \mathcal{P}(\cdot|s,a)$. The $q$-function for policy $\pi$ is defined as $q_\pi(s,a) \triangleq \mathbb{E}_\mathcal{P} \left[ r(s,a) + \gamma v_\pi(s') \right]$. It follows directly that $v_\pi(s) = \mathbb{E}_\pi \left[ q_\pi(s,\cdot) \right]$.

The goal of a reinforcement learning agent is to find a policy $\pi$ that maximizes the expected value $J(\pi) \triangleq \mathbb{E}_{\mu_\pi} [v_\pi] = \mathbb{E}_{\pi, \mu_\pi, \mathcal{P}} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]$. Using the recursive definition of $v_\pi$,

$\begin{align*} \mathbb{E}_{\mu_\pi} [v_\pi] &= \mathbb{E}_{\pi, \mu_\pi, \mathcal{P}} [r + \gamma v_\pi] \\ &= \mathbb{E}_{\pi, \mu_\pi} [r] + \gamma \mathbb{E}_{\pi, \mu_\pi, \mathcal{P}} [v_\pi] \end{align*}$

and the stationarity condition $\mathbb{E}_{\mu_\pi} = \mathbb{E}_{\pi, \mu_\pi, \mathcal{P}}$, which turns the last term into $\gamma \mathbb{E}_{\mu_\pi} [v_\pi]$, we can solve for $J(\pi)$ and express it as an expectation of $r$ instead of $v_\pi$,

$\mathbb{E}_{\mu_\pi}[v_\pi] = \frac{1}{1-\gamma} \mathbb{E}_{\pi, \mu_\pi}[r].$

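Thus, for an ergodic MDP, the expected value of a policy is proportional to the expected reward under that policy. The proportionality constant $\frac{1}{1-\gamma}$ does not depend on $\pi$, so ranking policies by $J(\pi)$ is the same as ranking them by their average reward $\mathbb{E}_{\pi, \mu_\pi}[r]$.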
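The identity can be checked numerically on a small finite MDP. The sketch below is only an illustration under stated assumptions: the MDP is randomly generated with strictly positive transition probabilities (which makes it ergodic), the policy is random, and the helper names `random_distribution` and `check_identity` are hypothetical. It finds $\mu_\pi$ by power iteration, finds $v_\pi$ by iterating the recursion, and compares $\mathbb{E}_{\mu_\pi}[v_\pi]$ with $\frac{1}{1-\gamma}\mathbb{E}_{\pi,\mu_\pi}[r]$.

```python
import random


def random_distribution(n, rng):
    """Return n strictly positive numbers that sum to 1."""
    weights = [rng.random() + 0.1 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def check_identity(n_states=4, n_actions=3, gamma=0.9, seed=0, iters=5000):
    """Return (E_mu[v_pi], E_{pi,mu}[r] / (1 - gamma)) for a random MDP."""
    rng = random.Random(seed)
    S, A = range(n_states), range(n_actions)

    # Assumed toy MDP: P[s][a][s2] > 0 everywhere, so the chain is ergodic.
    P = [[random_distribution(n_states, rng) for _ in A] for _ in S]
    r = [[rng.uniform(-1.0, 1.0) for _ in A] for _ in S]
    pi = [random_distribution(n_actions, rng) for _ in S]  # pi[s][a]

    # Average the dynamics and the reward over the policy.
    P_pi = [[sum(pi[s][a] * P[s][a][s2] for a in A) for s2 in S] for s in S]
    r_pi = [sum(pi[s][a] * r[s][a] for a in A) for s in S]

    # Stationary distribution: iterate mu <- mu P_pi until it stops changing.
    mu = [1.0 / n_states] * n_states
    for _ in range(iters):
        mu = [sum(mu[s] * P_pi[s][s2] for s in S) for s2 in S]

    # Policy evaluation: iterate the recursion v <- r_pi + gamma * P_pi v.
    v = [0.0] * n_states
    for _ in range(iters):
        v = [r_pi[s] + gamma * sum(P_pi[s][s2] * v[s2] for s2 in S) for s in S]

    expected_value = sum(mu[s] * v[s] for s in S)
    scaled_reward = sum(mu[s] * r_pi[s] for s in S) / (1.0 - gamma)
    return expected_value, scaled_reward


# The two printed numbers should agree up to floating-point error.
print(check_identity())
```

Fixed-point iteration is used here instead of a linear solve only to keep the example dependency-free; with a linear algebra library both $\mu_\pi$ and $v_\pi$ could be obtained by solving the corresponding linear systems directly.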

Boris Belousov
