# Discounted reward vs average reward reinforcement learning

A slightly different form of this derivation can be found in Sutton & Barto, Section 10.4.

Consider an ergodic MDP $\mathcal{M}=(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$ with state space $\mathcal{S}$, action space $\mathcal{A}$, stochastic transition dynamics $\mathcal{P}$, and reward $r\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R}$. A policy $\pi$ is a conditional distribution over actions given states. The stationary state distribution $\mu_\pi$ under policy $\pi$ satisfies

$$\mu_\pi(S') = \int_{\mathcal{S}} \int_{\mathcal{A}} \mathcal{P}(S' \mid s, a) \, \pi(da \mid s) \, \mu_\pi(ds)$$

for any $S' \subset \mathcal{S}$, which can be succinctly written as $\mathbb{E}_{\mu_\pi} = \mathbb{E}_{\pi, \mu_\pi, \mathcal{P}}$: drawing a state from $\mu_\pi$ and taking one step under $\pi$ and $\mathcal{P}$ leaves the state distribution unchanged.

The $v$-function for policy $\pi$ is defined as $v_\pi(s) \triangleq \mathbb{E}_{\pi, \mathcal{P}} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s \right]$ for $\gamma \in (0, 1)$, where $r_t = r(s_t, a_t)$, and it satisfies the recursion $v_\pi(s) = \mathbb{E}_{\pi, \mathcal{P}} \left[ r(s,\cdot) + \gamma v_\pi \right]$. The $q$-function for policy $\pi$ is defined as $q_\pi(s,a) \triangleq \mathbb{E}_\mathcal{P} \left[ r(s,a) + \gamma v_\pi \right]$. It follows that $v_\pi(s) = \mathbb{E}_\pi \left[ q_\pi(s,\cdot) \right]$.

The goal of a reinforcement learning agent is to find a policy $\pi$ that maximizes the expected value $J(\pi) \triangleq \mathbb{E}_{\mu_\pi} [v_\pi] = \mathbb{E}_{\pi, \mu_\pi, \mathcal{P}} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]$. Using the recursive definition of $v_\pi$,

$$J(\pi) = \mathbb{E}_{\mu_\pi} [v_\pi] = \mathbb{E}_{\pi, \mu_\pi, \mathcal{P}} \left[ r + \gamma v_\pi \right] = \mathbb{E}_{\pi, \mu_\pi} [r] + \gamma \, \mathbb{E}_{\pi, \mu_\pi, \mathcal{P}} [v_\pi]$$

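and the stationarity condition $\mathbb{E}_{\mu_\pi} = \mathbb{E}_{\pi, \mu_\pi, \mathcal{P}}$, which gives $\mathbb{E}_{\pi, \mu_\pi, \mathcal{P}} [v_\pi] = \mathbb{E}_{\mu_\pi} [v_\pi] = J(\pi)$, we can express $J(\pi)$ as an expectation of $r$ instead of $v_\pi$,

$$J(\pi) = \mathbb{E}_{\pi, \mu_\pi} [r] + \gamma J(\pi) \quad\Longrightarrow\quad J(\pi) = \frac{1}{1-\gamma} \, \mathbb{E}_{\pi, \mu_\pi} [r]$$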

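Thus, for an ergodic MDP, the expected value of a policy is proportional to the expected one-step reward under that policy's stationary distribution, with proportionality constant $1/(1-\gamma)$. Since this constant is the same for every policy, the ordering of policies under $J$ does not depend on $\gamma$.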
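The proportionality can be checked numerically on a small example. The sketch below uses a made-up 3-state, 2-action MDP and a made-up policy (the tables `P`, `R`, `PI` and all function names are illustrative assumptions, not taken from Sutton & Barto). It approximates $\mu_\pi$ by power iteration and $v_\pi$ by iterating the recursion, then compares $\mathbb{E}_{\mu_\pi}[v_\pi]$ with $\mathbb{E}_{\pi, \mu_\pi}[r] / (1-\gamma)$.

```python
# Hypothetical 3-state, 2-action MDP; every number below is made up for illustration.
GAMMA = 0.9
N_S, N_A = 3, 2

# P[s][a][s2]: probability of moving from state s to state s2 under action a.
# All entries are positive, so the induced Markov chain is ergodic.
P = [
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [[0.3, 0.4, 0.3], [0.2, 0.2, 0.6]],
    [[0.5, 0.1, 0.4], [0.3, 0.3, 0.4]],
]
# R[s][a]: reward for taking action a in state s.
R = [[1.0, 0.0], [0.5, 2.0], [-1.0, 0.3]]
# PI[s][a]: probability that the policy picks action a in state s.
PI = [[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]


def policy_transition():
    """State-to-state transition matrix obtained by averaging P over the policy."""
    return [
        [sum(PI[s][a] * P[s][a][s2] for a in range(N_A)) for s2 in range(N_S)]
        for s in range(N_S)
    ]


def policy_reward():
    """Expected one-step reward in each state under the policy."""
    return [sum(PI[s][a] * R[s][a] for a in range(N_A)) for s in range(N_S)]


def stationary_distribution(p_pi, iters=2000):
    """Approximate mu_pi by repeatedly applying mu <- mu @ p_pi (power iteration)."""
    mu = [1.0 / N_S] * N_S
    for _ in range(iters):
        mu = [sum(mu[s] * p_pi[s][s2] for s in range(N_S)) for s2 in range(N_S)]
    return mu


def evaluate_policy(p_pi, r_pi, gamma, iters=2000):
    """Approximate v_pi by iterating the recursion v <- r_pi + gamma * p_pi @ v."""
    v = [0.0] * N_S
    for _ in range(iters):
        v = [
            r_pi[s] + gamma * sum(p_pi[s][s2] * v[s2] for s2 in range(N_S))
            for s in range(N_S)
        ]
    return v


def check_identity(gamma=GAMMA):
    """Return (E_mu[v_pi], E_mu[r] / (1 - gamma)); the two should agree."""
    p_pi = policy_transition()
    r_pi = policy_reward()
    mu = stationary_distribution(p_pi)
    v = evaluate_policy(p_pi, r_pi, gamma)
    expected_value = sum(mu[s] * v[s] for s in range(N_S))
    expected_reward = sum(mu[s] * r_pi[s] for s in range(N_S))
    return expected_value, expected_reward / (1.0 - gamma)


if __name__ == "__main__":
    lhs, rhs = check_identity()
    print(f"E_mu[v_pi]            = {lhs:.6f}")
    print(f"E_mu[r] / (1 - gamma) = {rhs:.6f}")
```

Calling `check_identity` with a different `gamma` changes both returned numbers by the same factor, which is the sense in which the discount rate only rescales the objective here. The fixed iteration counts are a simplification that works for this small, fast-mixing example; a linear solve would be the more usual choice for larger problems.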