Discounted reward vs average reward reinforcement learning

The derivation presented here can be found, in a slightly different form, in Section 10.4 of Sutton & Barto.

Consider an ergodic MDP $M = (S, A, P, r)$ with state space $S$, action space $A$, stochastic transition dynamics $P$, and reward $r : S \times A \to \mathbb{R}$. A policy $\pi$ is a conditional distribution over actions given states. The stationary state distribution $\mu_\pi$ under policy $\pi$ satisfies

$$\int_{S'} d\mu_\pi(s') = \int_{S' \times S \times A} dP(s'|s,a)\, d\mu_\pi(s)\, d\pi(a|s)$$

for any $S' \subseteq S$, which can be succinctly written as $E_{\mu_\pi} = E_{\pi,\mu_\pi,P}$. The v-function for policy $\pi$ is defined as $v_\pi(s) \triangleq E_{\pi,P}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$ with $s_0 = s$ and $\gamma \in (0,1)$, and it satisfies the recursion $v_\pi(s) = E_{\pi,P}[r(s,\cdot) + \gamma v_\pi]$. The q-function for policy $\pi$ is defined as $q_\pi(s,a) \triangleq E_P[r(s,a) + \gamma v_\pi]$. Clearly, $v_\pi(s) = E_\pi[q_\pi(s,\cdot)]$. The goal of a reinforcement learning agent is to find a policy $\pi$ that maximizes the expected value $J(\pi) \triangleq E_{\mu_\pi}[v_\pi] = E_{\pi,\mu_\pi,P}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. Using the recursive definition of $v_\pi$,

$$E_{\mu_\pi}[v_\pi] = E_{\pi,\mu_\pi,P}[r + \gamma v_\pi] = E_{\pi,\mu_\pi}[r] + \gamma E_{\pi,\mu_\pi,P}[v_\pi]$$

and the stationarity condition $E_{\mu_\pi} = E_{\pi,\mu_\pi,P}$, we can express $J(\pi)$ as an expectation of $r$ instead of $v_\pi$: substituting $E_{\pi,\mu_\pi,P}[v_\pi] = E_{\mu_\pi}[v_\pi]$ gives $E_{\mu_\pi}[v_\pi] = E_{\pi,\mu_\pi}[r] + \gamma E_{\mu_\pi}[v_\pi]$, and solving for $E_{\mu_\pi}[v_\pi]$ yields

$$E_{\mu_\pi}[v_\pi] = \frac{1}{1-\gamma} E_{\pi,\mu_\pi}[r].$$

Thus, for an ergodic MDP, the expected value of a policy is proportional, with factor $1/(1-\gamma)$, to the expected one-step reward under its stationary state distribution.
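As a sanity check, the identity can be verified numerically. The sketch below is not from the text: it builds a small randomly generated ergodic MDP in plain NumPy (the sizes, seed, and $\gamma = 0.9$ are arbitrary choices), computes $\mu_\pi$ as the left eigenvector of the state transition matrix under $\pi$ for eigenvalue 1, solves the Bellman linear system for $v_\pi$, and compares $E_{\mu_\pi}[v_\pi]$ with $\frac{1}{1-\gamma} E_{\pi,\mu_\pi}[r]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small random ergodic MDP (dense positive transitions => ergodic under any policy).
n_states, n_actions, gamma = 5, 3, 0.9

# P[a, s, t] = probability of s -> t under action a; r[s, a] = expected reward.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

# A fixed stochastic policy pi[s, a] = pi(a|s).
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

# State transition matrix under pi: P_pi[s, t] = sum_a pi(a|s) P(t|s, a).
P_pi = np.einsum("sa,ast->st", pi, P)

# Expected one-step reward under pi: r_pi[s] = sum_a pi(a|s) r(s, a).
r_pi = (pi * r).sum(axis=1)

# Stationary distribution: left eigenvector of P_pi with eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu /= mu.sum()

# v_pi solves the Bellman equation v = r_pi + gamma * P_pi v.
v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Check E_mu[v_pi] == E_{pi,mu}[r] / (1 - gamma), up to numerical error.
lhs = mu @ v
rhs = (mu @ r_pi) / (1 - gamma)
print(lhs, rhs)
assert np.isclose(lhs, rhs)
```

The check holds for any policy, not just an optimal one, which is exactly the point of the derivation: under the stationary distribution, maximizing the discounted value is equivalent to maximizing the expected one-step reward.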