Discounted reward vs average reward reinforcement learning
29 Apr 2017
The derivation presented here can be found in
Sutton & Barto, Chapter 10.4 in a slightly different form.
Consider an ergodic MDP
with state space $\mathcal{S}$, action space $\mathcal{A}$,
stochastic transition dynamics
$p(s' \mid s, a)$, and reward $r(s, a)$.
A policy $\pi(a \mid s)$ is a conditional distribution over actions given states.
The stationary state distribution $d^\pi$ under policy $\pi$ satisfies
$$d^\pi(s') = \sum_s d^\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a)$$
for any $s' \in \mathcal{S}$,
which can be succinctly written as
$d^\pi = P_\pi^\top d^\pi$, where $P_\pi(s, s') = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$ is the state transition matrix induced by $\pi$.
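To make the setup concrete, here is a minimal NumPy sketch (the MDP, its sizes, and all variable names are invented for illustration) that builds a small random ergodic MDP and recovers $d^\pi$ as the eigenvector of $P_\pi^\top$ with eigenvalue $1$.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (sizes and names are made up for illustration).
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# Transition tensor P[s, a, s'] = p(s' | s, a); rows normalized to sum to 1.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)

# Reward r(s, a) and a fixed stochastic policy pi(a | s).
R = rng.random((n_states, n_actions))
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=-1, keepdims=True)

# State transition matrix induced by the policy: P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a).
P_pi = np.einsum("sa,sap->sp", pi, P)

# Stationary distribution d_pi solves d_pi = P_pi^T d_pi: take the eigenvector
# of P_pi^T whose eigenvalue is 1 and normalize it to sum to 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d_pi /= d_pi.sum()
```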
The $V$-function for policy $\pi$ is defined as
$$V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s\right]$$
for $\gamma \in [0, 1)$,
and it satisfies the recursion
$$V^\pi(s) = \sum_a \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^\pi(s') \right].$$
The $Q$-function for policy $\pi$ is defined as
$$Q^\pi(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^\pi(s').$$
Evidently, $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$.
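Continuing the sketch above (and reusing `P`, `R`, `pi`, and `P_pi` from it), the recursion for $V^\pi$ can be solved exactly as a linear system, and the relation between $V^\pi$ and $Q^\pi$ checked numerically; this is only an illustrative sketch, not part of the derivation.

```python
# Continuing the sketch above (reuses P, R, pi, P_pi).
gamma = 0.9

# Expected immediate reward under the policy: r_pi[s] = sum_a pi(a|s) r(s, a).
r_pi = np.einsum("sa,sa->s", pi, R)

# The recursion V = r_pi + gamma * P_pi V is a linear system; solve it directly.
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Q(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) V(s').
Q = R + gamma * np.einsum("sap,p->sa", P, V)

# V should equal the policy-weighted average of Q.
assert np.allclose(V, np.einsum("sa,sa->s", pi, Q))
```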
The goal of a reinforcement learning agent is to find a policy
$\pi$ that maximizes the expected value
$$J(\pi) = \mathbb{E}_{s \sim d^\pi}\!\left[ V^\pi(s) \right] = \sum_s d^\pi(s)\, V^\pi(s).$$
Using the recursive definition of $V^\pi$,
and the stationarity condition
$$d^\pi(s') = \sum_s d^\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a),$$
we can express $J(\pi)$ as an expectation of $r$ instead of $V^\pi$:
$$
\begin{aligned}
J(\pi) &= \sum_s d^\pi(s) \sum_a \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^\pi(s') \right] \\
&= \bar{r}(\pi) + \gamma \sum_{s'} d^\pi(s')\, V^\pi(s') \\
&= \bar{r}(\pi) + \gamma J(\pi),
\end{aligned}
$$
where $\bar{r}(\pi) = \sum_s d^\pi(s) \sum_a \pi(a \mid s)\, r(s, a)$ is the average per-step reward under $\pi$. Rearranging gives
$$J(\pi) = \frac{\bar{r}(\pi)}{1 - \gamma}.$$
Thus, for an ergodic MDP, the expected value of a policy is proportional
to the average reward under that policy, with proportionality constant $1/(1-\gamma)$, so maximizing the discounted objective is equivalent to maximizing the average reward.
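The same toy example can be used to check this claim numerically: with the quantities from the sketches above, $J(\pi)$ computed as $\mathbb{E}_{s \sim d^\pi}[V^\pi(s)]$ should match $\bar{r}(\pi)/(1-\gamma)$.

```python
# Continuing the sketch: check that J(pi) = E_{d_pi}[V] equals r_bar / (1 - gamma).
J = d_pi @ V        # expected value under the stationary distribution
r_bar = d_pi @ r_pi # average per-step reward under the policy
assert np.allclose(J, r_bar / (1 - gamma))
print(f"J(pi) = {J:.6f}, r_bar/(1-gamma) = {r_bar / (1 - gamma):.6f}")
```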