TD error vs advantage vs Bellman error
10 Aug 2017

Assume a policy $\pi$ is fixed and we want to find the value function $v_\pi$ under this policy. If the state space is continuous and high-dimensional, one can only hope to find an approximation of $v_\pi$. Let $v_\theta$ denote a function parametrized by a vector $\theta$ that we want to use as an approximation of $v_\pi$. One way to find an appropriate $\theta$ would be to minimize the mean squared value error (MSVE)

$$\mathrm{MSVE}(\theta) = \mathbb{E}_{S \sim \mu}\!\left[\bigl(v_\pi(S) - v_\theta(S)\bigr)^2\right],$$

defined as an expectation with respect to a weighting distribution $\mu$. However, we usually don't know $v_\pi$ in advance. Therefore, different approximations of this objective have been proposed that substitute some surrogate target in place of $v_\pi$. To be able to talk about those methods, we need to introduce several definitions.
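If $v_\pi$ were available, minimizing the MSVE would be a plain regression problem. Here is a minimal sketch, assuming a made-up five-state example, a linear $v_\theta(s) = \theta^\top \phi(s)$ and an oracle for $v_\pi$ (the oracle is exactly the piece we do not have in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 5 states, 3 features, a linear v_theta and an
# "oracle" v_pi. In practice v_pi is unknown -- that is the whole problem.
n_states, n_features = 5, 3
phi = rng.normal(size=(n_states, n_features))   # feature vectors phi(s)
v_pi = rng.normal(size=n_states)                # pretend we know v_pi
mu = np.full(n_states, 1.0 / n_states)          # weighting distribution mu

def msve(theta):
    # MSVE(theta) = E_mu[(v_pi(S) - v_theta(S))^2], with v_theta(s) = phi(s) @ theta
    return mu @ (v_pi - phi @ theta) ** 2

# Plain gradient descent on the MSVE.
theta = np.zeros(n_features)
for _ in range(2000):
    err = v_pi - phi @ theta                    # v_pi(s) - v_theta(s), per state
    theta -= 0.1 * (-2 * phi.T @ (mu * err))    # gradient step on the MSVE

print(msve(theta))   # roughly the best MSVE achievable with these features
```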
Definitions
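To fix notation (one concrete convention, consistent with how the errors are used below), consider a single transition with $S \sim \mu$, $A \sim \pi(\cdot \mid S)$ and $(R, S') \sim \mathcal{P}(\cdot \mid S, A)$, a discount factor $\gamma < 1$, and the current approximation $v_\theta$. The three quantities from the title can then be written as the TD error

$$\delta = R + \gamma v_\theta(S') - v_\theta(S),$$

the advantage

$$a_\theta(S, A) = \mathbb{E}\bigl[\delta \mid S, A\bigr] = \mathbb{E}_{\mathcal{P}}\bigl[R + \gamma v_\theta(S') \mid S, A\bigr] - v_\theta(S),$$

and the Bellman error

$$\bar{\delta}_\theta(S) = \mathbb{E}\bigl[\delta \mid S\bigr] = \mathbb{E}_{\pi, \mathcal{P}}\bigl[R + \gamma v_\theta(S') \mid S\bigr] - v_\theta(S).$$

The TD error is a random variable depending on the whole transition, the advantage averages it over the reward and the next state given the state and the action, and the Bellman error additionally averages over the action.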
Expected Bellman error
A natural question to ask is what happens if we take an expectation of the Bellman error. To answer it, we should first decide with respect to which distribution the expectation should be computed. In the ergodic setting, the obvious choice is the stationary distribution $\mu$, defined by the property that $\mathbb{E}_{S \sim \mu}[f(S)] = \mathbb{E}_{S \sim \mu,\, A \sim \pi,\, S' \sim \mathcal{P}}[f(S')]$ for every function $f$ of the state (below, $\mathbb{E}_{\mu, \pi, \mathcal{P}}$ denotes the expectation over $S \sim \mu$, $A \sim \pi$, $(R, S') \sim \mathcal{P}$). For the true value function $v_\pi$ the Bellman error vanishes in every state, so its expectation should be equal to zero as well,

$$\mathbb{E}_{\mu, \pi, \mathcal{P}}\bigl[R + \gamma v_\pi(S') - v_\pi(S)\bigr] = 0,$$

which, by the stationarity of $\mu$, gives us the equation

$$\mathbb{E}_{\mu, \pi, \mathcal{P}}[R] + \gamma\, \mathbb{E}_{\mu}[v_\pi(S)] - \mathbb{E}_{\mu}[v_\pi(S)] = 0,$$

or, slightly rearranged,

$$\mathbb{E}_{\mu, \pi, \mathcal{P}}[R] = (1 - \gamma)\, \mathbb{E}_{\mu}[v_\pi(S)],$$

the well-known result relating the expected reward and the value function in ergodic MDPs.
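This identity is easy to check numerically. A minimal sketch, assuming a made-up four-state, two-action ergodic MDP with a random fixed policy (all numbers hypothetical): it computes $v_\pi$ and $\mu$ exactly and compares the two sides of the equation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small ergodic MDP: 4 states, 2 actions, random dynamics and rewards.
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a, s'] transition probabilities
r = rng.normal(size=(nS, nA))                      # expected reward r(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)           # fixed policy pi[s, a]

# Policy-induced transition matrix and expected reward per state.
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sa->s', pi, r)

# True value function: v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# Stationary distribution mu of P_pi (left eigenvector for eigenvalue 1).
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
mu = mu / mu.sum()

# The identity derived above: E_mu[R] = (1 - gamma) * E_mu[v_pi(S)].
print(mu @ r_pi, (1 - gamma) * (mu @ v_pi))        # the two numbers agree
```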
Mean squared <your favourite> error
As mentioned in the introduction, several surrogates for the MSVE have been studied in reinforcement learning (RL). For example, the mean squared temporal difference error (MSTDE)

$$\mathrm{MSTDE}(\theta) = \mathbb{E}_{\mu, \pi, \mathcal{P}}\!\left[\bigl(R + \gamma v_\theta(S') - v_\theta(S)\bigr)^2\right]$$

and the mean squared Bellman error (MSBE)

$$\mathrm{MSBE}(\theta) = \mathbb{E}_{\mu}\!\left[\Bigl(\mathbb{E}_{\pi, \mathcal{P}}\bigl[R + \gamma v_\theta(S') \mid S\bigr] - v_\theta(S)\Bigr)^2\right].$$

These expressions suggest that one more error function, somewhere between the MSTDE and the MSBE, can be defined, namely the mean squared advantage (MSA)

$$\mathrm{MSA}(\theta) = \mathbb{E}_{\mu, \pi}\!\left[\Bigl(\mathbb{E}_{\mathcal{P}}\bigl[R + \gamma v_\theta(S') \mid S, A\bigr] - v_\theta(S)\Bigr)^2\right].$$
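The three objectives differ only in how much of the expectation is moved inside the square. A minimal sketch, assuming a made-up tabular MDP with rewards that are deterministic given $(s, a)$, a random fixed policy and an arbitrary $v_\theta$ (all hypothetical), computes the three quantities exactly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical tabular setup: deterministic rewards r(s, a), random dynamics,
# a fixed policy, the stationary weighting mu and an arbitrary v_theta.
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # P[s, a, s']
r = rng.normal(size=(nS, nA))                      # reward, deterministic given (s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)           # policy pi[s, a]
v = rng.normal(size=nS)                            # some approximation v_theta

# Stationary distribution of the policy-induced chain.
P_pi = np.einsum('sa,sat->st', pi, P)
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
mu = mu / mu.sum()

w_sa = mu[:, None] * pi                            # joint weight of (s, a)
w_sas = w_sa[:, :, None] * P                       # joint weight of (s, a, s')

delta = r[:, :, None] + gamma * v[None, None, :] - v[:, None, None]   # TD error per (s, a, s')
adv = r + gamma * P @ v - v[:, None]                                   # E[delta | s, a]
bellman = np.einsum('sa,sa->s', pi, adv)                               # E[delta | s]

mstde = np.sum(w_sas * delta ** 2)   # square outside all expectations
msa = np.sum(w_sa * adv ** 2)        # square outside E over (s, a), inside E over s'
msbe = mu @ bellman ** 2             # square outside E over s only
print(mstde, msa, msbe)              # one finds mstde >= msa >= msbe
```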
In a future post we will see where it comes from and what properties it has.