# TD error vs advantage vs Bellman error

10 Aug 2017

Assume a policy $\pi$ is fixed and we want to find the value function $v_\pi$ under this policy. If the state space is continuous and high-dimensional, one can only hope to find an approximation of $v_\pi$. Let $v_\theta$ denote a function parametrized by a vector $\theta$ that we want to use as an approximation of $v_\pi$. One way to find an appropriate $\theta$ would be to minimize the mean squared value error (MSVE)

$$
\text{MSVE}(\theta) = \E_\mu \left[ \left( v_\pi(s) - v_\theta(s) \right)^2 \right],
$$

defined as an expectation with respect to a weighting distribution $\mu$. However, we usually don't know $v_\pi$ in advance. Therefore, different approximations to that objective have been proposed that substitute some surrogate target in place of $v_\pi$. To be able to talk about those methods, we need to introduce several definitions.

## Definitions
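
For concreteness, the three quantities named in the title can be written as follows. This is a sketch using the standard textbook forms; the symbols $r$ (reward), $\gamma$ (discount factor), $\delta_\theta$, $A_\theta$, and $\epsilon_\theta$ are notational assumptions, not something fixed by the text above.

$$
\begin{aligned}
\text{TD error:} \quad & \delta_\theta(s, a, s') = r(s, a, s') + \gamma v_\theta(s') - v_\theta(s), \\
\text{advantage:} \quad & A_\theta(s, a) = \E_{s' \sim \mathcal{P}(\cdot \mid s, a)} \left[ \delta_\theta(s, a, s') \right], \\
\text{Bellman error:} \quad & \epsilon_\theta(s) = \E_{a \sim \pi(\cdot \mid s)} \left[ A_\theta(s, a) \right].
\end{aligned}
$$

Under these definitions the three differ only in how much of the randomness of a single transition has been averaged out: none, the next state, or both the action and the next state.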

## Expected Bellman error

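A natural question to ask is what happens if we take an expectation of the Bellman error. To answer it, we should first decide with respect to which distribution the expectation is computed. In the ergodic setting, the obvious choice is the stationary distribution $\mu$, which is defined by the property $\E_{s \sim \mu}[f(s)] = \E_{s \sim \mu,\, a \sim \pi,\, s' \sim \mathcal{P}}[f(s')]$ for any function $f$, or $\E_\mu = \E_{\mu, \pi, \mathcal{P}}$ for short: a state drawn from $\mu$ and advanced by one step is again distributed according to $\mu$. Since the Bellman error of the true value function $v_\pi$ vanishes in every state, its expectation should be equal to zero,

$$
\E_{\mu, \pi, \mathcal{P}} \left[ r + \gamma v_\pi(s') - v_\pi(s) \right] = 0,
$$

where $r$ is the reward received on the transition $(s, a, s')$ and $\gamma \in [0, 1)$ is the discount factor. Applying the stationarity property to the term with $s'$ gives us the equation

$$
\E_{\mu, \pi, \mathcal{P}}[r] + \gamma \E_\mu[v_\pi] - \E_\mu[v_\pi] = 0
$$

or, slightly rearranged,

$$
\E_{\mu, \pi, \mathcal{P}}[r] = (1 - \gamma) \, \E_\mu[v_\pi],
$$

the well-known result relating the expected reward and the value function in ergodic MDPs.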
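
The relation between expected reward and expected value can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the two-state chain `P` (the policy is already folded into the transition matrix), the expected per-state rewards `r`, the discount `gamma`, and all helper names are made up for this example.

```python
# Toy check: under the stationary distribution mu,
# the expected reward equals (1 - gamma) times the expected value.
# P, r, gamma and the helper names are hypothetical, chosen for illustration.

def matvec(P, v):
    """Return P @ v for a row-stochastic matrix P given as nested lists."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def vecmat(mu, P):
    """Return mu @ P, i.e. the state distribution after one step."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, iters=500):
    """Approximate the stationary distribution by power iteration."""
    mu = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        mu = vecmat(mu, P)
    return mu

def evaluate(P, r, gamma, iters=5000):
    """Approximate v_pi by iterating the Bellman equation v = r + gamma * P v."""
    v = [0.0] * len(r)
    for _ in range(iters):
        v = [ri + gamma * x for ri, x in zip(r, matvec(P, v))]
    return v

P = [[0.9, 0.1],
     [0.5, 0.5]]      # transition matrix induced by the fixed policy
r = [1.0, 0.0]        # expected one-step reward in each state
gamma = 0.9

mu = stationary(P)
v = evaluate(P, r, gamma)

expected_reward = sum(m * x for m, x in zip(mu, r))
expected_value = sum(m * x for m, x in zip(mu, v))

print(expected_reward, (1 - gamma) * expected_value)
```

The two printed numbers should agree up to floating-point error. Replacing `mu` with any non-stationary distribution generally breaks the equality, which is why the choice of weighting matters here.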

## Mean squared <your favourite> error

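As mentioned in the introduction, several surrogates for the MSVE have been studied in reinforcement learning (RL). Writing $\delta_\theta = r + \gamma v_\theta(s') - v_\theta(s)$ for the TD error of a transition $(s, a, s')$, two examples are the mean squared temporal difference error (MSTDE)

$$
\text{MSTDE}(\theta) = \E_{\mu, \pi, \mathcal{P}} \left[ \delta_\theta^2 \right]
$$

and the mean squared Bellman error (MSBE)

$$
\text{MSBE}(\theta) = \E_\mu \left[ \left( \E_{\pi, \mathcal{P}} \left[ \delta_\theta \right] \right)^2 \right].
$$

These expressions suggest that one more error function, somewhere between MSTDE and MSBE, can be defined, namely the mean squared advantage (MSA)

$$
\text{MSA}(\theta) = \E_{\mu, \pi} \left[ \left( \E_{\mathcal{P}} \left[ \delta_\theta \right] \right)^2 \right].
$$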
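
To see how the three objectives relate, one can compute them on a small tabular MDP. The sketch below is an illustration, not a reference implementation: it assumes the TD error has the standard form $r + \gamma v(s') - v(s)$, and the tables `P`, `R`, `pi`, `mu`, the candidate values `v`, and the function name `squared_errors` are all hypothetical.

```python
# Compute MSTDE, MSA and MSBE for a tabular value estimate v.
# The only difference between them is where the square is taken:
# before any averaging, after averaging over s', or after averaging over a and s'.

def squared_errors(P, R, pi, mu, v, gamma):
    """Return (MSTDE, MSA, MSBE).

    P[s][a][s2]  transition probabilities
    R[s][a][s2]  rewards
    pi[s][a]     policy probabilities
    mu[s]        state weighting distribution
    v[s]         candidate value function
    """
    n_states, n_actions = len(P), len(P[0])
    mstde = msa = msbe = 0.0
    for s in range(n_states):
        bellman = 0.0                      # TD error averaged over a and s'
        for a in range(n_actions):
            advantage = 0.0                # TD error averaged over s'
            for s2 in range(n_states):
                delta = R[s][a][s2] + gamma * v[s2] - v[s]
                advantage += P[s][a][s2] * delta
                mstde += mu[s] * pi[s][a] * P[s][a][s2] * delta ** 2
            msa += mu[s] * pi[s][a] * advantage ** 2
            bellman += pi[s][a] * advantage
        msbe += mu[s] * bellman ** 2
    return mstde, msa, msbe

# Hypothetical two-state, two-action MDP.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 1.0], [-1.0, 0.5]]]
pi = [[0.6, 0.4], [0.5, 0.5]]
mu = [0.5, 0.5]
v = [0.3, -0.2]
gamma = 0.9

mstde, msa, msbe = squared_errors(P, R, pi, mu, v, gamma)
print("MSTDE:", mstde)
print("MSA:  ", msa)
print("MSBE: ", msbe)
```

By Jensen's inequality, moving the square outside an inner expectation can only decrease the result, so MSBE ≤ MSA ≤ MSTDE for any `v`. When both the policy and the transitions are deterministic, the inner expectations are trivial and the three quantities coincide.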

In a future post we will see where it comes from and what properties it has.