TD error vs advantage vs Bellman error

Assume a policy $\pi$ is fixed and we want to find the value function $v_\pi$ under this policy. If the state space is continuous and high-dimensional, one can only hope to find an approximation of $v_\pi$. Let $v_\theta$ denote a function parametrized by a vector $\theta$ that we want to use as an approximation of $v_\pi$. One way to find an appropriate $\theta$ would be to minimize the mean squared value error (MSVE) $$ \mathrm{MSVE}(\theta) = \E_{s \sim \mu}\left[ \left( v_\theta(s) - v_\pi(s) \right)^2 \right], $$ an expectation with respect to a weighting distribution $\mu$. However, we usually don’t know $v_\pi$ in advance. Therefore, different approximations to that objective have been proposed that substitute some surrogate target in place of $v_\pi$. To be able to talk about those methods, we need to introduce several definitions.

Definitions

The temporal-difference (TD) error is defined as $$ \delta_\theta(s, a, s') = R(s, a, s') + \gamma v_\theta(s') - v_\theta(s), $$ where $R(s, a, s')$ is the reward and $\gamma$ is the discount factor.
The advantage is the expectation of the TD error with respect to the next state $s'$ $$ A_\theta(s, a) = \E_{s' \sim \mathcal{P}}\left[ \delta_\theta(s,a,s') \right] = \E_{s' \sim \mathcal{P}}\left[R(s, a, s') + \gamma v_\theta(s')\right] - v_\theta(s), $$ where $\mathcal{P} = \mathcal{P}(s' | s, a)$ is the transition dynamics.
The Bellman error is the expectation of the advantage with respect to the action $a$ $$ \epsilon_\theta(s) = \E_{a \sim \pi}\left[A_\theta(s, a)\right] = \E_{a \sim \pi, s' \sim \mathcal{P}} \left[R(s, a, s') + \gamma v_\theta(s')\right] - v_\theta(s), $$ where $\pi = \pi(a | s)$ is the policy.
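The three definitions above form a chain of expectations, which is easy to make concrete on a small tabular MDP. Below is a minimal sketch in NumPy; the MDP itself (states, actions, transition tensor `P`, reward tensor `R`, policy `pi`, and value estimate `v`) is a randomly generated illustration, not an example from this post.

```python
import numpy as np

# A tiny hypothetical MDP (all numbers are illustrative).
n_states, n_actions = 2, 2
gamma = 0.9
rng = np.random.default_rng(0)

# P[s, a, s'] = transition probability, R[s, a, s'] = reward.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))

# A fixed stochastic policy pi[s, a] and an arbitrary value estimate v.
pi = np.full((n_states, n_actions), 1.0 / n_actions)
v = rng.random(n_states)

def td_error(s, a, s_next):
    """TD error of a single transition (s, a, s')."""
    return R[s, a, s_next] + gamma * v[s_next] - v[s]

def advantage(s, a):
    """Expectation of the TD error over s' ~ P(. | s, a)."""
    return P[s, a] @ (R[s, a] + gamma * v) - v[s]

def bellman_error(s):
    """Expectation of the advantage over a ~ pi(. | s)."""
    return pi[s] @ np.array([advantage(s, a) for a in range(n_actions)])
```

By linearity, `advantage(s, a)` equals the $\mathcal{P}$-weighted sum of `td_error(s, a, s')` over next states, and `bellman_error(s)` the $\pi$-weighted sum of advantages, mirroring the definitions exactly.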

Expected Bellman error

A natural question to ask is what will happen if we take an expectation of the Bellman error. To answer it, we should first decide with respect to which distribution the expectation should be computed. In the ergodic setting, the obvious choice is the stationary distribution $\mu$, defined by the property that $\E_{\mu}\left[f(s)\right] = \E_{\mu, \pi, \mathcal{P}}\left[f(s')\right]$ for any function $f$ of the state. For the exact value function, $v_\theta = v_\pi$, the Bellman error vanishes at every state, so in particular its expectation is equal to zero $$ \E_{s \sim \mu}\left[ \epsilon_\theta(s) \right] = 0, $$

which gives us the equation $$ \E_{\mu, \pi, \mathcal{P}}\left[ R(s, a, s') + \gamma v_\pi(s') \right] - \E_{\mu}\left[ v_\pi(s) \right] = 0, $$

or, using the stationarity of $\mu$ and slightly rearranging, $$ \E_{\mu, \pi, \mathcal{P}}\left[ R(s, a, s') \right] = (1 - \gamma)\, \E_{\mu}\left[ v_\pi(s) \right], $$

the well-known result relating the expected reward and the value function in ergodic MDPs.
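This identity is easy to verify numerically: solve the Bellman equation for $v_\pi$ exactly, compute the stationary distribution of the induced state chain, and compare the two sides. The sketch below uses a randomly generated ergodic MDP as a stand-in; none of the specific numbers come from this post.

```python
import numpy as np

# Hypothetical ergodic MDP (illustrative numbers).
n_states, n_actions = 3, 2
gamma = 0.95
rng = np.random.default_rng(1)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions, n_states))
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

# State-transition matrix and expected one-step reward under pi.
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sat,sat->s', pi, P, R)

# Exact v_pi from the Bellman equation (I - gamma * P_pi) v = r_pi.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Stationary distribution mu: left eigenvector of P_pi for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu /= mu.sum()

# Check: E_mu[R] == (1 - gamma) * E_mu[v_pi].
print(mu @ r_pi, (1 - gamma) * (mu @ v_pi))
```

With dense random transition probabilities the chain is ergodic, so the eigenvector for eigenvalue 1 is the unique stationary distribution and the two printed numbers coincide up to floating-point error.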

Mean squared <your favourite> error

As mentioned in the introduction, several surrogates for the MSVE have been studied in reinforcement learning (RL). For example, the mean squared temporal-difference error (MSTDE) $$ \mathrm{MSTDE}(\theta) = \E_{\mu, \pi, \mathcal{P}}\left[ \delta_\theta(s, a, s')^2 \right] $$

and the mean squared Bellman error (MSBE) $$ \mathrm{MSBE}(\theta) = \E_{\mu}\left[ \epsilon_\theta(s)^2 \right]. $$
These expressions suggest that one more error function, somewhere between the MSTDE and the MSBE, can be defined. Namely, the mean squared advantage (MSA) $$ \mathrm{MSA}(\theta) = \E_{\mu, \pi}\left[ A_\theta(s, a)^2 \right]. $$
In a future post we will see where it comes from and what properties it has.