KL between trajectory distributions vs KL between policies
16 Apr 2017

The derivation presented here is inspired by these lecture notes.
Given a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_{T-1}, a_{T-1})$, and two distributions over trajectories $p_\pi$ and $p_q$ parametrized by policies $\pi$ and $q$ respectively,
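which take the form

$$
p_\pi(\tau) = \mu_0(s_0) \prod_{t=0}^{T-1} \pi_t(a_t \mid s_t) \prod_{t=0}^{T-2} p(s_{t+1} \mid s_t, a_t),
\qquad
p_q(\tau) = \mu_0(s_0) \prod_{t=0}^{T-1} q_t(a_t \mid s_t) \prod_{t=0}^{T-2} p(s_{t+1} \mid s_t, a_t),
$$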
where $\mu_0 = \mu_0(s_0)$ is the initial state distribution, $\pi_t = \pi_t(a_t | s_t)$ is the policy at time $t$, and $p_t = p(s_{t+1} | s_t, a_t)$ is the system dynamics, we can find the KL divergence between $p_\pi$ and $p_q$ as follows. By definition,
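$$
\mathrm{KL}(p_\pi \,\|\, p_q)
= \int p_\pi(\tau) \log \frac{p_\pi(\tau)}{p_q(\tau)} \, d\tau
= \mathbb{E}_{\tau \sim p_\pi}\!\left[ \log \frac{p_\pi(\tau)}{p_q(\tau)} \right],
$$

where the integral runs over all trajectories (a sum in the discrete case).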
Denote $\tau_t = (s_0, a_0, \dots, s_{t-1}, a_{t-1})$. At time $t$, we have a state-action distribution $\rho^\pi_t = \rho^\pi_t(s_t, a_t)$ that can be computed as
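$$
\rho^\pi_t(s_t, a_t)
= \int p_\pi(\tau_t, s_t, a_t) \, d\tau_t
= \pi_t(a_t \mid s_t) \int \mu_0(s_0) \prod_{k=0}^{t-1} \pi_k(a_k \mid s_k) \, p(s_{k+1} \mid s_k, a_k) \, d\tau_t,
$$

that is, by marginalizing the trajectory prefix $\tau_t$ out of the joint distribution over $(\tau_t, s_t, a_t)$ induced by $\pi$.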
Notice that it decomposes into the product $\rho^\pi_t(s_t, a_t) = \mu^\pi_t(s_t)\, \pi_t(a_t \mid s_t)$, where $\mu^\pi_t(s_t)$, the remaining integral, is the marginal state distribution at time $t$ under $\pi$.
Coming back to the KL, observe that the ratio $p_\pi / p_q$ has a lot of common terms that cancel out, leading to
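$$
\log \frac{p_\pi(\tau)}{p_q(\tau)} = \sum_{t=0}^{T-1} \log \frac{\pi_t(a_t \mid s_t)}{q_t(a_t \mid s_t)},
\qquad
\mathrm{KL}(p_\pi \,\|\, p_q) = \mathbb{E}_{\tau \sim p_\pi}\!\left[ \sum_{t=0}^{T-1} \log \frac{\pi_t(a_t \mid s_t)}{q_t(a_t \mid s_t)} \right],
$$

since the initial state distribution $\mu_0$ and the dynamics terms $p_t$ appear in both $p_\pi$ and $p_q$ and cancel in the ratio.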
Interchanging the order of summation and expectation, we arrive at
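$$
\mathrm{KL}(p_\pi \,\|\, p_q)
= \sum_{t=0}^{T-1} \mathbb{E}_{\tau \sim p_\pi}\!\left[ \log \frac{\pi_t(a_t \mid s_t)}{q_t(a_t \mid s_t)} \right]
= \sum_{t=0}^{T-1} \mathbb{E}_{(s_t, a_t) \sim \rho^\pi_t}\!\left[ \log \frac{\pi_t(a_t \mid s_t)}{q_t(a_t \mid s_t)} \right],
$$

where the second equality holds because the summand at time $t$ depends on $\tau$ only through $(s_t, a_t)$, so the expectation reduces to one over the marginal $\rho^\pi_t$.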
So, finally, the KL divergence between trajectory distributions is the sum over time of state-averaged KL divergences between policies at different time steps,
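$$
\mathrm{KL}(p_\pi \,\|\, p_q)
= \sum_{t=0}^{T-1} \mathbb{E}_{s_t \sim \mu^\pi_t}\!\Big[ \mathrm{KL}\big( \pi_t(\cdot \mid s_t) \,\|\, q_t(\cdot \mid s_t) \big) \Big].
$$

Here the decomposition $\rho^\pi_t = \mu^\pi_t \pi_t$ from above turns each state-action expectation into a state expectation of the per-step KL between the two policies' action distributions.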