KL between trajectory distributions vs KL between policies
16 Apr 2017

The derivation presented here is inspired by these lecture notes.
Given a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_{T-1}, a_{T-1})$, and two distributions over trajectories $p_\pi$ and $p_q$ parametrized by policies $\pi$ and $q$ respectively,
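which take the form

$$
p_\pi(\tau) = \mu_0(s_0) \prod_{t=0}^{T-1} \pi_t(a_t \mid s_t) \prod_{t=0}^{T-2} p(s_{t+1} \mid s_t, a_t),
\qquad
p_q(\tau) = \mu_0(s_0) \prod_{t=0}^{T-1} q_t(a_t \mid s_t) \prod_{t=0}^{T-2} p(s_{t+1} \mid s_t, a_t),
$$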
where $\mu_0 = \mu_0(s_0)$ is the initial state distribution, $\pi_t = \pi_t(a_t | s_t)$ is the policy at time $t$, and $p_t = p(s_{t+1} | s_t, a_t)$ is the system dynamics, we can find the KL divergence between $p_\pi$ and $p_q$ as follows. By definition,
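$$
\mathrm{KL}(p_\pi \,\|\, p_q)
= \int p_\pi(\tau) \log \frac{p_\pi(\tau)}{p_q(\tau)} \, d\tau
= \mathbb{E}_{\tau \sim p_\pi}\!\left[ \log \frac{p_\pi(\tau)}{p_q(\tau)} \right],
$$

where the integral runs over all trajectories (a sum in the discrete case).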
Denote $\tau_t = (s_0, a_0, \dots, s_{t-1}, a_{t-1})$. At time $t$, we have a state-action distribution $\rho^\pi_t = \rho^\pi_t(s_t, a_t)$ that can be computed as
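$$
\rho^\pi_t(s_t, a_t)
= \int p_\pi(\tau_t, s_t, a_t) \, d\tau_t
= \pi_t(a_t \mid s_t) \int \mu_0(s_0) \prod_{k=0}^{t-1} \pi_k(a_k \mid s_k) \, p(s_{k+1} \mid s_k, a_k) \, d\tau_t,
$$

that is, by marginalizing the trajectory prefix $\tau_t$ out of the joint distribution over $(\tau_t, s_t, a_t)$ induced by $\pi$.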
Notice that it decomposes into the product $\rho^\pi_t(s_t, a_t) = \mu^\pi_t(s_t)\, \pi_t(a_t \mid s_t)$, where $\mu^\pi_t(s_t)$, the remaining integral, is the marginal state distribution at time $t$ under $\pi$.
Coming back to the KL, observe that the ratio $p_\pi / p_q$ has a lot of common terms that cancel out, leading to
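$$
\log \frac{p_\pi(\tau)}{p_q(\tau)} = \sum_{t=0}^{T-1} \log \frac{\pi_t(a_t \mid s_t)}{q_t(a_t \mid s_t)},
\qquad
\mathrm{KL}(p_\pi \,\|\, p_q) = \mathbb{E}_{\tau \sim p_\pi}\!\left[ \sum_{t=0}^{T-1} \log \frac{\pi_t(a_t \mid s_t)}{q_t(a_t \mid s_t)} \right],
$$

since the initial state distribution $\mu_0$ and the dynamics terms $p_t$ appear in both $p_\pi$ and $p_q$ and cancel in the ratio.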
Interchanging the order of summation and expectation, we arrive at
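$$
\mathrm{KL}(p_\pi \,\|\, p_q)
= \sum_{t=0}^{T-1} \mathbb{E}_{\tau \sim p_\pi}\!\left[ \log \frac{\pi_t(a_t \mid s_t)}{q_t(a_t \mid s_t)} \right]
= \sum_{t=0}^{T-1} \mathbb{E}_{(s_t, a_t) \sim \rho^\pi_t}\!\left[ \log \frac{\pi_t(a_t \mid s_t)}{q_t(a_t \mid s_t)} \right],
$$

where the second equality holds because the summand at time $t$ depends on $\tau$ only through $(s_t, a_t)$, so the expectation reduces to one over the marginal $\rho^\pi_t$.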
So, finally, the KL divergence between trajectory distributions is the sum over time of state-averaged KL divergences between policies at different time steps,
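$$
\mathrm{KL}(p_\pi \,\|\, p_q)
= \sum_{t=0}^{T-1} \mathbb{E}_{s_t \sim \mu^\pi_t}\!\Big[ \mathrm{KL}\big( \pi_t(\cdot \mid s_t) \,\|\, q_t(\cdot \mid s_t) \big) \Big].
$$

Here the decomposition $\rho^\pi_t = \mu^\pi_t \pi_t$ from above turns each state-action expectation into a state expectation of the per-step KL between the two policies' action distributions.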