$\alpha$-Divergence between Gaussians
18 Aug 2017

The $\alpha$-divergence between distributions $P$ and $Q$ with densities $p$ and $q$ is defined as
\begin{equation} \newcommand{\E}{\mathbb{E}} \newcommand{\R}{\mathbb{R}} \label{a-div} D_\alpha(p\|q) = \E_q \left[ f_\alpha \left( \frac{p}{q} \right) \right] \end{equation}
with
\begin{equation} f_\alpha(t) = \frac{t^\alpha - \alpha t + \alpha - 1}{\alpha (\alpha - 1)}, \end{equation}
where $\alpha \in \R$.
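A small rewriting step, worth making explicit since it is used implicitly below: because $\E_q[p/q] = 1$, the linear terms of $f_\alpha$ drop out of the expectation, and
\begin{equation} D_\alpha(p\|q) = \frac{1}{\alpha (\alpha - 1)} \left( \E_q \left[ \left( \frac{p}{q} \right)^{\alpha} \right] - 1 \right) = \frac{1}{\alpha (\alpha - 1)} \left( \int p(x)^\alpha q(x)^{1-\alpha} \, dx - 1 \right), \end{equation}
so everything hinges on the $\alpha$-th moment of the density ratio.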
Let $p(x) = N(x|\mu, \sigma^2)$ and $q(x) = N(x|\nu, \rho^2)$; then their ratio is
\begin{equation} \frac{p(x)}{q(x)} = \frac{\rho}{\sigma} \exp\left( \frac{(x-\nu)^2}{2\rho^2} - \frac{(x-\mu)^2}{2\sigma^2} \right). \end{equation}
Substituting the expectation of the $\alpha$-th power of this ratio,
\begin{equation} \E_q \left[ \left( \frac{p}{q} \right)^{\alpha} \right] = \int p(x)^\alpha q(x)^{1-\alpha} \, dx = \frac{\sigma^{1-\alpha} \rho^{\alpha}}{\sqrt{\alpha \rho^2 + (1-\alpha) \sigma^2}} \exp\left( -\frac{\alpha (1-\alpha) (\mu - \nu)^2}{2 \left( \alpha \rho^2 + (1-\alpha) \sigma^2 \right)} \right), \end{equation}
into $D_\alpha$, we obtain
\begin{equation} \label{D} D_\alpha(p\|q) = \frac{1}{\alpha (\alpha - 1)} \left( \frac{\sigma^{1-\alpha} \rho^{\alpha}}{\sqrt{\alpha \rho^2 + (1-\alpha) \sigma^2}} \exp\left( -\frac{\alpha (1-\alpha) (\mu - \nu)^2}{2 \left( \alpha \rho^2 + (1-\alpha) \sigma^2 \right)} \right) - 1 \right). \end{equation}
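As a quick sanity check of \eqref{D}, here is a small sketch of my own (not part of the original derivation; it assumes NumPy/SciPy, and the function names are made up) that compares the closed form with a direct numerical integration of the definition \eqref{a-div}:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Hypothetical helper names; sigma and rho are standard deviations, not variances.
def alpha_div_closed_form(mu, sigma, nu, rho, alpha):
    """D_alpha(N(mu, sigma^2) || N(nu, rho^2)) via the closed form (D)."""
    denom = alpha * rho**2 + (1 - alpha) * sigma**2
    if denom <= 0:
        raise ValueError("closed form is real only if alpha*rho^2 + (1-alpha)*sigma^2 > 0")
    moment = (sigma**(1 - alpha) * rho**alpha / np.sqrt(denom)
              * np.exp(-alpha * (1 - alpha) * (mu - nu)**2 / (2 * denom)))
    return (moment - 1) / (alpha * (alpha - 1))

def alpha_div_numeric(mu, sigma, nu, rho, alpha):
    """The same divergence by numerically integrating E_q[f_alpha(p/q)]."""
    def f(t):
        return (t**alpha - alpha * t + alpha - 1) / (alpha * (alpha - 1))
    def integrand(x):
        # compute the density ratio in log space to avoid 0/0 in the tails
        ratio = np.exp(norm.logpdf(x, mu, sigma) - norm.logpdf(x, nu, rho))
        return norm.pdf(x, nu, rho) * f(ratio)
    value, _ = integrate.quad(integrand, -30, 30)
    return value

print(alpha_div_closed_form(0.3, 1.0, -0.2, 1.5, 0.7))  # these two numbers
print(alpha_div_numeric(0.3, 1.0, -0.2, 1.5, 0.7))       # should agree closely
```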
A curious fact is that expression \eqref{D} is guaranteed to be real only when $\alpha \in [0, 1]$. The $\alpha$-divergence \eqref{a-div}, on the other hand, is defined for any $\alpha \in \R$. One can nevertheless use \eqref{D} even for $\alpha \notin [0, 1]$, but doing so imposes certain restrictions on the variances of the Gaussians. Namely, the argument of the square root should be positive,
\begin{equation} \label{var} \alpha \rho^2 + (1-\alpha) \sigma^2 > 0. \end{equation}
For example, the Pearson $\chi^2$ divergence, corresponding to $\alpha = 2$, can be computed according to this formula only if $\sigma^2 < 2\rho^2$. This sounds a bit restrictive, and it is not quite clear where such a limitation comes from.
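One way to see where the restriction comes from (an aside of mine, not spelled out in the original argument): the quantity we actually compute is $\int p^\alpha q^{1-\alpha} \, dx$, and the exponent of its integrand is a quadratic in $x$,
\begin{equation} p(x)^\alpha q(x)^{1-\alpha} \propto \exp\left( -\frac{\alpha \rho^2 + (1-\alpha) \sigma^2}{2 \sigma^2 \rho^2} \, x^2 + \text{lower-order terms} \right), \end{equation}
so condition \eqref{var} is precisely the requirement that this Gaussian-type integral be finite.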
KL divergence: $\alpha \to 1$
The KL divergence corresponds to the limit
\begin{equation} \lim_{\alpha \to 1} D_\alpha(p\|q) = \mathrm{KL}(p\|q) = \log \frac{\rho}{\sigma} + \frac{\sigma^2 + (\mu - \nu)^2}{2 \rho^2} - \frac{1}{2}. \end{equation}
We encourage the curious reader to derive this result on her own and then compare it with Wikipedia.
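If you would like a hint, here is one possible route (my suggestion, certainly not the only way): both the numerator and the denominator of $D_\alpha$, written as $\frac{1}{\alpha(\alpha-1)}\left(\E_q[(p/q)^\alpha] - 1\right)$, vanish at $\alpha = 1$, so L'Hôpital's rule in $\alpha$ gives
\begin{equation} \lim_{\alpha \to 1} D_\alpha(p\|q) = \lim_{\alpha \to 1} \frac{\frac{\partial}{\partial \alpha} \E_q \left[ (p/q)^\alpha \right]}{2\alpha - 1} = \E_q \left[ \frac{p}{q} \log \frac{p}{q} \right] = \E_p \left[ \log \frac{p}{q} \right], \end{equation}
which is exactly $\mathrm{KL}(p\|q)$; plugging in the Gaussian densities then yields the expression above.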
Hellinger distance: $\alpha = \frac{1}{2}$
We obtain the Hellinger distance when substituting $\alpha = \frac{1}{2}$ into $D_\alpha$,
\begin{equation} D_{1/2}(p\|q) = 4 \left( 1 - \sqrt{\frac{2 \sigma \rho}{\sigma^2 + \rho^2}} \exp\left( -\frac{(\mu - \nu)^2}{4 (\sigma^2 + \rho^2)} \right) \right). \end{equation}
Wikipedia uses a different scaling factor in its definition of the Hellinger distance, which is why it doesn't have the $4$ in front.
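To make the scaling explicit (assuming Wikipedia's convention $H^2(p, q) = 1 - \int \sqrt{p \, q} \, dx$ for the squared Hellinger distance):
\begin{equation} H^2(p, q) = \tfrac{1}{4} D_{1/2}(p\|q) = 1 - \sqrt{\frac{2 \sigma \rho}{\sigma^2 + \rho^2}} \exp\left( -\frac{(\mu - \nu)^2}{4 (\sigma^2 + \rho^2)} \right). \end{equation}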
Pearson divergence: $\alpha = 2$
Assuming $\sigma^2 < 2\rho^2$, we can write
\begin{equation} D_2(p\|q) = \frac{1}{2} \left( \frac{\rho^2}{\sigma \sqrt{2 \rho^2 - \sigma^2}} \exp\left( \frac{(\mu - \nu)^2}{2 \rho^2 - \sigma^2} \right) - 1 \right). \end{equation}
We can also compute it directly by taking the expectation of $f_2(x) = \frac{1}{2}(x-1)^2$,
\begin{equation} D_2(p\|q) = \E_q \left[ \frac{1}{2} \left( \frac{p}{q} - 1 \right)^2 \right] = \frac{1}{2} \left( \E_q \left[ \frac{p^2}{q^2} \right] - 2 \E_q \left[ \frac{p}{q} \right] + 1 \right) = \frac{1}{2} \left( \int \frac{p(x)^2}{q(x)} \, dx - 1 \right), \end{equation}
where the last step uses $\E_q[p/q] = 1$; the remaining integral is the $\alpha$-th moment from before with $\alpha = 2$.
Not surprisingly, we arrive at the same result. So, it seems the Pearson $\chi^2$ divergence is only well-defined for certain pairs of Gaussians; namely, for those satisfying $\sigma^2 < 2\rho^2$. Otherwise, the closed-form expression becomes complex, and it is not clear how to interpret that.
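Tying this back to the earlier remark about \eqref{var} (again an aside of mine): for $\alpha = 2$ the relevant integrand is $p^2/q$,
\begin{equation} \frac{p(x)^2}{q(x)} \propto \exp\left( \left( \frac{1}{2\rho^2} - \frac{1}{\sigma^2} \right) x^2 + \text{lower-order terms} \right), \end{equation}
so when $\sigma^2 \ge 2\rho^2$ the exponent no longer decays, the integral $\int p^2/q \, dx$ diverges, and the complex value of the closed form simply signals that the formula is being used outside its range of validity.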
Variance constraint for $\alpha > 1$
For $\alpha > 1$, condition \eqref{var} implies the inequality
\begin{equation} \sigma^2 < \frac{\alpha}{\alpha - 1} \rho^2, \end{equation}
which says that the variance of $p$ should be smaller than the variance of $q$ times a constant factor. Curiously, when $\alpha \to \infty$, the right-hand side tends to $\rho^2$, leading to the inequality $\sigma^2 \leq \rho^2$. Thus, we come to the following conclusion: for \eqref{D} to remain real for every $\alpha > 1$, the variance of $p$ must not exceed the variance of $q$.
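To get a feel for how quickly the factor $\frac{\alpha}{\alpha - 1}$ tightens, here are a couple of concrete values (numbers of my own choosing):
\begin{equation} \alpha = 2: \; \sigma^2 < 2 \rho^2, \qquad \alpha = 11: \; \sigma^2 < 1.1 \, \rho^2, \qquad \alpha \to \infty: \; \sigma^2 \leq \rho^2. \end{equation}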
Variance constraint for $\alpha < 0$
The $\alpha$-divergence has a certain symmetry property. Namely, replacing $\alpha$ with $1 - \alpha$, its reflection about $\frac{1}{2}$, is equivalent to swapping $p$ and $q$: $D_\alpha(p\|q) = D_{1-\alpha}(q\|p)$. For example, $D_1(p\|q) = D_0(q\|p)$ or $D_2(p\|q) = D_{-1}(q\|p)$. Therefore, we can transfer our conclusions from the case $\alpha > 1$ to the case $\alpha < 0$ by swapping the roles of $\sigma$ and $\rho$.
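Spelling this out for completeness (my own rearrangement of \eqref{var}, not in the original text): for $\alpha < 0$ the condition reads
\begin{equation} \rho^2 < \frac{1 - \alpha}{-\alpha} \, \sigma^2 = \left( 1 + \frac{1}{|\alpha|} \right) \sigma^2, \end{equation}
and as $\alpha \to -\infty$ the right-hand side tends to $\sigma^2$, so for \eqref{D} to remain real for every $\alpha < 0$ the variance of $q$ must not exceed the variance of $p$, mirroring the conclusion above.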