# $\alpha$-Divergence between Gaussians

The $\alpha$-divergence between distributions $P$ and $Q$ with densities $p$ and $q$ is defined as

$$\newcommand{\E}{\mathbb{E}} \newcommand{\R}{\mathbb{R}} \label{a-div} D_\alpha(p\|q) = \E_q \left[ f_\alpha \left( \frac{p}{q} \right) \right]$$

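with

$$f_\alpha(x) = \frac{x^\alpha - \alpha x + \alpha - 1}{\alpha(\alpha - 1)},$$

where $\alpha \in \R$; the values $\alpha = 0$ and $\alpha = 1$ are understood as limits.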

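Let $p(x) = N(x|\mu, \sigma^2)$ and $q(x) = N(x|\nu, \rho^2)$; then their ratio is

$$\frac{p(x)}{q(x)} = \frac{\rho}{\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} + \frac{(x-\nu)^2}{2\rho^2} \right).$$

Substituting the expectation of the $\alpha$-th power of this ratio,

$$\E_q \left[ \left( \frac{p}{q} \right)^\alpha \right] = \frac{\rho^\alpha \sigma^{1-\alpha}} {\sqrt{\alpha \rho^2 + (1-\alpha) \sigma^2}} e^{-\frac{\alpha(1-\alpha)}{\alpha \rho^2 + (1-\alpha) \sigma^2} \frac{(\mu-\nu)^2}{2}},$$

into $D_\alpha$, we obtain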

$$\label{D} D_\alpha(p\|q) = \frac{1}{\alpha(1-\alpha)} \left( 1 - \frac{\rho^\alpha \sigma^{1-\alpha}} {\sqrt{\alpha \rho^2 + (1-\alpha) \sigma^2}} e^{-\frac{\alpha(1-\alpha)}{\alpha \rho^2 + (1-\alpha) \sigma^2} \frac{(\mu-\nu)^2}{2}} \right).$$
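A quick way to gain confidence in \eqref{D} is to compare it with a brute-force evaluation of $\E_q[(p/q)^\alpha] = \int p^\alpha q^{1-\alpha}\,dx$. The following is a minimal sketch using only the Python standard library; the function names, the integration window $[-30, 30]$ and the midpoint rule are illustrative choices, not part of the derivation.

```python
import math

def alpha_div_closed(alpha, mu, sigma, nu, rho):
    """Closed form of D_alpha(p||q) for p = N(mu, sigma^2), q = N(nu, rho^2)."""
    s = alpha * rho**2 + (1 - alpha) * sigma**2
    if alpha in (0, 1) or s <= 0:
        raise ValueError("need alpha not in {0, 1} and alpha*rho^2 + (1-alpha)*sigma^2 > 0")
    moment = (rho**alpha * sigma**(1 - alpha) / math.sqrt(s)
              * math.exp(-alpha * (1 - alpha) * (mu - nu)**2 / (2 * s)))
    return (1 - moment) / (alpha * (1 - alpha))

def log_normal_pdf(x, mean, std):
    """Log-density of N(mean, std^2) at x."""
    return -0.5 * ((x - mean) / std)**2 - math.log(std * math.sqrt(2 * math.pi))

def alpha_div_numeric(alpha, mu, sigma, nu, rho, lo=-30.0, hi=30.0, n=20_000):
    """Midpoint-rule estimate of (1 - E_q[(p/q)^alpha]) / (alpha * (1 - alpha))."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        # p^alpha * q^(1-alpha), evaluated in log space so that an
        # underflowed density never ends up in a denominator
        total += math.exp(alpha * log_normal_pdf(x, mu, sigma)
                          + (1 - alpha) * log_normal_pdf(x, nu, rho))
    return (1 - total * h) / (alpha * (1 - alpha))

# Compare both evaluations for a few values of alpha (assumed test parameters).
for alpha in (0.3, 0.5, 2.0):
    print(alpha,
          alpha_div_closed(alpha, 0.0, 1.0, 1.0, 1.5),
          alpha_div_numeric(alpha, 0.0, 1.0, 1.0, 1.5))
```

For these parameters the two columns should agree to many digits; the numerical routine is only trustworthy when the integrand actually decays inside the chosen window.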

A curious fact is that this expression is guaranteed to be real for every pair of variances only when $\alpha \in [0, 1]$. The $\alpha$-divergence \eqref{a-div}, on the other hand, is defined for any $\alpha \in \R$. One can nevertheless use \eqref{D} for $\alpha \notin [0, 1]$, but this imposes a restriction on the variances of the Gaussians. Namely, the argument of the square root should be positive,

$$\label{var} \alpha \rho^2 + (1-\alpha) \sigma^2 > 0.$$

For example, the Pearson $\chi^2$ divergence, corresponding to $\alpha = 2$, can be computed according to this formula only if $\sigma^2 < 2\rho^2$. This sounds a bit restrictive, and it is not quite clear where such a limitation comes from.

## KL divergence: $\alpha \to 1$

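The KL divergence corresponds to the limit

$$\mathrm{KL}(p\|q) = \lim_{\alpha \to 1} D_\alpha(p\|q) = \log \frac{\rho}{\sigma} + \frac{\sigma^2 + (\mu-\nu)^2}{2\rho^2} - \frac{1}{2}.$$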

We encourage the curious reader to derive this result on her own and then compare it with Wikipedia.
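Before attempting the derivation, the limit can be checked numerically. The sketch below evaluates \eqref{D} at $\alpha = 1 - \varepsilon$ for shrinking $\varepsilon$ and compares it with the standard closed-form KL divergence between two univariate Gaussians. The function names and the test parameters are assumptions made for illustration.

```python
import math

def alpha_div_closed(alpha, mu, sigma, nu, rho):
    """Closed form of D_alpha(p||q); valid when alpha*rho^2 + (1-alpha)*sigma^2 > 0."""
    s = alpha * rho**2 + (1 - alpha) * sigma**2
    moment = (rho**alpha * sigma**(1 - alpha) / math.sqrt(s)
              * math.exp(-alpha * (1 - alpha) * (mu - nu)**2 / (2 * s)))
    return (1 - moment) / (alpha * (1 - alpha))

def kl_gauss(mu, sigma, nu, rho):
    """Standard closed-form KL(p||q) for p = N(mu, sigma^2), q = N(nu, rho^2)."""
    return math.log(rho / sigma) + (sigma**2 + (mu - nu)**2) / (2 * rho**2) - 0.5

mu, sigma, nu, rho = 0.0, 1.0, 1.0, 1.5
for eps in (1e-2, 1e-4, 1e-6):
    # The gap should shrink roughly in proportion to eps.
    gap = alpha_div_closed(1 - eps, mu, sigma, nu, rho) - kl_gauss(mu, sigma, nu, rho)
    print(f"alpha = 1 - {eps:g}: D_alpha - KL = {gap:.3e}")
```

Pushing $\varepsilon$ much below $10^{-8}$ is not advisable in floating point, because $1 - \E_q[(p/q)^\alpha]$ then suffers from cancellation.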

## Hellinger distance: $\alpha = \frac{1}{2}$

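Substituting $\alpha = \frac{1}{2}$ into $D_\alpha$, we obtain the squared Hellinger distance up to a constant factor,

$$D_{\frac{1}{2}}(p\|q) = 4 \left( 1 - \sqrt{\frac{2\sigma\rho}{\sigma^2 + \rho^2}} \, e^{-\frac{(\mu-\nu)^2}{4(\sigma^2 + \rho^2)}} \right).$$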

Wikipedia uses a different scaling factor in the definition of the Hellinger distance, which is why its formula lacks the factor of $4$ in front.
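To make the scaling explicit, the following sketch compares $D_{1/2}$, obtained by setting $\alpha = \frac{1}{2}$ in \eqref{D}, with a numerical estimate of the squared Hellinger distance in the normalization $H^2 = \frac{1}{2} \int (\sqrt{p} - \sqrt{q})^2 \, dx$. The helper names and the integration window are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std)**2) / (std * math.sqrt(2 * math.pi))

def hellinger_sq_numeric(mu, sigma, nu, rho, lo=-30.0, hi=30.0, n=20_000):
    """Midpoint-rule estimate of H^2 = 1/2 * int (sqrt(p) - sqrt(q))^2 dx."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        diff = math.sqrt(normal_pdf(x, mu, sigma)) - math.sqrt(normal_pdf(x, nu, rho))
        total += diff * diff
    return 0.5 * total * h

def d_half_closed(mu, sigma, nu, rho):
    """D_alpha at alpha = 1/2, i.e. 4 * (1 - int sqrt(p * q) dx)."""
    overlap = (math.sqrt(2 * sigma * rho / (sigma**2 + rho**2))
               * math.exp(-(mu - nu)**2 / (4 * (sigma**2 + rho**2))))
    return 4 * (1 - overlap)

mu, sigma, nu, rho = 0.0, 1.0, 1.0, 1.5
print(d_half_closed(mu, sigma, nu, rho), 4 * hellinger_sq_numeric(mu, sigma, nu, rho))
```

Under these assumptions the two printed numbers should coincide up to integration error, confirming that $D_{1/2} = 4 H^2$.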

## Pearson divergence: $\alpha = 2$

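Assuming $\sigma^2 < 2\rho^2$, we can write

$$D_2(p\|q) = \frac{1}{2} \left( \frac{\rho^2}{\sigma \sqrt{2\rho^2 - \sigma^2}} \, e^{\frac{(\mu-\nu)^2}{2\rho^2 - \sigma^2}} - 1 \right).$$

We can also compute it directly by taking the expectation of $f_2\left(\frac{p}{q}\right)$ with $f_2(x) = \frac{1}{2}(x-1)^2$,

$$D_2(p\|q) = \frac{1}{2} \E_q \left[ \left( \frac{p}{q} - 1 \right)^2 \right] = \frac{1}{2} \left( \int \frac{p^2(x)}{q(x)} \, dx - 1 \right) = \frac{1}{2} \left( \frac{\rho^2}{\sigma \sqrt{2\rho^2 - \sigma^2}} \, e^{\frac{(\mu-\nu)^2}{2\rho^2 - \sigma^2}} - 1 \right).$$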

Not surprisingly, we arrive at the same result. So, it seems the Pearson $\chi^2$ divergence is only well-defined for certain pairs of Gaussians; namely, for those satisfying $\sigma^2 < 2\rho^2$. Otherwise, expression \eqref{D} becomes complex, and it’s not clear how to interpret that.

## Variance constraint for $\alpha > 1$

For $\alpha > 1$, condition \eqref{var} implies the inequality

$$\sigma^2 < \frac{\alpha}{\alpha - 1} \rho^2,$$

which says that the variance of $p$ should be smaller than the variance of $q$ times a constant factor. Curiously, when $\alpha \to \infty$, the right-hand side tends to $\rho^2$, leading to the inequality $\sigma^2 \leq \rho^2$. Thus, we come to the following conclusion:

For Gaussian densities $p$ and $q$, when $\alpha \to \infty$, one cannot measure distance from $q$ to $p$ using the $\alpha$-divergence if the variance of $p$ is bigger than the variance of $q$, $$\alpha \to \infty \;\Rightarrow\; \sigma^2 \leq \rho^2.$$ On the other hand, when $\alpha \searrow 1$, the constraint on the variances disappears, $$\alpha \searrow 1 \;\Rightarrow\; \sigma^2 < \infty.$$ Such constraints emerge from the requirement that the integral $\E_q \left[ f_\alpha\left( \frac{p}{q} \right) \right]$ has to exist.
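The role of integrability can be illustrated numerically for $\alpha = 2$. The sketch below estimates the truncated integral $\int_{-R}^{R} p^\alpha q^{1-\alpha} \, dx$ for growing $R$: when $\sigma^2$ is below the bound $\frac{\alpha}{\alpha-1}\rho^2$ the values settle, and when it is above the bound they blow up. This is an illustration rather than a proof, and the function names, parameters and radii are assumptions chosen for the example.

```python
import math

def truncated_moment(alpha, mu, sigma, nu, rho, radius, n=20_000):
    """Midpoint-rule estimate of int_{-radius}^{radius} p(x)^alpha q(x)^(1-alpha) dx."""
    log_zp = math.log(sigma * math.sqrt(2 * math.pi))
    log_zq = math.log(rho * math.sqrt(2 * math.pi))
    h = 2 * radius / n
    total = 0.0
    for i in range(n):
        x = -radius + (i + 0.5) * h
        log_p = -0.5 * ((x - mu) / sigma)**2 - log_zp
        log_q = -0.5 * ((x - nu) / rho)**2 - log_zq
        total += math.exp(alpha * log_p + (1 - alpha) * log_q)
    return total * h

def max_sigma_sq(alpha, rho):
    """Upper bound on sigma^2 implied by alpha*rho^2 + (1-alpha)*sigma^2 > 0 for alpha > 1."""
    return alpha / (alpha - 1) * rho**2

alpha, rho = 2.0, 1.0
for sigma in (1.2, 2.0):  # sigma^2 = 1.44 is below the bound 2, sigma^2 = 4 is above it
    values = [truncated_moment(alpha, 0.0, sigma, 0.0, rho, r) for r in (5.0, 10.0, 20.0)]
    print(sigma**2, max_sigma_sq(alpha, rho), values)
```

The reason is visible in the exponent of the integrand: its quadratic coefficient is $-\frac{1}{2}\left(\frac{\alpha}{\sigma^2} + \frac{1-\alpha}{\rho^2}\right)$, which is negative exactly when $\alpha \rho^2 + (1-\alpha) \sigma^2 > 0$.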

## Variance constraint for $\alpha < 0$

The $\alpha$-divergence has a certain symmetry property. Namely, replacing $\alpha$ by $1 - \alpha$, its mirror image with respect to $\frac{1}{2}$, is equivalent to swapping $p$ and $q$, that is, $D_\alpha(p\|q) = D_{1-\alpha}(q\|p)$. For example, $D_1(p\|q) = D_0(q\|p)$ and $D_2(p\|q) = D_{-1}(q\|p)$. Therefore, we can transfer our conclusions from the case $\alpha > 1$ to the case $\alpha < 0$ by swapping the roles of $\sigma$ and $\rho$.

For Gaussian densities $p$ and $q$, when $\alpha \to -\infty$, one cannot measure distance from $q$ to $p$ using the $\alpha$-divergence if the variance of $p$ is smaller than the variance of $q$, $$\alpha \to -\infty \;\Rightarrow\; \sigma^2 \geq \rho^2.$$ On the other hand, when $\alpha \nearrow 0$, the constraint on the variances disappears, $$\alpha \nearrow 0 \;\Rightarrow\; \sigma^2 > 0.$$ In general, the condition $$\alpha < 0 \;\Rightarrow\; \sigma^2 > \frac{\alpha}{\alpha-1} \rho^2$$ bounds the variance of $p$ from below.
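Both the symmetry and the lower bound are easy to check on the closed form \eqref{D}. The sketch below does so for $\alpha = -1$; the function names and parameter values are illustrative assumptions.

```python
import math

def alpha_div_closed(alpha, mu, sigma, nu, rho):
    """Closed form of D_alpha(p||q) for p = N(mu, sigma^2), q = N(nu, rho^2)."""
    s = alpha * rho**2 + (1 - alpha) * sigma**2
    moment = (rho**alpha * sigma**(1 - alpha) / math.sqrt(s)
              * math.exp(-alpha * (1 - alpha) * (mu - nu)**2 / (2 * s)))
    return (1 - moment) / (alpha * (1 - alpha))

def is_admissible(alpha, sigma, rho):
    """True when alpha*rho^2 + (1-alpha)*sigma^2 > 0, so the closed form is real."""
    return alpha * rho**2 + (1 - alpha) * sigma**2 > 0

mu, sigma, nu, rho = 0.0, 1.5, 1.0, 1.0

# Symmetry: D_{-1}(p||q) versus D_2(q||p), with the roles of p and q swapped.
print(alpha_div_closed(-1.0, mu, sigma, nu, rho))
print(alpha_div_closed(2.0, nu, rho, mu, sigma))

# Lower bound on sigma^2 for alpha < 0: sigma^2 > alpha / (alpha - 1) * rho^2.
alpha = -1.0
lower_bound = alpha / (alpha - 1) * rho**2
print(lower_bound, is_admissible(alpha, 0.5, rho), is_admissible(alpha, 1.5, rho))
```

With $\rho = 1$ and $\alpha = -1$ the bound is $\sigma^2 > \frac{1}{2}$, so $\sigma = 0.5$ is rejected while $\sigma = 1.5$ is accepted.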