The $\alpha$-Gaussian

13 Oct 2018

It would be a nice story to tell if the maximum $\alpha$-entropy distribution turned out to be the $\alpha$-Gaussian. But it wouldn’t deserve a post on its own, would it?

Motivation

Do you remember the post where we computed the $\alpha$-divergence between two Gaussian distributions on the real line? Formula (2) derived there had the odd property that it applies only if a certain condition (3) on the variances is satisfied. This means, for example, that one cannot freely pick any two Gaussians and ask for the $2$-divergence between them: the $2$-divergence is defined only if the variance of the Gaussian in the numerator is smaller than twice the variance of the Gaussian in the denominator. Back then I could not explain this strange phenomenon; this post is an attempt to understand it better.

The $\alpha$-divergence

The $\alpha$-divergence is a generalization of the KL divergence obtained by a smooth deformation of the logarithm function. More concretely, let’s define a new function $\log_\alpha \colon (0, \infty) \to \mathbb{R}$ with a parameter $\alpha \in \mathbb{R} \setminus \{1\}$, called the $\alpha$-logarithm, \begin{equation} \log_{\alpha}(x):=\frac{x^{\alpha-1}-1}{\alpha-1}. \end{equation} In the limit $\alpha \to 1$, this function coincides with the familiar natural logarithm. Note that $\alpha$ is a real-valued parameter and not the base of the logarithm; no confusion should arise because only the natural logarithm is used here.
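To make the limit $\alpha \to 1$ tangible, here is a minimal Python sketch of the $\alpha$-logarithm. The function name `log_alpha` and the tolerance used to switch to the natural logarithm are choices made for this illustration only.

```python
import math

def log_alpha(x: float, alpha: float) -> float:
    """alpha-logarithm (x**(alpha - 1) - 1) / (alpha - 1) for x > 0."""
    if x <= 0:
        raise ValueError("log_alpha is only evaluated here for x > 0")
    if abs(alpha - 1.0) < 1e-12:
        # Limiting case alpha -> 1: the ordinary natural logarithm.
        return math.log(x)
    return (x ** (alpha - 1.0) - 1.0) / (alpha - 1.0)

# As alpha approaches 1, the values approach math.log(3.0).
for alpha in (2.0, 1.1, 1.001, 1.0):
    print(alpha, log_alpha(3.0, alpha), math.log(3.0))
```

For $\alpha = 2$ the function is simply $x - 1$, the first-order approximation of $\log x$ around $x = 1$.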

The $\alpha$-divergence is defined analogously to the KL divergence, \begin{equation} \text{KL}_{\alpha}(p\|q) := \frac{1}{\alpha} \int_{-\infty}^{\infty} p(x)\log_{\alpha}\left(\frac{p(x)}{q(x)}\right)dx, \end{equation} with the natural logarithm $\log(x)$ replaced by $\alpha^{-1}\log_\alpha(x)$. For more details on the definition of the $\alpha$-divergence, see the great article "Families of Alpha- Beta- and Gamma- Divergences: Flexible and Robust Measures of Similarities" by Andrzej Cichocki and Shun-ichi Amari (Entropy, 2010).
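Substituting the definition of $\log_\alpha$ makes the structure of this integral more explicit. The calculation below uses only the two definitions above and the fact that $p$ integrates to one; the symbols $\sigma_p^2$ and $\sigma_q^2$ for the variances of two Gaussians $p$ and $q$ are introduced here for illustration. \begin{equation} \text{KL}_{\alpha}(p\|q) = \frac{1}{\alpha(\alpha-1)} \left( \int_{-\infty}^{\infty} p(x)^{\alpha} q(x)^{1-\alpha} dx - 1 \right). \end{equation} For two Gaussians, the integrand $p^{\alpha} q^{1-\alpha}$ is the exponential of a quadratic in $x$ whose leading term is $-\frac{x^{2}}{2}\left(\frac{\alpha}{\sigma_p^{2}}+\frac{1-\alpha}{\sigma_q^{2}}\right)$, so the integral converges only if \begin{equation} \frac{\alpha}{\sigma_p^{2}}+\frac{1-\alpha}{\sigma_q^{2}} > 0. \end{equation} For $\alpha = 2$ this reads $\sigma_p^{2} < 2\sigma_q^{2}$, which is consistent with the restriction described in the motivation.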

The $\alpha$-exponential function

The KL divergence works so beautifully with exponential families because the logarithm is the inverse of the exponential function. However, once the logarithm is replaced by the $\alpha$-logarithm, it is no longer the inverse of the exponential function, which is why calculating the $\alpha$-divergence between Gaussians is so cumbersome. Thus, we are naturally led to introduce the inverse of the $\alpha$-logarithm, called the $\alpha$-exponential function, \begin{equation} \exp_{\alpha}(y) := \sqrt[\alpha-1]{1+(\alpha-1)y}. \end{equation} The key property that is retained is $\exp_{\alpha}(\log_{\alpha}(x)) = x.$ On the other hand, other nice properties are lost; e.g., in general $\exp_{\alpha}(x+y) \neq \exp_{\alpha}(x) \exp_{\alpha}(y)$. Moreover, the domain of the $\alpha$-exponential function depends on $\alpha$: the condition $1+(\alpha-1)y\geq0$ must be fulfilled for $\exp_{\alpha}(y)$ to return a real number. This domain restriction can be seen as a weakness but also as a strength: it makes it possible to define finitely-supported $\alpha$-exponential families.
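The three properties just mentioned (the inverse relation, the lost product rule, and the restricted domain) can be checked numerically. The following is a small sketch; the names `log_alpha` and `exp_alpha` and the choice to raise an exception outside the domain are assumptions of this example.

```python
import math

def log_alpha(x, alpha):
    """alpha-logarithm for x > 0 and alpha != 1."""
    return (x ** (alpha - 1.0) - 1.0) / (alpha - 1.0)

def exp_alpha(y, alpha):
    """alpha-exponential (1 + (alpha - 1) * y) ** (1 / (alpha - 1)), alpha != 1."""
    base = 1.0 + (alpha - 1.0) * y
    # A negative base has no real root; a zero base with a negative
    # exponent (alpha < 1) would be a division by zero.
    if base < 0 or (base == 0 and alpha < 1):
        raise ValueError("y is outside the domain of exp_alpha")
    return base ** (1.0 / (alpha - 1.0))

# 1) exp_alpha inverts log_alpha.
for alpha in (0.5, 2.0, 3.0):
    print(alpha, exp_alpha(log_alpha(1.7, alpha), alpha))

# 2) Sums are not mapped to products (here alpha = 2).
print(exp_alpha(0.3 + 0.4, 2.0), exp_alpha(0.3, 2.0) * exp_alpha(0.4, 2.0))

# 3) The domain depends on alpha: y = -2 is not allowed for alpha = 2.
try:
    exp_alpha(-2.0, 2.0)
except ValueError as err:
    print("domain error:", err)
```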

The $\alpha$-Gaussian distribution

By replacing the exponential function in the probability density of the normal distribution with the $\alpha$-exponential function, we obtain the $\alpha$-Gaussian probability density function \begin{equation} N_{\alpha}\left(x|\mu,\sigma^{2}\right):=\frac{1}{c_{\alpha}} \text{e}_{\alpha}^{-\frac{1}{2}\frac{\left(x-\mu\right)^{2}}{\sigma^{2}}} \end{equation} where $\text{e}_{\alpha}^{y}$ is shorthand for $\exp_{\alpha}(y)$ and the normalization constant is given by \begin{equation} c_{\alpha}:=\begin{cases} \sqrt{\frac{2\pi\sigma^{2}}{1-\alpha}} \frac{\Gamma\left(\frac{1}{2}\frac{1+\alpha}{1-\alpha}\right)} {\Gamma\left(\frac{1}{1-\alpha}\right)}, & \alpha\in(-1,1],\quad x\in\mathbb{R}, \newline \frac{\sqrt{2\pi\sigma^{2}}}{\left(\alpha-1\right)^{\frac{3}{2}}} \frac{\Gamma\left(\frac{1}{\alpha-1}\right)} {\Gamma\left(\frac{1}{2}+\frac{\alpha}{\alpha-1}\right)}, & \alpha>1,\quad x\in\left(\mu-\sigma\sqrt{\frac{2}{\alpha-1}}, \mu+\sigma\sqrt{\frac{2}{\alpha-1}}\right). \end{cases} \end{equation} Curiously, the support of the $\alpha$-Gaussian density depends on $\alpha$: for $\alpha > 1$, the support is finite, with half-width $\sigma\sqrt{2/(\alpha-1)}$ as dictated by the condition $1+(\alpha-1)y\geq0$; for $\alpha \in (-1, 1]$, it is the whole real line, with the case $\alpha = 1$ understood as the limit $\alpha \to 1$, which recovers the normal density; and for $\alpha \leq -1$, the support is still infinite but the density cannot be normalized.
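As a sanity check on the normalization constant, one can integrate the density numerically. The sketch below uses only the Python standard library and a plain midpoint rule; the function names, the integration ranges, and the number of grid points are assumptions of this example, not part of the derivation.

```python
import math

def exp_alpha(y, alpha):
    """alpha-exponential, set to 0 where 1 + (alpha - 1) * y <= 0."""
    if abs(alpha - 1.0) < 1e-12:
        return math.exp(y)
    base = 1.0 + (alpha - 1.0) * y
    return base ** (1.0 / (alpha - 1.0)) if base > 0 else 0.0

def norm_const(alpha, sigma2=1.0):
    """Normalization constant c_alpha of the alpha-Gaussian."""
    if abs(alpha - 1.0) < 1e-12:
        return math.sqrt(2.0 * math.pi * sigma2)
    if -1.0 < alpha < 1.0:
        return (math.sqrt(2.0 * math.pi * sigma2 / (1.0 - alpha))
                * math.gamma(0.5 * (1.0 + alpha) / (1.0 - alpha))
                / math.gamma(1.0 / (1.0 - alpha)))
    if alpha > 1.0:
        return (math.sqrt(2.0 * math.pi * sigma2) / (alpha - 1.0) ** 1.5
                * math.gamma(1.0 / (alpha - 1.0))
                / math.gamma(0.5 + alpha / (alpha - 1.0)))
    raise ValueError("the density cannot be normalized for alpha <= -1")

def pdf(x, alpha, mu=0.0, sigma2=1.0):
    """alpha-Gaussian density N_alpha(x | mu, sigma^2)."""
    return exp_alpha(-0.5 * (x - mu) ** 2 / sigma2, alpha) / norm_const(alpha, sigma2)

def integrate(f, a, b, n=20_000):
    """Midpoint rule on [a, b] with n equal cells."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Heavy-tailed case: a wide range is needed to capture the tails.
print(integrate(lambda x: pdf(x, 0.5), -200.0, 200.0))
# Finitely-supported case: the density is zero outside the support.
print(integrate(lambda x: pdf(x, 2.0, mu=1.0, sigma2=4.0), -9.0, 11.0))
```

Both integrals should come out close to one, up to the truncation and discretization error of the crude quadrature.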

An intuitive way to understand the effect of $\alpha$ is to start at $\alpha \to \infty$ and see how decreasing $\alpha$ towards $\alpha \to -\infty$ changes the probability density function. For very large $\alpha$, the density $N_\alpha\left(x|\mu,\sigma^2\right)$ is entirely concentrated around the mean $\mu$; the parameter $\sigma$ only rescales the density together with its support $|x-\mu| < \sigma\sqrt{2/(\alpha-1)}$, which shrinks to the point $\mu$ for any fixed $\sigma$. When we decrease $\alpha$ towards $\alpha = 1$, the density starts spreading out, growing towards the normal density, which it matches in the limit $\alpha \to 1$. If we keep decreasing $\alpha$, the tails of the $\alpha$-Gaussian density function become heavier, dominating the Gaussian tails; as $\alpha \to -1$, the tails become so heavy that the total mass diverges and the function can no longer be normalized. From that point on, decreasing $\alpha$ further makes the function look more and more like a constant function, matching it in the limit $\alpha \to -\infty$. Thus, by changing $\alpha$ we control the tails of the distribution.
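This progression can be observed numerically by looking at the height of the unnormalized function $\exp_{\alpha}\left(-\frac{1}{2}x^{2}\right)$ at a fixed point, here $x = 3$ with $\mu = 0$ and $\sigma = 1$. Working with the unnormalized function (which equals one at the peak) is a choice made for this sketch so that values $\alpha \leq -1$ can be included; the function names are likewise illustrative.

```python
import math

def exp_alpha(y, alpha):
    """alpha-exponential, set to 0 where 1 + (alpha - 1) * y <= 0."""
    if abs(alpha - 1.0) < 1e-12:
        return math.exp(y)
    base = 1.0 + (alpha - 1.0) * y
    return base ** (1.0 / (alpha - 1.0)) if base > 0 else 0.0

def relative_height(x, alpha, mu=0.0, sigma2=1.0):
    """Unnormalized alpha-Gaussian: equals 1 at x = mu."""
    return exp_alpha(-0.5 * (x - mu) ** 2 / sigma2, alpha)

# Decreasing alpha: from finite support, through the Gaussian, to heavy tails.
alphas = [3.0, 2.0, 1.5, 1.0, 0.5, 0.0, -1.0, -3.0]
heights = [relative_height(3.0, a) for a in alphas]
for a, h in zip(alphas, heights):
    print(f"alpha = {a:5.1f}   height at x = 3: {h:.4f}")
```

For $\alpha > 1$ the point $x = 3$ lies outside the support in this example, so the height is zero; as $\alpha$ decreases, the height grows and slowly approaches one, the value of the constant function.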

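For completeness, we can obtain the cumulative distribution function; for the standard case $\mu = 0$, $\sigma^{2} = 1$, it reads \begin{equation} F_{\alpha}(x):=\frac{1}{2}+\frac{x}{c_{\alpha}}\,_{2}F_{1} \left( \frac{1}{2},\frac{1}{1-\alpha};\frac{3}{2}; -\frac{\left(1-\alpha\right)}{2}x^{2} \right) \end{equation} where $\,_{2}F_{1}$ is the hypergeometric function and, for $\alpha > 1$, $x$ is restricted to the support. The term $\frac{1}{2}$ is the mass to the left of the mean, and the second term is the integral of the density from $0$ to $x$. One could use the cumulative distribution function, for example, to sample from the $\alpha$-Gaussian distribution via inverse transform sampling; however, inverting it is not straightforward.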
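Although no closed-form inverse is available, the cumulative distribution function can be inverted numerically. The sketch below is restricted to the standard case $\mu = 0$, $\sigma = 1$ with $\alpha > 1$, where the finite support provides a ready-made bracket for bisection; it evaluates the CDF by Simpson's rule instead of the hypergeometric function so that it needs only the standard library. All function names, the grid size, and the tolerance are assumptions of this example, and the heavy-tailed case $\alpha < 1$ would additionally need a growing bracket.

```python
import math
import random

def half_width(alpha):
    """Half-width of the support for alpha > 1 and sigma = 1."""
    return math.sqrt(2.0 / (alpha - 1.0))

def pdf(x, alpha):
    """Standard alpha-Gaussian density (mu = 0, sigma = 1), alpha > 1 only."""
    c = (math.sqrt(2.0 * math.pi) / (alpha - 1.0) ** 1.5
         * math.gamma(1.0 / (alpha - 1.0))
         / math.gamma(0.5 + alpha / (alpha - 1.0)))
    base = 1.0 - 0.5 * (alpha - 1.0) * x * x
    return base ** (1.0 / (alpha - 1.0)) / c if base > 0 else 0.0

def cdf(x, alpha, n=200):
    """F(x) = 1/2 + sign(x) * integral of the density over [0, |x|] (Simpson, n even)."""
    b = min(abs(x), half_width(alpha))  # no mass outside the support
    if b == 0.0:
        return 0.5
    h = b / n
    s = pdf(0.0, alpha) + pdf(b, alpha)
    for i in range(1, n):
        s += (4.0 if i % 2 else 2.0) * pdf(i * h, alpha)
    return 0.5 + math.copysign(s * h / 3.0, x)

def quantile(u, alpha, tol=1e-9):
    """Invert the CDF by bisection on the support."""
    lo, hi = -half_width(alpha), half_width(alpha)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid, alpha) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Inverse transform sampling: push uniform draws through the quantile function.
rng = random.Random(0)
samples = [quantile(rng.random(), 2.0) for _ in range(5)]
print(samples)
```

For $\alpha = 2$ the hypergeometric series terminates and the integral of the density from $0$ to $x$ is simply $(x - x^{3}/6)/c_{\alpha}$, which gives a convenient check of the numerical CDF. Each sample costs a few thousand density evaluations, so this is a demonstration rather than an efficient sampler.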

$\text{KL}_\alpha$ between $\alpha$-Gaussians

The motivation behind introducing the $\alpha$-Gaussian distribution was to simplify the computation of the $\alpha$-divergence. Unfortunately, since the $\alpha$-exponential function does not map sums to products, one cannot transform the ratio of densities into a single exponential of a difference, which could then be removed by a logarithm the way it is done in the KL divergence. Therefore, computing $\text{KL}_\alpha$ between $\alpha$-Gaussians is actually not easier than computing $\text{KL}_\alpha$ between usual Gaussians. Moreover, even computing the ordinary $\text{KL}$ divergence between $\alpha$-Gaussians analytically appears to be infeasible.
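The obstruction can be made explicit with a short calculation that follows directly from the definition of $\exp_\alpha$; the symbols $u$ and $v$ are introduced here for illustration and stand for the exponents of the two densities. Wherever both $\alpha$-exponentials are positive, \begin{equation} \exp_{\alpha}(u)\exp_{\alpha}(v)=\exp_{\alpha}\left(u+v+(\alpha-1)uv\right), \qquad \frac{\exp_{\alpha}(u)}{\exp_{\alpha}(v)}=\exp_{\alpha}\left(\frac{u-v}{1+(\alpha-1)v}\right), \end{equation} and therefore \begin{equation} \log_{\alpha}\left(\frac{\exp_{\alpha}(u)}{\exp_{\alpha}(v)}\right)=\frac{u-v}{1+(\alpha-1)v}. \end{equation} For $\alpha \to 1$ the right-hand side reduces to the familiar difference $u - v$. For $\alpha \neq 1$, with $u$ and $v$ quadratic in $x$, it is a rational function of $x$ instead of a polynomial, so its expectation is no longer a combination of the first two moments.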

Max $\alpha$-entropy distribution

Let us get back to the question posed at the very beginning: is the $\alpha$-Gaussian the maximum $\alpha$-entropy distribution, similar to how the Gaussian distribution is the maximum entropy distribution? The answer is unfortunately no, and furthermore, it seems to be intractable to analytically derive any $\alpha$-exponential family from the maximum $\alpha$-entropy principle. The reason lies again in the fact that even though we can write down the density in the form $p_\alpha(x) = \text{e}_{\alpha}^{-\eta-\lambda x-\nu x^{2}}$, we can neither normalize it nor evaluate the moments because $\alpha$-exponentiation doesn’t transform addition into multiplication.

Conclusion

Alas, the $\alpha$-Gaussian distribution didn’t help us to simplify the computation of the $\alpha$-divergence. However, we have seen that by varying $\alpha$ one can generate either heavy-tailed distributions or light-tailed finitely-supported distributions. The fact that the $2$-Gaussian is finitely-supported gives a hint as to why the $2$-KL between Gaussians is finite only on a subset of pairs of distributions, but it does not provide a direct explanation. Estimating the parameters of the $\alpha$-Gaussian distribution from data using maximum likelihood is probably intractable because the $\alpha$-logarithm does not turn a product into a sum. Sampling also does not appear to be straightforward. With all these disadvantages, the $\alpha$-Gaussian distribution is unlikely to become a useful tool in the statistical toolbox, but is rather doomed to remain a bizarre animal in the information geometry zoo.

Boris Belousov