Geodesic distance between probability distributions is not the KL divergence
11 Jul 2017

Ever wondered how to measure distance between probability distributions? The Statistical distance article on Wikipedia gives more than 10 different definitions. Why so many, and which one to use? In this post, we will try to develop some geometric intuition to bring order into this diversity.
For simplicity, let’s consider a discrete probability distribution $p = (p_1, p_2)$ over two elements, with $p_1 + p_2 = 1$ and $p_i \geq 0$.
Metric induced from the sphere
Consider a sphere $x_1^2 + x_2^2 = 1$. The substitution $p_i = x_i^2$ maps every distribution $p = (p_1, p_2)$ to a point on the positive quadrant of this sphere. Note that the Euclidean line element of the sphere, rewritten in the coordinates $p_i$, becomes
$$ ds^2 = \sum_i dx_i^2 = \frac{1}{4} \sum_i \frac{dp_i^2}{p_i}. $$
The Fisher metric is usually defined without the $1/4$ factor,
$$ ds_F^2 = \sum_i \frac{dp_i^2}{p_i}, $$
where the subscript $F$ stands for Fisher.
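As a quick numerical sanity check (a sketch of my own, assuming NumPy, not code from the original post), we can verify that the squared line element on the sphere is one quarter of the Fisher line element:

import numpy as np

# Check numerically that the Euclidean line element on the sphere, sum(dx_i^2),
# equals one quarter of the Fisher line element sum(dp_i^2 / p_i).
p = np.array([0.3, 0.7])
dp = 1e-4 * np.array([1.0, -1.0])   # perturbation that keeps p_1 + p_2 = 1

x, x_new = np.sqrt(p), np.sqrt(p + dp)
ds2_sphere = np.sum((x_new - x)**2)
ds2_fisher = np.sum(dp**2 / p)
print(ds2_sphere, ds2_fisher / 4)   # the two numbers agree up to higher-order terms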
Geodesic distance between distributions
If we have two distributions $p$ and $q$, they correspond to two points on the sphere with angles $\varphi_p = \arccos\sqrt{p_1}$ and $\varphi_q = \arccos\sqrt{q_1}$, and the geodesic distance between them is simply the difference of the angles, $d(p, q) = |\varphi_p - \varphi_q|$. This formula assumes that both angles lie in the first quadrant, $\varphi \in [0, \pi/2]$, which is guaranteed by $p_i, q_i \geq 0$.
This formula, however, only works in 2D. We could derive a more general one
if we stayed a little bit longer in the complex plane.
Namely, substituting the limits directly into the arc-length integral and recognizing the cosine of the angle difference as the scalar product of the two points on the sphere, we obtain
$$ d(p, q) = \arccos \sum_i \sqrt{p_i q_i}. $$
This formula, in contrast to the 2D one above, works in any number of dimensions.
This is nothing else but the arc length on a unit sphere.
To be honest, we could have stated this result immediately, since the geodesic distance on a unit sphere is just the central angle between the two points, and the cosine of that angle is the scalar product $\sum_i \sqrt{p_i q_i}$.
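The formula is easy to check numerically; here is a minimal sketch (my own, assuming NumPy) that also compares it against the 2D angle-difference form:

import numpy as np

def geodesic_distance(p, q):
    # Arc length on the unit sphere between the points sqrt(p) and sqrt(q).
    return np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))

p = np.array([0.3, 0.7])
q = np.array([0.8, 0.2])

d_general = geodesic_distance(p, q)
d_2d = abs(np.arccos(np.sqrt(p[0])) - np.arccos(np.sqrt(q[0])))
print(d_general, d_2d)   # both are approximately 0.53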
Fisher metric is the Hessian of the KL divergence
How does the KL divergence relate to the geodesic distance? Infinitesimally, the KL looks like the Fisher metric, as was shown by Kullback in Information Theory and Statistics:
$$ KL(p \,\|\, p + dp) = \sum_i p_i \log \frac{p_i}{p_i + dp_i} \approx \frac{1}{2} \sum_i \frac{dp_i^2}{p_i} \quad (\text{using } \textstyle\sum_i dp_i = 0), $$
where we recognize the diagonal Hessian with entries $1/p_i$, i.e., the Fisher metric. As this second-order approximation is only valid locally, the KL between distributions that are far apart is different from the geodesic distance.
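A small numeric check (again a sketch of my own, assuming NumPy) shows that the ratio of the KL to the quadratic form tends to one as the perturbation shrinks:

import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.3, 0.7])
v = np.array([1.0, -1.0])   # perturbation direction that keeps the probabilities normalized

for eps in [1e-1, 1e-2, 1e-3]:
    dp = eps * v
    quadratic = 0.5 * np.sum(dp**2 / p)   # (1/2) * Fisher quadratic form
    print(kl(p, p + dp) / quadratic)      # tends to 1 as eps -> 0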
*Get rid of the proportionality constant
As a side note, if one would define the KL divergence as $2 \sum_i p_i \log \frac{p_i}{q_i}$, then there would be no proportionality constant between the KL and the squared line element of the Fisher metric. Another curious observation is that the transformation $p_i = x_i^2$ makes the Hessian constant: in the variables $x_i$, the quadratic approximation of the KL reads simply $2 \sum_i dx_i^2$.
Geodesic distance vs KL divergence
The animation below shows the KL divergence side by side with the geodesic distance as the two distributions are gradually pulled apart.

KL is a good approximation of the geodesic distance except for orthogonal distributions.
For intermediate values of the overlap, the two measures already differ noticeably, and for nearly orthogonal distributions the KL blows up, whereas the geodesic distance saturates at $\pi/2$.
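The same behaviour can be traced without the animation; here is a small script (a sketch of my own, not the original animation code) that prints both quantities as one distribution approaches a degenerate one:

import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def geodesic_distance(p, q):
    return np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))

p = np.array([0.99, 0.01])
for q1 in [0.99, 0.9, 0.5, 0.1, 1e-6]:
    q = np.array([q1, 1.0 - q1])
    print(q1, kl(p, q), geodesic_distance(p, q))
# The KL blows up as q becomes (almost) orthogonal to p,
# while the geodesic distance levels off just below pi/2.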
Distance between Gaussians
The same reasoning can be applied to continuous distributions.
Let’s try it on 1D Gaussians.
The KL divergence between univariate Gaussians $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$ has a simple closed form,
$$ KL = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}. $$
To calculate the Fisher distance, we need the Fisher metric of the Gaussian family: in the coordinates $(\mu, \sigma)$ it reads $ds^2 = \frac{d\mu^2 + 2\,d\sigma^2}{\sigma^2}$, which is, up to a rescaling of $\mu$, the metric of the hyperbolic half-plane. The geodesic distance is then given by the corresponding hyperbolic distance,
$$ d\big(\mathcal{N}(\mu_1, \sigma_1^2), \mathcal{N}(\mu_2, \sigma_2^2)\big) = \sqrt{2}\, \operatorname{arccosh}\!\left( 1 + \frac{(\mu_1 - \mu_2)^2/2 + (\sigma_1 - \sigma_2)^2}{2 \sigma_1 \sigma_2} \right), $$
which is a more complicated function of the parameters than the KL.
Distance between Gaussians is well approximated by the KL divergence when distributions are close.
Similarly to the discrete case, once the Gaussians are far apart, the KL grows unbounded (quadratically in the separation of the means), whereas the geodesic distance grows only logarithmically. Thus, the KL divergence exaggerates the distance.
For example, the KL keeps growing quadratically as we increase the distance between the means, even when the two Gaussians hardly overlap any more.
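Both closed forms are easy to compare numerically; the sketch below (my own, assuming NumPy, using the arccosh form of the geodesic distance given above) illustrates the different growth rates:

import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    # KL( N(mu1, s1^2) || N(mu2, s2^2) )
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def fisher_rao_gauss(mu1, s1, mu2, s2):
    # Geodesic (Fisher-Rao) distance between two univariate Gaussians.
    num = (mu1 - mu2)**2 / 2 + (s1 - s2)**2
    return np.sqrt(2) * np.arccosh(1 + num / (2 * s1 * s2))

for mu2 in [0.1, 1.0, 5.0, 20.0]:
    print(mu2, kl_gauss(0.0, 1.0, mu2, 1.0), fisher_rao_gauss(0.0, 1.0, mu2, 1.0))
# The KL grows quadratically with the mean offset,
# while the geodesic distance grows only logarithmically.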
What consequences does this difference in growth have in practice? In terms of optimization, the KL is preferable, as it always gives a non-vanishing gradient no matter how far apart the distributions are. However, the KL grows together with the magnitude of its gradient, so distributions that are very far apart can produce disproportionately large updates.
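To make the gradient argument concrete, here is a finite-difference comparison (again a sketch of my own, reusing the closed forms from the previous snippet):

import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def fisher_rao_gauss(mu1, s1, mu2, s2):
    num = (mu1 - mu2)**2 / 2 + (s1 - s2)**2
    return np.sqrt(2) * np.arccosh(1 + num / (2 * s1 * s2))

def grad_mu2(f, mu2, h=1e-5):
    # Finite-difference derivative with respect to the mean of the second Gaussian.
    return (f(0.0, 1.0, mu2 + h, 1.0) - f(0.0, 1.0, mu2 - h, 1.0)) / (2 * h)

for mu2 in [0.5, 5.0, 50.0]:
    print(mu2, grad_mu2(kl_gauss, mu2), grad_mu2(fisher_rao_gauss, mu2))
# The KL gradient grows linearly with the mean offset,
# while the geodesic-distance gradient decays towards zero.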
*Hellinger distance
The Hellinger distance, $H(p, q) = \sqrt{\sum_i \big(\sqrt{p_i} - \sqrt{q_i}\big)^2}$ (often defined with an extra factor of $1/\sqrt{2}$), returns the Euclidean distance between two points on a sphere instead of the arc length. Not surprisingly, it locally approximates the geodesic distance, because a tiny arc is indistinguishable from a line segment. However, globally, the Hellinger distance underestimates the geodesic distance, as is clear from the geometrical interpretation.
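A final numeric illustration (my own sketch, using the convention without the $1/\sqrt{2}$ factor so that the Hellinger distance equals the chord length on the unit sphere):

import numpy as np

def geodesic_distance(p, q):
    return np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))

def hellinger(p, q):
    # Chord length between sqrt(p) and sqrt(q) on the unit sphere.
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q))

p = np.array([0.5, 0.5])
for q1 in [0.51, 0.7, 0.99, 1e-12]:
    q = np.array([q1, 1.0 - q1])
    print(q1, hellinger(p, q), geodesic_distance(p, q))
# For nearby distributions the two numbers coincide; for distant ones the
# chord (Hellinger) is shorter than the arc (geodesic distance).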