Geodesic distance between probability distributions is not the KL divergence

Ever wondered how to measure the distance between probability distributions? The Statistical distance article on Wikipedia gives more than 10 different definitions. Why so many, and which one should you use? In this post, we will try to develop some geometric intuition to bring order into this diversity.

For simplicity, let’s consider a discrete probability distribution over two elements, $p = [p_1, p_2]$. We can view $p$ as a vector in $\mathbb{R}^2$ lying on the standard simplex $\Delta^1$. We could then measure the distance between distributions as the distance between points on the simplex. One should, however, be careful about which metric tensor to use on that space.

Metric induced from the sphere

Consider a sphere $x_1^2 + x_2^2 = 1$ embedded in $\mathbb{R}^2$. The non-linear transformation $p_i = x_i^2$ turns the sphere into the simplex $p_1 + p_2 = 1$. The distance on the sphere is the arc length. Question: what is the distance on the simplex after such a transformation? Following the Fisher metric article, let’s denote the Euclidean metric by $h = dx_1^2 + dx_2^2$. Substituting $x_i = \sqrt{p_i}$, we obtain

$$h = \left(d\sqrt{p_1}\right)^2 + \left(d\sqrt{p_2}\right)^2 = \frac{1}{4}\frac{dp_1^2}{p_1} + \frac{1}{4}\frac{dp_2^2}{p_2} = \frac{1}{4}\sum_i \frac{dp_i^2}{p_i}. \tag{1}$$

Note that $dp_1 = -dp_2$ and $p_2 = 1 - p_1$; therefore $h$ simplifies to

$$h = \frac{1}{4}\frac{dp_1^2}{p_1(1-p_1)}.$$

The Fisher metric is usually defined without the 1/4 coefficient, i.e.,

$$h^{\mathrm{fisher}}_p = 4 h_p,$$

where subscript p denotes the point at which the metric tensor is computed.
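To make (1) concrete, here is a minimal finite-difference check (a sketch assuming NumPy; the point $p$ and the perturbation $dp$ are arbitrary illustrative values):

```python
import numpy as np

# Finite-difference check of equation (1): the squared Euclidean distance
# between nearby points x = sqrt(p) and x' = sqrt(p + dp) on the sphere
# should match h = (1/4) * sum_i dp_i^2 / p_i on the simplex.
p = np.array([0.3, 0.7])
dp = np.array([1e-4, -1e-4])        # perturbation that stays on the simplex

x, x_new = np.sqrt(p), np.sqrt(p + dp)
sq_dist_on_sphere = np.sum((x_new - x) ** 2)
h = 0.25 * np.sum(dp ** 2 / p)

print(sq_dist_on_sphere, h)         # both are approximately 1.19e-08
```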

Geodesic distance between distributions

If we have two distributions $p, q \in \Delta^1$, the distance between them is the length of the geodesic,

$$d(p,q) = \int_{p_1}^{q_1} \frac{1}{2}\frac{dt}{\sqrt{t(1-t)}} = i\ln\left(\sqrt{t} + i\sqrt{1-t}\right)\bigg|_{p_1}^{q_1}. \tag{2}$$

This formula assumes $q_1 \geq p_1$; otherwise, the limits of integration need to be swapped according to the properties of the line integral. To avoid worrying about the order of the limits, one could just as well take the absolute value of the right-hand side. Upon a brief reflection, one may recognize the arccos function,

$$d(p,q) = -\arccos\sqrt{t}\,\Big|_{p_1}^{q_1} = \arccos\sqrt{p_1} - \arccos\sqrt{q_1}.$$

This formula, however, only works in 2D. We could derive a more general one if we stayed a little bit longer in the complex plane. Namely, substituting the limits directly in (2),

$$d(p,q) = -i\ln\left(\frac{\sqrt{p_1} + i\sqrt{p_2}}{\sqrt{q_1} + i\sqrt{q_2}}\right) = -i\ln\left(\sqrt{p_1 q_1} + \sqrt{p_2 q_2} - i\left(\sqrt{p_1 q_2} - \sqrt{p_2 q_1}\right)\right),$$

and recognizing the arccos function afterwards, we see that

$$d(p,q) = \arccos\left(\sqrt{p_1 q_1} + \sqrt{p_2 q_2}\right).$$

This formula, in contrast to (2), can be readily generalized to any finite-dimensional distributions. Denoting the standard scalar product by $\langle x, y \rangle$, the geodesic distance between distributions $p$ and $q$ can be written as

$$d(p,q) = \arccos\langle\sqrt{p}, \sqrt{q}\rangle. \tag{3}$$

This is nothing else but the arc length on a unit sphere. To be honest, we could have stated this result immediately, since $p_i = x_i^2$ and we defined the distance on the sphere by the arc length, but it is instructive to derive it from the differentials directly.
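Formula (3) is straightforward to compute. Below is a minimal sketch (assuming NumPy; the clipping only guards against rounding errors in the inner product):

```python
import numpy as np

def geodesic_distance(p, q):
    """Geodesic (Fisher) distance (3) between two discrete distributions."""
    # Clip to [-1, 1] so that rounding errors do not push arccos out of range.
    return np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))

print(geodesic_distance(np.array([0.3, 0.7]), np.array([0.6, 0.4])))  # ~0.31
print(geodesic_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # pi/2
```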

Fisher metric is the Hessian of the KL divergence

How does the KL divergence relate to the geodesic distance? Infinitesimally, the KL looks like the Fisher metric, as was shown by Kullback in Information Theory and Statistics,

$$\mathrm{KL}(q + dq \,\|\, q) = \sum_i (q_i + dq_i)\ln\frac{q_i + dq_i}{q_i} \approx \frac{1}{2}\sum_i \frac{dq_i^2}{q_i},$$

where we recognize the diagonal Hessian $H_{ii} = 1/q_i$. Comparing it with (1), we see that the KL divergence $\mathrm{KL}(p \,\|\, q)$ measures the squared geodesic length in the vicinity of $q$,

$$\mathrm{KL}(q + dq \,\|\, q) = 2 h_q(dq, dq) + O(dq^3).$$

As $p \to q$, $\mathrm{KL}(p \,\|\, q) \approx 2 d^2(p, q)$, which was pointed out by Kass and Vos in Geometrical Foundations of Asymptotic Inference. However, globally, the KL divergence

$$\mathrm{KL}(p \,\|\, q) = \sum_i p_i \ln\frac{p_i}{q_i} \tag{4}$$

is different from the geodesic distance (3). Thus, the KL divergence is only a local approximation of the squared geodesic distance induced by the Fisher metric. It is therefore interesting to investigate how these measures of distance differ when $p$ and $q$ are far apart.
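The local agreement is easy to verify numerically. A minimal sketch (assuming NumPy; $q$ and $dq$ are arbitrary illustrative values):

```python
import numpy as np

# Check that KL(q + dq || q) ≈ 2 * h_q(dq, dq) for a small perturbation dq.
q = np.array([0.2, 0.8])
dq = np.array([1e-3, -1e-3])        # keeps q + dq on the simplex
p = q + dq

kl = np.sum(p * np.log(p / q))
h_q = 0.25 * np.sum(dq ** 2 / q)

print(kl, 2 * h_q)                  # ≈ 3.12e-06 vs 3.125e-06
```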

*Get rid of the proportionality constant

As a side note, if one were to define the KL divergence as

$$\mathrm{kl}(p \,\|\, q) = \sum_i p_i \ln\sqrt{\frac{p_i}{q_i}} = \frac{1}{2}\,\mathrm{KL}(p \,\|\, q),$$

then there would be no proportionality constant between $\mathrm{kl}(q + dq \,\|\, q)$ and $h_q(dq, dq)$.

Another curious observation is that the transformation $q_i = e^{z_i}$ allows one to express the kl as a differential quadratic form

$$\mathrm{kl}(q + dq \,\|\, q) \approx \frac{1}{4}\sum_i q_i\,(d\ln q_i)^2 = \frac{1}{4}\sum_i e^{z_i}\, dz_i^2,$$

in the variables $z_i = \ln q_i$.
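A quick numerical sanity check of this side note (a sketch assuming NumPy; the values of $q$ and $dq$ are illustrative):

```python
import numpy as np

# With kl = KL / 2, the quadratic form (1/4) * sum_i e^{z_i} dz_i^2 in the
# variables z_i = ln q_i matches kl(q + dq || q) for a small perturbation.
q = np.array([0.2, 0.8])
dq = np.array([1e-3, -1e-3])

kl = 0.5 * np.sum((q + dq) * np.log((q + dq) / q))

z, dz = np.log(q), dq / q           # z_i = ln q_i, hence dz_i = dq_i / q_i
quad_form = 0.25 * np.sum(np.exp(z) * dz ** 2)

print(kl, quad_form)                # ≈ 1.56e-06 vs 1.5625e-06
```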

Geodesic distance vs KL divergence

The animation below shows the KL divergence $\mathrm{KL}(p \,\|\, q)$ and the doubled squared Fisher distance $2 d^2(p, q)$ between distributions $p = [p_0, p_1]$ and $q = [q_0, q_1]$ as functions of $p_0$ for different values of $q_0$.

[Animation: KL vs Fisher distance]

KL is a good approximation of the geodesic distance except for orthogonal distributions.

For intermediate values of $p_0$ and $q_0$, the KL divergence approximates the geodesic distance very well even if the distributions are far apart. The most noticeable difference between the KL and the Fisher distance is that the KL tends towards infinity for orthogonal distributions (i.e., when $q_0 \to 0$ and $p_0 \to 1$), whereas the Fisher distance tends towards a finite value.
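The blow-up near orthogonality is easy to reproduce. A minimal sketch (assuming NumPy; the particular values of $p_0$ and $q_0$ are illustrative):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def fisher(p, q):                   # geodesic distance (3)
    return np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))

# As q0 -> 0 with p0 close to 1, KL blows up while 2 d^2 stays below pi^2 / 2.
p = np.array([0.99, 0.01])
for q0 in [0.3, 0.1, 1e-3, 1e-6, 1e-9]:
    q = np.array([q0, 1.0 - q0])
    print(f"q0={q0:<7} KL={kl(p, q):8.3f}  2d^2={2 * fisher(p, q) ** 2:.3f}")
```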

Distance between Gaussians

The same reasoning can be applied to continuous distributions. Let’s try it on 1D Gaussians. The KL divergence between univariate Gaussians $p(x) = \mathcal{N}(x \mid \mu, 1)$ and $q(x) = \mathcal{N}(x \mid \nu, 1)$ equals

$$\mathrm{KL}(p \,\|\, q) = \frac{(\mu - \nu)^2}{2}.$$

To calculate the Fisher distance (3), we need to evaluate the integral

$$\langle\sqrt{p}, \sqrt{q}\rangle = \frac{1}{\sqrt{2\pi}}\int e^{-\frac{(x-\mu)^2 + (x-\nu)^2}{4}}\, dx = e^{-\frac{(\mu-\nu)^2}{8}}.$$

The geodesic distance is then given by the arccos of it,

$$d(p,q) = \arccos e^{-\frac{(\mu-\nu)^2}{8}},$$

which is a more complicated function of the parameters $\mu$ and $\nu$ than the KL. Nevertheless, the KL divergence is a good upper bound on the doubled squared geodesic distance $2 d^2(p, q)$, especially when the distributions are close, as the figure below shows.

[Figure: KL vs Fisher distance between Gaussians]

The distance between Gaussians is well approximated by the KL divergence when the distributions are close.

As with discrete distributions, once the Gaussians are far apart, the KL grows unbounded, whereas the geodesic distance levels off. Thus, the KL divergence exaggerates the distance. For example, the KL keeps growing as we increase the distance between $\mu$ and $\nu$ despite there being almost no overlap between the densities. The geodesic distance, on the other hand, saturates after some point and barely changes no matter how much farther apart the Gaussians move.
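The growth-versus-saturation behaviour can be reproduced in a few lines (a sketch assuming NumPy; the mean separations are illustrative):

```python
import numpy as np

def kl_gauss(mu, nu):               # KL between N(mu, 1) and N(nu, 1)
    return (mu - nu) ** 2 / 2

def fisher_gauss(mu, nu):           # geodesic distance between the same pair
    return np.arccos(np.exp(-(mu - nu) ** 2 / 8))

# KL keeps growing quadratically, while 2 d^2 saturates near pi^2 / 2 ≈ 4.93.
for delta in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    print(f"|mu-nu|={delta:<4} KL={kl_gauss(0.0, delta):7.3f}"
          f"  2d^2={2 * fisher_gauss(0.0, delta) ** 2:.3f}")
```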

What consequences does quadratic growth versus saturation have in practice? In terms of optimization, the KL is preferable, as it always gives a non-vanishing gradient no matter how far apart the distributions are. However, the KL grows quadratically with the distance between the means of the Gaussians, which may not make physical sense. Indeed, if we have samples from two Gaussians, they may lie in non-intersecting intervals of the real line. If we treat distance as a measure of distinguishability of two objects, then it should not matter how far apart the Gaussians are once samples from them do not overlap.

*Hellinger distance

The Hellinger distance

$$d_{\mathrm{hell}}(p, q) = \left\lVert \sqrt{p} - \sqrt{q} \right\rVert_2$$

returns the Euclidean distance between two points on a sphere instead of the arc length. Not surprisingly, it locally approximates the geodesic distance, because a tiny arc is indistinguishable from a line segment. However, globally, the Hellinger distance underestimates the geodesic distance, as is clear from the geometrical interpretation.
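For completeness, a small comparison of the chord and the arc (a sketch assuming NumPy; the two distributions are arbitrary illustrative values):

```python
import numpy as np

p = np.array([0.3, 0.7])
q = np.array([0.9, 0.1])

hellinger = np.linalg.norm(np.sqrt(p) - np.sqrt(q))               # chord length
geodesic = np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))  # arc length

print(hellinger, geodesic)          # the chord is always shorter than the arc
```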
