Many faces of entropy

Given a finite measure $\pi$ absolutely continuous with respect to a $\sigma$-finite measure $\mu$, the entropy of $\pi$ under the measure $\mu$ is
$$H_\mu(\pi) = -\int_S \pi_\mu \log \pi_\mu \, d\mu,$$
where $\pi_\mu = \frac{d\pi}{d\mu}$ is the Radon-Nikodym derivative of $\pi$ with respect to $\mu$.

With this definition, the KL divergence $D(\pi \,\|\, \mu)$ from $\mu$ to $\pi$ equals minus the entropy of $\pi$ under $\mu$,

$$D(\pi \,\|\, \mu) = -H_\mu(\pi). \tag{1}$$
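To make this concrete, here is a small numerical sketch (my own illustration, not part of the original argument) with discrete measures on a three-point set, where the Radon-Nikodym derivative is just a ratio of point masses; the helper names `entropy_under` and `kl` are invented for the example.

```python
# Sketch: for discrete measures on a finite set, pi_mu = pi/mu pointwise, and
# H_mu(pi) = -sum_x pi_mu(x) log(pi_mu(x)) mu(x), so D(pi || mu) = -H_mu(pi).
import numpy as np

def entropy_under(pi, mu):
    """Entropy of pi under the base measure mu."""
    pi_mu = pi / mu                          # Radon-Nikodym derivative on a finite set
    return -np.sum(pi_mu * np.log(pi_mu) * mu)

def kl(pi, mu):
    """KL divergence D(pi || mu) = sum_x pi(x) log(pi(x)/mu(x))."""
    return np.sum(pi * np.log(pi / mu))

pi = np.array([0.2, 0.5, 0.3])
mu = np.array([0.4, 0.4, 0.2])
print(kl(pi, mu), -entropy_under(pi, mu))    # the two numbers coincide, as in (1)
```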

The differential entropy $h(\pi)$ is the entropy of $\pi$ under the Lebesgue measure $\lambda$,

$$h(\pi) = H_\lambda(\pi).$$
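As a sanity check (again my own sketch, not from the text), the differential entropy of a Gaussian density can be computed by quadrature as $-\int p \log p \, dx$ and compared with the closed form $\tfrac{1}{2}\log(2\pi e \sigma^2)$:

```python
# Sketch: h(pi) = H_lambda(pi) = -∫ p(x) log p(x) dx for a N(0, sigma^2) density,
# evaluated numerically and compared against 0.5 * log(2 * pi * e * sigma^2).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma = 1.7
dist = norm(loc=0.0, scale=sigma)

h_numeric, _ = quad(lambda x: -dist.pdf(x) * dist.logpdf(x), -np.inf, np.inf)
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h_numeric, h_closed)                   # both ≈ 1.95 nats
```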

The joint entropy $h(X, Y)$ equals the entropy of the joint distribution $\mu_{XY}$ over $X$ and $Y$ under the Lebesgue measure,

$$h(X, Y) = H_\lambda(\mu_{XY}).$$

The mutual information $I(X; Y)$ equals minus the entropy of the joint distribution $\mu_{XY}$ under the product measure $\mu_X \times \mu_Y$,

$$I(X; Y) = -H_{\mu_X \times \mu_Y}(\mu_{XY}).$$

The conditional entropy $h(Y \mid X)$ equals the entropy of $\mu_{XY}$ under the product measure $\mu_X \times \lambda$,

$$h(Y \mid X) = H_{\mu_X \times \lambda}(\mu_{XY}).$$
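The last three identities can all be checked on a small discrete example, where the counting measure plays the role of $\lambda$; this is my own sketch, reusing the `entropy_under` helper from the first snippet.

```python
# Sketch: a 2x2 joint table p_xy stands in for mu_XY; the counting measure
# (all ones) stands in for the Lebesgue measure lambda.
import numpy as np

def entropy_under(pi, mu):
    pi_mu = pi / mu
    return -np.sum(pi_mu * np.log(pi_mu) * mu)

p_xy = np.array([[0.10, 0.30],
                 [0.25, 0.35]])              # joint distribution mu_XY
p_x = p_xy.sum(axis=1, keepdims=True)        # marginal mu_X, shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)        # marginal mu_Y, shape (1, 2)
counting = np.ones_like(p_xy)                # counting measure in place of lambda

H_XY = entropy_under(p_xy, counting)                 # joint entropy H(X, Y)
I_XY = -entropy_under(p_xy, p_x * p_y)               # mutual information I(X; Y)
H_Y_given_X = entropy_under(p_xy, p_x * counting)    # conditional entropy H(Y | X)

# Consistency with the classical chain rules:
H_X = entropy_under(p_x, np.ones_like(p_x))
H_Y = entropy_under(p_y, np.ones_like(p_y))
print(np.isclose(H_XY, H_X + H_Y_given_X))           # H(X,Y) = H(X) + H(Y|X)
print(np.isclose(I_XY, H_Y - H_Y_given_X))           # I(X;Y) = H(Y) - H(Y|X)
```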

Such unification was actually the motivation behind Kullback and Leibler’s seminal paper on information and sufficiency.

Variation of measure

Formula (1) shows that we can just as well take the KL divergence as the basis for expressing all the other entropies. I find it easier to think in terms of distance (or divergence) rather than entropy, therefore I prefer to work with the KL. Let us denote $\mathrm{KL}(\pi \,\|\, \mu)$ by $D_\mu(\pi)$ to use the same notation as for the entropy $H_\mu(\pi)$; then
$$D_\mu(\pi) = \mathbb{E}_\mu\!\left[ f\!\left( \frac{d\pi}{d\mu} \right) \right]$$
with the generator function $f(x) = x \log x - (x - 1)$ corresponding to the KL divergence (see f-divergence for details).
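Since $\pi$ and $\mu$ here are probability measures, the extra $-(x - 1)$ term in the generator integrates to zero, so this definition gives the same number as the plain KL divergence; a quick check (my own, with invented names):

```python
# Sketch: E_mu[f(d pi/d mu)] with f(x) = x log x - (x - 1) equals
# sum_x pi(x) log(pi(x)/mu(x)) whenever pi and mu are probability measures.
import numpy as np

def f(x):
    return x * np.log(x) - (x - 1)           # KL generator

pi = np.array([0.2, 0.5, 0.3])
mu = np.array([0.4, 0.4, 0.2])

D_via_generator = np.sum(f(pi / mu) * mu)    # E_mu[f(d pi / d mu)]
D_plain = np.sum(pi * np.log(pi / mu))
print(np.isclose(D_via_generator, D_plain))  # True
```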

Let us see how $D_\mu(\pi)$ changes when we vary its arguments. Taking partial functional derivatives, we find
$$\frac{\delta D_\mu(\pi)}{\delta \pi(dx)} = f'\!\left(\frac{\pi(dx)}{\mu(dx)}\right), \qquad
\frac{\delta D_\mu(\pi)}{\delta \mu(dx)} = f\!\left(\frac{\pi(dx)}{\mu(dx)}\right) - \frac{\pi(dx)}{\mu(dx)}\, f'\!\left(\frac{\pi(dx)}{\mu(dx)}\right).$$
Note that since the arguments of $D$ are measures, its partial derivatives take the same kind of argument as measures do, i.e., they take measurable subsets (which I denoted by $dx$) as input.

If we denote the Radon-Nikodym derivatives by $\pi_\mu = \frac{d\pi}{d\mu}$ and $\mu_\pi = \frac{d\mu}{d\pi}$, we can also write
$$\frac{\delta D_\mu(\pi)}{\delta \pi_\mu(x)} = f'(\pi_\mu(x)), \qquad
\frac{\delta D_\mu(\pi)}{\delta \mu_\pi(x)} = f(\pi_\mu(x)) - \pi_\mu(x)\, f'(\pi_\mu(x)).$$
For the $f$ corresponding to the KL divergence, these formulas are particularly simple:
$$\frac{\delta D_\mu(\pi)}{\delta \pi_\mu(x)} = \log \pi_\mu(x), \qquad
\frac{\delta D_\mu(\pi)}{\delta \mu_\pi(x)} = -\frac{1}{\mu_\pi(x)} + 1.$$
The fact that the derivative with respect to $\pi_\mu$ equals the logarithm is the defining property of the KL divergence. Basically, the KL generator $f$ is the antiderivative of the logarithm, and that is where all the magic stems from. If you minimize a KL divergence (or maximize an entropy), you end up with a logarithm after differentiating the objective, which leads you to an exponential family. And all because $f' = \log$. However, no such beautiful magic emerges from the other partial derivative.
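The first of these formulas is easy to verify with a finite-difference check (again my own sketch): perturb $\pi_\mu$ at a single point and compare the resulting change in $D_\mu(\pi)$ with $\log \pi_\mu$. Because the functional derivative is taken against the base measure $\mu$, the finite difference has to be divided by the mass that $\mu$ puts on the perturbed point.

```python
# Sketch: D_mu(pi) = sum_x f(pi_mu(x)) mu(x) on a finite set; perturbing pi_mu
# at index i changes D by roughly eps * f'(pi_mu[i]) * mu[i], so dividing the
# finite difference by eps * mu[i] should recover log(pi_mu[i]).
import numpy as np

def f(x):
    return x * np.log(x) - (x - 1)           # KL generator

mu = np.array([0.4, 0.4, 0.2])
pi_mu = np.array([0.2, 0.5, 0.3]) / mu       # Radon-Nikodym derivative d pi / d mu

def D(pi_mu):
    return np.sum(f(pi_mu) * mu)

i, eps = 1, 1e-6
pi_mu_pert = pi_mu.copy()
pi_mu_pert[i] += eps

finite_diff = (D(pi_mu_pert) - D(pi_mu)) / (eps * mu[i])
predicted = np.log(pi_mu[i])                 # delta D / delta pi_mu = log pi_mu
print(finite_diff, predicted)                # both ≈ log(1.25) ≈ 0.223
```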