Many faces of entropy
16 Apr 2017
Given a finite measure $\nu$ absolutely continuous with respect
to a $\sigma$-finite measure $\mu$,
the entropy of $\nu$ under the measure $\mu$ is

$$H_\mu(\nu) = -\int \log \frac{d\nu}{d\mu} \, d\nu,$$

where $\frac{d\nu}{d\mu}$ is the Radon-Nikodym derivative
of $\nu$ with respect to $\mu$.
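For example, if $\mu$ is the counting measure on a finite set, the Radon-Nikodym derivative is simply $\frac{d\nu}{d\mu}(x) = \nu(\{x\})$ and the definition reduces to the familiar Shannon entropy,

$$H_\mu(\nu) = -\sum_x \nu(\{x\}) \log \nu(\{x\}).$$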
With this definition, the KL divergence $\operatorname{KL}(\nu \,\|\, \mu)$
equals minus the entropy of $\nu$ under $\mu$,

$$\operatorname{KL}(\nu \,\|\, \mu) = \int \log \frac{d\nu}{d\mu} \, d\nu = -H_\mu(\nu).$$
The differential entropy $h(X)$ is the entropy of the distribution $P_X$ of a random variable $X$ under the
Lebesgue measure $\lambda$,

$$h(X) = H_\lambda(P_X) = -\int p(x) \log p(x) \, dx.$$

The joint entropy $h(X, Y)$ equals the entropy of the joint distribution $P_{X,Y}$
over $X$ and $Y$ under the Lebesgue measure,

$$h(X, Y) = H_\lambda(P_{X,Y}) = -\iint p(x, y) \log p(x, y) \, dx \, dy.$$

The mutual information $I(X; Y)$ equals minus the entropy of the joint
distribution under the product measure $P_X \otimes P_Y$,

$$I(X; Y) = -H_{P_X \otimes P_Y}(P_{X,Y}) = \operatorname{KL}(P_{X,Y} \,\|\, P_X \otimes P_Y).$$

The conditional entropy $h(X \mid Y)$ equals the entropy of $P_{X,Y}$
under the product measure $\lambda \otimes P_Y$,

$$h(X \mid Y) = H_{\lambda \otimes P_Y}(P_{X,Y}) = -\iint p(x, y) \log p(x \mid y) \, dx \, dy.$$
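To make the unification concrete, here is a quick numerical sanity check in Python (just a sketch; the counting measure plays the role of the Lebesgue measure, and the joint table is an arbitrary example). It verifies that the mutual information computed as $\operatorname{KL}(P_{X,Y} \,\|\, P_X \otimes P_Y)$ matches $H(X) + H(Y) - H(X, Y)$, and that the conditional entropy computed as the entropy of the joint under the product measure matches $H(X, Y) - H(Y)$.

```python
import numpy as np

# A small joint pmf p(x, y) over {0, 1, 2} x {0, 1}; any strictly positive
# table summing to 1 will do -- this particular one is just an example.
p_xy = np.array([[0.10, 0.20],
                 [0.25, 0.05],
                 [0.15, 0.25]])

p_x = p_xy.sum(axis=1)  # marginal of X
p_y = p_xy.sum(axis=0)  # marginal of Y

def entropy(p):
    """Shannon entropy -sum p log p (counting measure in place of Lebesgue)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q):
    """KL(p || q) = sum p log(p / q) for discrete distributions."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Mutual information as a KL between the joint and the product of marginals,
# versus the textbook H(X) + H(Y) - H(X, Y).
mi_as_kl = kl(p_xy.ravel(), np.outer(p_x, p_y).ravel())
mi_classic = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# Conditional entropy as the entropy of the joint under the product measure
# (counting measure in x) x P_Y, i.e. -sum p(x, y) log p(x | y),
# versus the textbook H(X, Y) - H(Y).
cond_as_entropy = -np.sum(p_xy * np.log(p_xy / p_y))  # p_y broadcasts over rows
cond_classic = entropy(p_xy.ravel()) - entropy(p_y)

print(np.isclose(mi_as_kl, mi_classic))          # True
print(np.isclose(cond_as_entropy, cond_classic)) # True
```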
Such a unification was actually the motivation behind Kullback and Leibler’s
seminal paper on information and sufficiency.
Variation of measure
The formula above shows that we can just as well take the KL as the basis for
expressing all the other entropies. I find it easier to think in terms of
a distance (or divergence) rather than an entropy, therefore I prefer to work
with the KL. Let us denote the divergence by $D_\mu(\nu)$,
to use the same notation as for the entropy $H_\mu(\nu)$; then

$$D_\mu(\nu) = \int f\!\left(\frac{d\nu}{d\mu}\right) d\mu,$$

with the generator function

$$f(t) = t \log t$$

corresponding to the KL divergence (see f-divergence for details).
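Indeed, plugging this generator in recovers the KL divergence written above:

$$D_\mu(\nu) = \int \frac{d\nu}{d\mu} \log\frac{d\nu}{d\mu} \, d\mu = \int \log\frac{d\nu}{d\mu} \, d\nu = \operatorname{KL}(\nu \,\|\, \mu) = -H_\mu(\nu).$$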
Let us see how $D_\mu(\nu)$ changes when we vary its arguments.
Taking partial functional derivatives, we find

$$\frac{\partial D_\mu(\nu)}{\partial \nu}(A) = \int_A f'\!\left(\frac{d\nu}{d\mu}\right) d\mu, \qquad \frac{\partial D_\mu(\nu)}{\partial \mu}(A) = \int_A \left[ f\!\left(\frac{d\nu}{d\mu}\right) - \frac{d\nu}{d\mu}\, f'\!\left(\frac{d\nu}{d\mu}\right) \right] d\mu.$$

Note that since the arguments of $D_\mu(\nu)$ are measures, its partial derivatives
take the same argument as measures do, i.e.,
they take measurable subsets (which I denoted by $A$) as input.
If we denote the Radon-Nikodym derivative $\frac{d\nu}{d\mu}$ by $r$, we can also write

$$\frac{\partial D_\mu(\nu)}{\partial \nu}(A) = \int_A f'(r)\, d\mu, \qquad \frac{\partial D_\mu(\nu)}{\partial \mu}(A) = \int_A \big( f(r) - r\, f'(r) \big)\, d\mu.$$

For the generator $f(t) = t \log t$ corresponding to the KL divergence, these formulas are
particularly simple:

$$\frac{\partial D_\mu(\nu)}{\partial \nu}(A) = \int_A \big( \log r + 1 \big)\, d\mu, \qquad \frac{\partial D_\mu(\nu)}{\partial \mu}(A) = -\int_A r\, d\mu = -\nu(A).$$
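Here is a quick sketch of where the general expressions come from, reading the derivative at $A$ as the directional derivative along $\mu$ restricted to $A$ (denoted $\mu|_A$), which is the convention behind the formulas above. Perturbing $\nu$ by $\varepsilon\,\mu|_A$ changes the density $r$ to $r + \varepsilon\,\mathbf{1}_A$, so

$$D_\mu(\nu + \varepsilon\,\mu|_A) = \int f(r + \varepsilon\,\mathbf{1}_A)\, d\mu = D_\mu(\nu) + \varepsilon \int_A f'(r)\, d\mu + O(\varepsilon^2),$$

while perturbing $\mu$ by $\varepsilon\,\mu|_A$ changes the density to $r / (1 + \varepsilon\,\mathbf{1}_A)$ and rescales the integrating measure, so

$$D_{\mu + \varepsilon\,\mu|_A}(\nu) = \int f\!\left(\frac{r}{1 + \varepsilon\,\mathbf{1}_A}\right)(1 + \varepsilon\,\mathbf{1}_A)\, d\mu = D_\mu(\nu) + \varepsilon \int_A \big( f(r) - r\, f'(r) \big)\, d\mu + O(\varepsilon^2).$$

Reading off the first-order terms gives exactly the two partial derivatives above.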
The fact that the derivative with respect to $\nu$
equals, up to an additive constant, the logarithm $\log r$ is the defining property of the KL divergence.
Basically, the KL generator is the antiderivative of the logarithm (again up to a linear term),
and that is where all the magic stems from.
If you minimize a KL (or maximize entropy),
you end up with a logarithm after differentiating the objective,
which leads you to an exponential family.
And all because $(t \log t)' = \log t + 1$.
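To spell the magic out with a standard maximum-entropy calculation (a sketch, with hypothetical sufficient statistics $T_i$ and multipliers $\lambda_i$): maximizing the differential entropy of a density $p$ subject to moment constraints $\int p\, T_i \, dx = c_i$ and normalization gives the stationarity condition

$$\frac{\partial}{\partial p(x)} \left[ -\int p \log p \, dx + \sum_i \lambda_i \left( \int p\, T_i \, dx - c_i \right) + \lambda_0 \left( \int p \, dx - 1 \right) \right] = -\log p(x) - 1 + \sum_i \lambda_i T_i(x) + \lambda_0 = 0,$$

so $p(x) \propto \exp\big( \sum_i \lambda_i T_i(x) \big)$: an exponential family, with the logarithm coming precisely from differentiating $p \log p$.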
However, no such beautiful magic emerges from the other partial derivative.