Many faces of entropy

Given a finite measure $\pi$ absolutely continuous with respect to a $\sigma$-finite measure $\mu$, the entropy of $\pi$ under measure $\mu$ is $$ H_\mu(\pi) = -\int_S \pi_\mu \log \pi_\mu \,d\mu, $$ where $\pi_\mu$ is the Radon-Nikodym derivative of $\pi$ with respect to $\mu$.

With this definition, the KL divergence $D(\pi \| \mu)$ from $\mu$ to $\pi$ is the negative of the entropy of $\pi$ under $\mu$,

\begin{equation} \label{kl} D(\pi \| \mu) = -H_\mu(\pi). \end{equation}
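As a quick sanity check of \eqref{kl}, here is a minimal sketch on a finite space, where measures are just lists of positive weights and the Radon-Nikodym derivative is the pointwise ratio. The function names and the example vectors `pi` and `mu` are my own illustration, not part of the definitions above.

```python
import math

def entropy_under(pi, mu):
    """H_mu(pi) = -sum_i mu_i * r_i * log(r_i), with r_i = pi_i / mu_i."""
    total = 0.0
    for p, m in zip(pi, mu):
        if p > 0:  # convention: 0 * log 0 = 0
            r = p / m  # Radon-Nikodym derivative at this point
            total -= m * r * math.log(r)
    return total

def kl(pi, mu):
    """D(pi || mu) = sum_i pi_i * log(pi_i / mu_i)."""
    return sum(p * math.log(p / m) for p, m in zip(pi, mu) if p > 0)

# Illustrative probability vectors on a three-point space.
pi = [0.5, 0.3, 0.2]
mu = [0.2, 0.2, 0.6]

kl_value = kl(pi, mu)
entropy_value = entropy_under(pi, mu)

# Taking mu to be the counting measure should recover Shannon entropy.
shannon = -sum(p * math.log(p) for p in pi)
entropy_counting = entropy_under(pi, [1.0, 1.0, 1.0])

print(kl_value, -entropy_value)
print(shannon, entropy_counting)
```

The two numbers on each printed line should agree up to rounding.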

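The differential entropy $h(\pi)$ is the entropy of $\pi$ under the Lebesgue measure $\lambda$,

\begin{equation*} h(\pi) = H_\lambda(\pi). \end{equation*}

The joint entropy $h(X, Y)$ equals the entropy of the joint distribution $\mu_{XY}$ over $X$ and $Y$ under the Lebesgue measure,

\begin{equation*} h(X, Y) = H_\lambda(\mu_{XY}). \end{equation*}

The mutual information $I(X; Y)$ is the negative of the entropy of the joint distribution $\mu_{XY}$ under the product measure $\mu_X \times \mu_Y$,

\begin{equation*} I(X; Y) = -H_{\mu_X \times \mu_Y}(\mu_{XY}). \end{equation*}

The conditional entropy $h(Y|X)$ equals the entropy of $\mu_{XY}$ under the product measure $\mu_X \times \lambda$,

\begin{equation*} h(Y|X) = H_{\mu_X \times \lambda}(\mu_{XY}). \end{equation*}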

This kind of unification was in fact the motivation behind Kullback and Leibler’s seminal paper on information and sufficiency.
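The identities above can be checked numerically in a discrete analogue. The sketch below assumes a finite joint distribution and replaces the Lebesgue measure $\lambda$ with the counting measure, so it illustrates the pattern rather than the continuous case itself. The joint table and all names are hypothetical.

```python
import math

def entropy_under(pi, mu):
    """H_mu(pi) on a finite space; pi and mu are dicts over the same points."""
    total = 0.0
    for key, p in pi.items():
        if p > 0:  # convention: 0 * log 0 = 0
            r = p / mu[key]  # density of pi with respect to mu
            total -= mu[key] * r * math.log(r)
    return total

# Illustrative joint distribution of two binary variables X and Y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Reference measures on the product space.
counting = {(x, y): 1.0 for x in xs for y in ys}            # stands in for lambda
product = {(x, y): p_x[x] * p_y[y] for x in xs for y in ys}  # mu_X x mu_Y
mixed = {(x, y): p_x[x] for x in xs for y in ys}             # mu_X x counting

joint_entropy = entropy_under(joint, counting)
mutual_information = -entropy_under(joint, product)
conditional_entropy = entropy_under(joint, mixed)

# Marginal entropies under the counting measure, for cross-checking.
entropy_x = entropy_under(p_x, {x: 1.0 for x in xs})
entropy_y = entropy_under(p_y, {y: 1.0 for y in ys})

print(conditional_entropy, joint_entropy - entropy_x)
print(mutual_information, entropy_x + entropy_y - joint_entropy)
```

Each printed pair compares the "entropy under a measure" form with the textbook identity (the chain rule and $I = H(X) + H(Y) - H(X, Y)$), so the two entries should match up to rounding.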

Variation of measure

Formula \eqref{kl} shows that we can just as well take the KL divergence as the basis for expressing all other entropies. I find it easier to think in terms of distance (or divergence) rather than entropy, so I prefer to work with the KL. Let’s denote $D(\pi \| \mu)$ by $D_\mu(\pi)$, mirroring the notation $H_\mu(\pi)$ for the entropy. Then \begin{equation*} \newcommand{\E}{\mathbb{E}} D_\mu(\pi) = \E_\mu \left[ f\left( \frac{d\pi}{d\mu} \right) \right] \end{equation*} with the generator function $f(x) = x\log x - (x - 1)$ corresponding to the KL divergence (see f-divergence for details).
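To see what the extra $-(x - 1)$ term in the generator does, here is a small sketch on a finite space (names and numbers are illustrative). For probability vectors the term integrates to zero and the $f$-divergence form coincides with the usual KL sum; for unnormalized measures the two differ exactly by the difference in total mass.

```python
import math

def f_kl(x):
    """KL generator f(x) = x log x - (x - 1), extended by its limit f(0) = 1."""
    return x * math.log(x) - (x - 1.0) if x > 0 else 1.0

def f_divergence(pi, mu, f):
    """E_mu[f(d pi / d mu)] = sum_i mu_i * f(pi_i / mu_i)."""
    return sum(m * f(p / m) for p, m in zip(pi, mu))

def kl_sum(pi, mu):
    """Plain sum_i pi_i * log(pi_i / mu_i), with no mass correction."""
    return sum(p * math.log(p / m) for p, m in zip(pi, mu) if p > 0)

mu = [0.2, 0.2, 0.6]

# Case 1: pi is a probability vector, so the two expressions agree.
pi_prob = [0.5, 0.3, 0.2]
d_prob = f_divergence(pi_prob, mu, f_kl)
kl_prob = kl_sum(pi_prob, mu)

# Case 2: pi has total mass 2, so they differ by sum(pi) - sum(mu).
pi_unnorm = [1.0, 0.6, 0.4]
d_unnorm = f_divergence(pi_unnorm, mu, f_kl)
kl_unnorm = kl_sum(pi_unnorm, mu)
mass_gap = sum(pi_unnorm) - sum(mu)

print(d_prob, kl_prob)
print(d_unnorm, kl_unnorm - mass_gap)
```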

Let us see how $D_\mu(\pi)$ changes when we vary its arguments. Taking partial functional derivatives, we find \begin{align*} \frac{\delta D_\mu(\pi)}{\delta \pi} (dx) &= f^\prime \left( \frac{\pi(dx)}{\mu(dx)} \right), \\ \frac{\delta D_\mu(\pi)}{\delta \mu} (dx) &= f \left( \frac{\pi(dx)}{\mu(dx)} \right) - \frac{\pi(dx)}{\mu(dx)} f^\prime \left( \frac{\pi(dx)}{\mu(dx)} \right). \end{align*} Note that since the arguments of $D$ are measures, its partial derivatives take the same kind of argument as measures do, i.e., measurable subsets (which I denote by $dx$).
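On a finite space these functional derivatives reduce to ordinary partial derivatives of $\sum_i \mu_i f(\pi_i / \mu_i)$, which makes them easy to check by finite differences. The sketch below is only such a check, with illustrative vectors and a hypothetical helper `finite_difference`. Note that perturbing a single coordinate breaks normalization, so `pi` and `mu` are treated as general finite measures here.

```python
import math

def f(x):
    """KL generator f(x) = x log x - (x - 1)."""
    return x * math.log(x) - (x - 1.0)

def f_prime(x):
    """Derivative of the KL generator."""
    return math.log(x)

def divergence(pi, mu):
    """D_mu(pi) = sum_i mu_i * f(pi_i / mu_i)."""
    return sum(m * f(p / m) for p, m in zip(pi, mu))

def finite_difference(func, vec, i, eps=1e-6):
    """Central difference of func with respect to coordinate i of vec."""
    up = list(vec)
    down = list(vec)
    up[i] += eps
    down[i] -= eps
    return (func(up) - func(down)) / (2.0 * eps)

pi = [0.5, 0.3, 0.2]
mu = [0.2, 0.2, 0.6]
n = len(pi)

# Numerical partial derivatives in each argument.
num_d_pi = [finite_difference(lambda v: divergence(v, mu), pi, i) for i in range(n)]
num_d_mu = [finite_difference(lambda v: divergence(pi, v), mu, i) for i in range(n)]

# Analytic expressions from the formulas above, with r = pi_i / mu_i.
ratios = [p / m for p, m in zip(pi, mu)]
ana_d_pi = [f_prime(r) for r in ratios]
ana_d_mu = [f(r) - r * f_prime(r) for r in ratios]

for i in range(n):
    print(num_d_pi[i], ana_d_pi[i], num_d_mu[i], ana_d_mu[i])
```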

As before, let $\pi_\mu$ denote the Radon-Nikodym derivative of $\pi$ with respect to $\mu$, and let $\mu_\pi = 1/\pi_\mu$ denote that of $\mu$ with respect to $\pi$ (when it exists). Then we can also write \begin{align*} \frac{\delta D_\mu(\pi)}{\delta \pi_\mu} (x) &= f^\prime (\pi_\mu(x)), \\ \frac{\delta D_\mu(\pi)}{\delta \mu_\pi} (x) &= f (\pi_\mu(x)) - \pi_\mu(x) f^\prime (\pi_\mu(x)). \end{align*} For $f$ corresponding to the KL divergence, these formulas are particularly simple: \begin{align*} \frac{\delta D_\mu(\pi)}{\delta \pi_\mu} (x) &= \log (\pi_\mu(x)), \\ \frac{\delta D_\mu(\pi)}{\delta \mu_\pi} (x) &= -\frac{1}{\mu_\pi(x)} + 1. \end{align*} The second line is just $f(x) - x f^\prime(x) = 1 - x$ evaluated at $x = \pi_\mu(x) = 1/\mu_\pi(x)$.

The fact that the derivative with respect to $\pi_\mu$ equals the logarithm is the defining property of the KL divergence. Basically, the KL generator $f$ is the antiderivative of the logarithm (the one with $f(1) = 0$), and that is where all the magic stems from. If you minimize a KL (or maximize entropy), you end up with a logarithm after differentiating the objective, which leads you to an exponential family. All because $f^\prime = \log$. No such beautiful magic emerges from the other partial derivative.
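To make the exponential-family remark concrete, here is a hedged sketch of one standard instance: minimizing $D_\mu(\pi)$ over probability vectors on a finite space subject to a single moment constraint $\sum_i \pi_i T_i = t$. Setting the derivative $\log \pi_\mu$ equal to a Lagrangian term that is affine in $T$ gives $\pi_i \propto \mu_i e^{\theta T_i}$. The statistic `stat`, the target `target`, and the bisection solver are my own illustrative choices, not something fixed by the text above.

```python
import math

def kl(pi, mu):
    """D(pi || mu) = sum_i pi_i * log(pi_i / mu_i)."""
    return sum(p * math.log(p / m) for p, m in zip(pi, mu) if p > 0)

def tilt(mu, stat, theta):
    """Exponentially tilted distribution pi_i proportional to mu_i * exp(theta * T_i)."""
    weights = [m * math.exp(theta * s) for m, s in zip(mu, stat)]
    z = sum(weights)
    return [w / z for w in weights]

def mean(pi, stat):
    """Expectation of the statistic T under pi."""
    return sum(p * s for p, s in zip(pi, stat))

def solve_theta(mu, stat, target, lo=-50.0, hi=50.0, iters=200):
    """Bisection on theta; the tilted mean is increasing in theta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(tilt(mu, stat, mid), stat) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mu = [0.2, 0.2, 0.6]
stat = [0.0, 1.0, 2.0]   # hypothetical sufficient statistic T
target = 0.8             # hypothetical moment constraint E_pi[T] = 0.8

theta = solve_theta(mu, stat, target)
pi_star = tilt(mu, stat, theta)

# log(pi*/mu) should be affine in T: equal increments for equally spaced T.
log_ratio = [math.log(p / m) for p, m in zip(pi_star, mu)]
affine_gap = (log_ratio[2] - log_ratio[1]) - (log_ratio[1] - log_ratio[0])

# The direction d keeps both total mass and E[T] fixed, so pi* + eps*d is feasible.
d = [1.0, -2.0, 1.0]
eps = 0.01
pi_plus = [p + eps * di for p, di in zip(pi_star, d)]
pi_minus = [p - eps * di for p, di in zip(pi_star, d)]

print(theta, pi_star)
print(kl(pi_star, mu), kl(pi_plus, mu), kl(pi_minus, mu))
```

Moving away from `pi_star` along a feasible direction should increase the KL in both directions, which is what one expects from a constrained minimizer of a strictly convex objective.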