Fisher metric vs KL-divergence

16 Oct 2016

Let $P$ and $Q$ be probability measures over a set $X$, and let $P$ be absolutely continuous with respect to $Q$. If $\mu$ is any measure on $X$ for which the densities $\displaystyle p = \frac{\mathrm{d}P}{\mathrm{d}\mu}$ and $\displaystyle q = \frac{\mathrm{d}Q}{\mathrm{d}\mu}$ exist, then the Kullback-Leibler divergence from $Q$ to $P$ is given by

$\begin{equation*} D_{\mathrm{KL}}\left(P \middle\| Q\right) = \int_X p \log \frac{p}{q} \,\mathrm{d}\mu. \end{equation*}$
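When $X$ is a finite set and $\mu$ is the counting measure, the integral reduces to a sum over outcomes. The following is a minimal sketch under that assumption; the function name `kl_divergence` and the example probabilities are illustrative and not taken from the text.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete densities p, q w.r.t. the counting measure."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf  # P is not absolutely continuous w.r.t. Q
        total += pi * math.log(pi / qi)
    return total

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(f"D_KL(P||Q) = {d_pq:.6f}, D_KL(Q||P) = {d_qp:.6f}")
```

The two printed values differ, which is a reminder that the divergence is not symmetric in its arguments.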

Let the density $q = q(x, \theta)$ be parameterized by a vector $\theta$, and let $p$ be a small variation of $q$, i.e., $p = q + \delta q$, where $\displaystyle \delta q = \frac{\partial q}{\partial \theta_m} \delta \theta_m$ to first order in $\delta \theta$, with summation over repeated indices implied. Then

$\begin{equation*} \begin{split} D_{\mathrm{KL}}\left(P \middle\| Q\right) &= \int_X p \log \frac{p}{q} \mathrm{d}\mu = \int_X (q + \delta q) \log \frac{q + \delta q}{q} \mathrm{d}\mu \\ &= \int_X q \log (1 + \frac{\delta q}{q}) \mathrm{d}\mu + \int_X \delta q \log (1 + \frac{\delta q}{q}) \mathrm{d}\mu \\ &\approx \int_X q \left(\frac{\delta q}{q} - \frac{(\delta q)^2}{2q^2}\right) \mathrm{d}\mu + \int_X \delta q \frac{\delta q}{q} \mathrm{d}\mu \\ &= \int_X \delta q \mathrm{d}\mu + \frac{1}{2} \int_X q \frac{(\delta q)^2}{q^2} \mathrm{d}\mu \\ &= \delta \theta_m \frac{\partial}{\partial \theta_m} \int_X q \mathrm{d}\mu + \frac{1}{2} \delta \theta_k \delta \theta_j \int_X q \left(\frac{1}{q} \frac{\partial q}{\partial \theta_k}\right) \left(\frac{1}{q} \frac{\partial q}{\partial \theta_j}\right) \mathrm{d}\mu \\ &= 0 + \frac{1}{2} \delta \theta_k \delta \theta_j \mathrm{E} \left\{ \frac{\partial \log q}{\partial \theta_k} \frac{\partial \log q}{\partial \theta_j} \right\} \\ &= \frac{1}{2} \delta \theta_k \delta \theta_j g_{jk}(\theta), \end{split} \end{equation*}$

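where we used the expansion $\log(1 + x) \approx x - x^2/2$, kept terms up to second order in $\delta q$, used $\int_X q \,\mathrm{d}\mu = 1$ to drop the first-order term, and recognized $g_{jk}$, the Fisher information metric,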

$\begin{equation*} g_{jk}(\theta) = \mathrm{E} \left\{ \frac{\partial \log q}{\partial \theta_k} \frac{\partial \log q}{\partial \theta_j} \right\}. \end{equation*}$
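The quadratic approximation can be sanity-checked numerically. The sketch below assumes a one-parameter Bernoulli family $q(x, \theta) = \theta^x (1 - \theta)^{1 - x}$, chosen here purely for illustration; the function names and the values of `theta0` and `delta` are hypothetical.

```python
import math

def bernoulli_kl(t, t0):
    """D_KL(Bernoulli(t) || Bernoulli(t0)), summed over x in {0, 1}."""
    return t * math.log(t / t0) + (1.0 - t) * math.log((1.0 - t) / (1.0 - t0))

def bernoulli_fisher(t):
    """E[(d log q / d theta)^2] for q(x, theta) = theta^x (1 - theta)^(1 - x)."""
    score_x1 = 1.0 / t           # d/dtheta log(theta), the score at x = 1
    score_x0 = -1.0 / (1.0 - t)  # d/dtheta log(1 - theta), the score at x = 0
    return t * score_x1**2 + (1.0 - t) * score_x0**2

theta0 = 0.3
for delta in (1e-1, 1e-2, 1e-3):
    exact = bernoulli_kl(theta0 + delta, theta0)
    quadratic = 0.5 * delta**2 * bernoulli_fisher(theta0)
    print(f"delta={delta:g}  KL={exact:.6e}  quadratic={quadratic:.6e}")
```

As `delta` shrinks, the two columns should agree to more and more digits, since the neglected terms are of third order in `delta`.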

Thus, the Fisher information metric is the Hessian (the matrix of second derivatives) of the Kullback-Leibler divergence with respect to $\theta$, evaluated at $\theta = \theta_0$,

$\begin{equation*} g_{jk}(\theta_0) = \left. \frac{\partial^2}{\partial \theta_k \partial \theta_j} D_{\mathrm{KL}}\left(Q(\theta) \middle\| Q(\theta_0)\right) \right|_{\theta = \theta_0}. \end{equation*}$
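To see the Hessian relation with more than one parameter, here is a sketch that assumes a univariate Gaussian family with $\theta = (\mu, \sigma)$. It uses the well-known closed-form divergence between two Gaussians and a central finite-difference Hessian; the step size `h` and the helper names are illustrative choices, not part of the derivation above.

```python
import math

def kl_gauss(theta, theta0):
    """Closed-form D_KL(N(mu, sigma^2) || N(mu0, sigma0^2))."""
    mu, sigma = theta
    mu0, sigma0 = theta0
    return (math.log(sigma0 / sigma)
            + (sigma**2 + (mu - mu0)**2) / (2.0 * sigma0**2)
            - 0.5)

def kl_hessian(theta0, h=1e-3):
    """Central finite-difference Hessian of theta -> KL(Q(theta) || Q(theta0)) at theta0."""
    n = len(theta0)
    hess = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            total = 0.0
            # Four-point stencil; for j == k it reduces to a second difference with step 2h.
            for sj, sk in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
                theta = list(theta0)
                theta[j] += sj * h
                theta[k] += sk * h
                total += sj * sk * kl_gauss(theta, theta0)
            hess[j][k] = total / (4.0 * h * h)
    return hess

theta0 = (1.0, 2.0)  # hypothetical (mu0, sigma0)
g = kl_hessian(theta0)
sigma0 = theta0[1]
# Known Fisher metric of the Gaussian family in (mu, sigma) coordinates.
expected = [[1.0 / sigma0**2, 0.0], [0.0, 2.0 / sigma0**2]]
print("finite-difference Hessian:", g)
print("analytic Fisher metric:   ", expected)
```

The two matrices should agree up to the finite-difference error, which is small for this step size.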

Bonus: one prominent equality for the Fisher information

Let’s prove the following useful equality, where $p = p(x, \theta)$ now denotes a density parameterized by $\theta$:

$\begin{equation*} \mathrm{E} \frac{\partial \log p}{\partial \theta_k} \frac{\partial \log p}{\partial \theta_j} = -\mathrm{E} \frac{\partial^2 \log p}{\partial \theta_k \partial \theta_j}. \end{equation*}$

Consider the quantity inside the expectation on the right-hand side:

$\begin{equation*} \frac{\partial}{\partial \theta_k} \left( \frac{\partial \log p}{\partial \theta_j} \right) = \frac{\partial}{\partial \theta_k} \left( \frac{1}{p} \frac{\partial p}{\partial \theta_j} \right) = -\frac{1}{p^2} \frac{\partial p}{\partial \theta_k} \frac{\partial p}{\partial \theta_j} + \frac{1}{p} \frac{\partial^2 p}{\partial \theta_k \partial \theta_j}. \end{equation*}$

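Negate it and take the expectation, assuming that differentiation with respect to $\theta$ and integration over $X$ can be interchanged: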

$\begin{equation*} -\mathrm{E} \frac{\partial^2 \log p}{\partial \theta_k \partial \theta_j} = \int_X p \frac{\partial \log p}{\partial \theta_k} \frac{\partial \log p}{\partial \theta_j} \mathrm{d}\mu - \frac{\partial^2}{\partial \theta_k \partial \theta_j} \int_X p \mathrm{d}\mu. \end{equation*}$

The second term on the right equals zero because $\int_X p \,\mathrm{d}\mu = 1$ for every $\theta$, which concludes the proof. Derivations in this post closely follow the book by Kullback.
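The equality can also be checked numerically. The sketch below assumes a Poisson model with rate `lam`, for which $\log p = x \log \lambda - \lambda - \log x!$; the model choice, the truncation `n_terms`, and the function name are illustrative assumptions.

```python
import math

def poisson_identity(lam, n_terms=200):
    """Return (E[score^2], -E[d^2 log p / d lam^2]) for a Poisson(lam) model.

    Expectations are sums over x = 0 .. n_terms - 1; the neglected tail is
    negligible for moderate lam.
    """
    pmf = math.exp(-lam)  # p(0; lam)
    lhs = 0.0
    rhs = 0.0
    for x in range(n_terms):
        score = x / lam - 1.0     # first derivative of log p in lam
        curvature = -x / lam**2   # second derivative of log p in lam
        lhs += pmf * score**2
        rhs -= pmf * curvature
        pmf *= lam / (x + 1)      # p(x + 1; lam) from p(x; lam)
    return lhs, rhs

lhs, rhs = poisson_identity(2.5)
print(f"E[score^2] = {lhs:.10f}, -E[second derivative] = {rhs:.10f}")
```

Both sides should come out close to $1/\lambda$, the Fisher information of the Poisson family.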

Boris Belousov
