Fisher metric vs KL-divergence

Let $P$ and $Q$ be probability measures over a set $X$, and let $P$ be absolutely continuous with respect to $Q$. If $\mu$ is any measure on $X$ for which $\displaystyle p = \frac{\mathrm{d}P}{\mathrm{d}\mu}$ and $\displaystyle q = \frac{\mathrm{d}Q}{\mathrm{d}\mu}$ exist, then the Kullback-Leibler divergence from $Q$ to $P$ is given as

$$ \begin{equation*} D_{\mathrm{KL}}(P \| Q) = \int_X p \log \frac{p}{q} \, \mathrm{d}\mu. \end{equation*} $$

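For discrete distributions (with $\mu$ the counting measure) the integral is just a sum over outcomes. A minimal numerical sketch, with arbitrary example distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as arrays.

    Assumes P is absolutely continuous w.r.t. Q: wherever q[i] == 0,
    p[i] must be 0 as well, and those terms contribute nothing.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))  # positive, since P != Q
print(kl_divergence(p, p))  # exactly 0.0
```

The two printed values illustrate Gibbs' inequality: $D_{\mathrm{KL}} \ge 0$, with equality iff $P = Q$.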
Let the density $q = q(x, \theta)$ be parameterized by a vector $\theta$ and let $p$ be a variation of $q$, i.e., $p = q + \delta q$, where $\displaystyle \delta q = \frac{\partial q}{\partial \theta_m} \delta \theta_m$ (summation over repeated indices implied). Then, expanding the logarithm to second order in $\delta q$ and using $\int_X \delta q \, \mathrm{d}\mu = 0$ (both $p$ and $q$ integrate to one),

$$ \begin{equation*} D_{\mathrm{KL}}(P \| Q) = \int_X (q + \delta q) \log \frac{q + \delta q}{q} \, \mathrm{d}\mu \approx \int_X \frac{(\delta q)^2}{2 q} \, \mathrm{d}\mu = \frac{1}{2} \int_X \frac{1}{q} \frac{\partial q}{\partial \theta_j} \frac{\partial q}{\partial \theta_k} \, \mathrm{d}\mu \; \delta \theta_j \, \delta \theta_k, \end{equation*} $$

where we recognize $g_{jk}$, the Fisher information metric,

$$ \begin{equation*} g_{jk} = \int_X \frac{1}{q} \frac{\partial q}{\partial \theta_j} \frac{\partial q}{\partial \theta_k} \, \mathrm{d}\mu = \int_X q \, \frac{\partial \log q}{\partial \theta_j} \frac{\partial \log q}{\partial \theta_k} \, \mathrm{d}\mu = \mathbb{E}\left[ \frac{\partial \log q}{\partial \theta_j} \frac{\partial \log q}{\partial \theta_k} \right]. \end{equation*} $$

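As a concrete check of the expectation form, take a one-parameter Gaussian family $q(x, \theta) = \mathcal{N}(\theta, \sigma^2)$ with $\theta$ the mean, whose Fisher information is known in closed form to be $1/\sigma^2$. A sketch that evaluates $\int q \, (\partial_\theta \log q)^2 \, \mathrm{d}x$ by brute-force quadrature (the grid, step sizes, and $\sigma = 1.5$ are arbitrary choices):

```python
import numpy as np

sigma, theta = 1.5, 0.7   # arbitrary example parameters

def q(x, th):
    """Gaussian density N(th, sigma^2)."""
    return np.exp(-(x - th) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(-20.0, 20.0, 400_001)   # quadrature grid
dx = x[1] - x[0]
eps = 1e-5                               # finite-difference step in theta

# Score function d log q / d theta, via central differences.
score = (np.log(q(x, theta + eps)) - np.log(q(x, theta - eps))) / (2 * eps)

# g = E[(d log q / d theta)^2], integrating the expectation against q.
g = np.sum(q(x, theta) * score ** 2) * dx

print(g, 1 / sigma ** 2)  # both approximately 0.4444
```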
Thus, the Fisher information metric is the second derivative of the Kullback-Leibler divergence,

$$ \begin{equation*} g_{jk}(\theta_0) = \left.\frac{\partial^2}{\partial \theta_k \partial \theta_j} \right|_{\theta = \theta_0} D_{\mathrm{KL}}\left(Q(\theta) \middle\| Q(\theta_0)\right). \end{equation*} $$
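This relation is easy to probe numerically. For a Gaussian family $Q(\theta) = \mathcal{N}(\theta, \sigma^2)$ the Fisher information is $1/\sigma^2$, and a finite-difference second derivative of the divergence at $\theta_0$ should reproduce it. A sketch in which the grid, the step $h$, and $\sigma$ are arbitrary choices:

```python
import numpy as np

sigma, theta0 = 1.5, 0.0   # arbitrary example parameters

def q(x, th):
    """Gaussian density N(th, sigma^2)."""
    return np.exp(-(x - th) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x = np.linspace(-20.0, 20.0, 400_001)   # quadrature grid
dx = x[1] - x[0]

def kl(th):
    """D_KL(Q(th) || Q(theta0)) by numerical quadrature."""
    p, p0 = q(x, th), q(x, theta0)
    return np.sum(p * np.log(p / p0)) * dx

h = 1e-2
# Central second difference in theta, evaluated at theta = theta0.
g = (kl(theta0 + h) - 2 * kl(theta0) + kl(theta0 - h)) / h ** 2

print(g, 1 / sigma ** 2)  # the second derivative of KL reproduces the Fisher information
```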

Bonus: one prominent equality for the Fisher information

Let’s prove the following useful equality:

$$ \begin{equation*} \mathbb{E}\left[ \frac{\partial \log q}{\partial \theta_j} \frac{\partial \log q}{\partial \theta_k} \right] = -\mathbb{E}\left[ \frac{\partial^2 \log q}{\partial \theta_j \, \partial \theta_k} \right]. \end{equation*} $$

Consider the argument on the right-hand side:

$$ \begin{equation*} \frac{\partial^2 \log q}{\partial \theta_j \, \partial \theta_k} = \frac{\partial}{\partial \theta_j} \left( \frac{1}{q} \frac{\partial q}{\partial \theta_k} \right) = -\frac{1}{q^2} \frac{\partial q}{\partial \theta_j} \frac{\partial q}{\partial \theta_k} + \frac{1}{q} \frac{\partial^2 q}{\partial \theta_j \, \partial \theta_k} = -\frac{\partial \log q}{\partial \theta_j} \frac{\partial \log q}{\partial \theta_k} + \frac{1}{q} \frac{\partial^2 q}{\partial \theta_j \, \partial \theta_k}. \end{equation*} $$

Compute its expectation:

$$ \begin{equation*} \mathbb{E}\left[ \frac{\partial^2 \log q}{\partial \theta_j \, \partial \theta_k} \right] = -\mathbb{E}\left[ \frac{\partial \log q}{\partial \theta_j} \frac{\partial \log q}{\partial \theta_k} \right] + \int_X \frac{\partial^2 q}{\partial \theta_j \, \partial \theta_k} \, \mathrm{d}\mu. \end{equation*} $$

The second term on the right equals zero: since $\int_X q \, \mathrm{d}\mu = 1$, exchanging differentiation and integration gives $\displaystyle \int_X \frac{\partial^2 q}{\partial \theta_j \, \partial \theta_k} \, \mathrm{d}\mu = \frac{\partial^2}{\partial \theta_j \, \partial \theta_k} \int_X q \, \mathrm{d}\mu = 0$, which concludes the proof. Derivations in this post closely follow the book by Kullback.
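A quick Monte Carlo sanity check of this equality, using a Gaussian family $\mathcal{N}(\theta, \sigma^2)$ and finite differences in $\theta$; the sample size, seed, and parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, theta = 1.5, 0.7          # arbitrary example parameters

def log_q(x, th):
    """Log-density of N(th, sigma^2)."""
    return -(x - th) ** 2 / (2 * sigma ** 2) - 0.5 * np.log(2 * np.pi * sigma ** 2)

xs = rng.normal(theta, sigma, size=200_000)  # samples from q(x, theta)
eps = 1e-4

# Left side: E[(d log q / d theta)^2], score via central differences.
score = (log_q(xs, theta + eps) - log_q(xs, theta - eps)) / (2 * eps)
lhs = np.mean(score ** 2)

# Right side: -E[d^2 log q / d theta^2], via a second central difference.
second = (log_q(xs, theta + eps) - 2 * log_q(xs, theta) + log_q(xs, theta - eps)) / eps ** 2
rhs = -np.mean(second)

print(lhs, rhs)  # both approximately 1 / sigma^2
```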