# KL minimization vs maximum likelihood estimation

23 Apr 2017

Given samples $x_{1:N}$ from a distribution $\pi$, we consider the empirical distribution $\tilde{\pi}$ with density

$$\tilde{\pi}(x) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n),$$

where $\delta$ is the Dirac delta, as an approximation of the true distribution $\pi$; the more samples we have, the better the approximation.
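To illustrate this convergence, here is a minimal numerical sketch. The choices below (a standard normal $\pi$, the event $X \le 0.5$, the rounded value $\Phi(0.5) \approx 0.6915$, and the seed) are illustrative assumptions, not taken from the text: we estimate a probability under $\pi$ by the mass the empirical distribution assigns to the same event, for growing $N$.

```python
import numpy as np

rng = np.random.default_rng(0)

# pi is taken to be a standard normal (an illustrative choice).
# We estimate P(X <= 0.5) under pi by the mass the empirical distribution
# puts on (-inf, 0.5]: the fraction of samples falling below 0.5.
true_p = 0.6915  # Phi(0.5) to four decimals

errors = []
for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)  # samples x_{1:N} from pi
    errors.append(abs(np.mean(x <= 0.5) - true_p))

print(errors)  # the errors should shrink roughly like 1/sqrt(N)
```

The Monte Carlo error of such an estimate decays at the usual $O(1/\sqrt{N})$ rate, which is one concrete sense in which $\tilde{\pi}$ improves with more samples.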

To find a parametric distribution $\pi^\omega$ that best fits the samples,
it is reasonable to minimize the KL divergence between
the empirical distribution $\tilde{\pi}$
and our approximating parametric distribution $\pi^\omega$.
Denote densities with respect to the Lebesgue measure
by a subscript $\lambda$, i.e., $d\pi = \pi_\lambda \,d\lambda$.
It is easy to see that *minimization of the KL divergence*

$$\operatorname*{argmin}_\omega \mathrm{KL}\!\left(\tilde{\pi} \,\middle\|\, \pi^\omega\right) = \operatorname*{argmin}_\omega \int \tilde{\pi}_\lambda(x) \log \frac{\tilde{\pi}_\lambda(x)}{\pi^\omega_\lambda(x)} \,\mathrm{d}x$$

*is equivalent to maximization of the log-likelihood*

$$\operatorname*{argmax}_\omega \frac{1}{N} \sum_{n=1}^{N} \log \pi^\omega_\lambda(x_n),$$

since the entropy term $\int \tilde{\pi}_\lambda \log \tilde{\pi}_\lambda \,\mathrm{d}\lambda$ does not depend on $\omega$, while the cross term $-\int \tilde{\pi}_\lambda \log \pi^\omega_\lambda \,\mathrm{d}\lambda$ evaluates to $-\frac{1}{N} \sum_{n=1}^{N} \log \pi^\omega_\lambda(x_n)$.
The reason we consider the KL divergence $\mathrm{KL}(\tilde{\pi} \,\|\, \pi^\omega)$ and not the opposite direction is that $\tilde{\pi} \ll \pi^\omega$ but not the other way around: $\mathrm{KL}(\pi^\omega \,\|\, \tilde{\pi})$ would be infinite whenever $\pi^\omega$ puts mass outside the samples $x_{1:N}$.
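The equivalence above can be checked numerically. The sketch below is illustrative, not from the text: it fits a unit-variance Gaussian family $\pi^\omega = \mathcal{N}(\mu, 1)$ (so $\omega = \mu$) by maximizing the average log-likelihood over a grid; since the MLE for this family has the closed form $\mu^* = \bar{x}$, the grid maximizer should land next to the sample mean. The data-generating parameters, grid, and seed are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)  # samples x_{1:N} from pi

# Average log-likelihood (1/N) sum_n log pi^omega_lambda(x_n) for the
# unit-variance Gaussian family pi^omega = N(mu, 1); here omega = mu.
def avg_log_likelihood(mu):
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

# Maximize over a grid of candidate means. The closed-form MLE for this
# family is the sample mean, so the grid maximizer should sit beside it.
grid = np.linspace(0.0, 4.0, 4001)  # spacing 0.001
mu_hat = grid[np.argmax([avg_log_likelihood(mu) for mu in grid])]
print(mu_hat, x.mean())
```

The agreement (up to the grid resolution) is exactly the KL-to-MLE equivalence at work: the grid search minimizes $\mathrm{KL}(\tilde{\pi} \,\|\, \pi^\omega)$ without ever computing a KL divergence.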