# Reverse ELBO

24 Jul 2019

It is well known that for a latent variable model
\begin{equation}
\label{model}
p_\theta(x) = \int p_\theta(x|z)p_\theta(z) dz,
\end{equation}
the posterior $p_\theta(z|x)$ admits a *variational representation*
\begin{equation}
\newcommand{\E}{\mathrm{E}}
\newcommand{\KL}{\mathrm{KL}}
\newcommand{\ELBO}{\mathrm{ELBO}}
\newcommand{\OLEB}{\mathrm{OLEB}}
p_\theta(z|x) = \arg \max_{q(z|x)}
\left\{ \E_{q(z|x)}[\log p_\theta(x|z)] - \KL[q(z|x) \| p_\theta(z)] \right\}
\end{equation}
where the term in braces is the evidence lower bound, $\ELBO_\theta[q|x]$.
Moreover, at maximum, the ELBO yields the log-likelihood,
\begin{equation}
\label{log-lik}
\log p_\theta(x) = \max_{q} \, \ELBO_\theta[q|x].
\end{equation}
Thanks to \eqref{log-lik}, maximization of the ELBO
is equivalent to maximization of the log-likelihood $\log p_\theta(x)$.
At the same time, maximization of the log-likelihood
is equivalent to KL minimization, as explained in the post MLE = min KL.
Therefore, maximization of the ELBO \eqref{log-lik} is equivalent to KL minimization,
and in particular to the M-projection.
In this post, we find the quantity that plays the same role for the I-projection
as the ELBO plays for the M-projection; we call it the ‘reverse ELBO’.
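Before moving on, the variational representation above can be sanity-checked numerically. Below is a small sketch for a two-state discrete model (all numbers are illustrative, chosen only for this check):

```python
import numpy as np

# Toy two-state latent variable model (all numbers are illustrative):
# z in {0, 1} with prior p(z); we observe x = 1 with likelihood p(x=1|z).
prior = np.array([0.3, 0.7])   # p_theta(z)
lik = np.array([0.9, 0.2])     # p_theta(x=1 | z)

# Evidence (marginal likelihood) of the observation x = 1.
log_px = np.log(np.sum(lik * prior))

def elbo(q):
    """ELBO_theta[q|x] = E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    return np.sum(q * np.log(lik)) - np.sum(q * np.log(q / prior))

# The exact posterior maximizes the ELBO and attains log p_theta(x).
post = lik * prior / np.sum(lik * prior)
assert np.isclose(elbo(post), log_px)

# Any other q stays strictly below the evidence (hence 'lower bound').
assert elbo(np.array([0.5, 0.5])) < log_px
```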

## M-Projection

Let’s first refresh the connection between the M-projection and the ELBO.
Given a collection of i.i.d. samples $x_i, i=1,\dots,N,$
that form the empirical distribution $\hat{p}(x) = N^{-1}\sum_{i=1}^N \delta(x - x_i)$,
maximum likelihood estimation of the parameter $\theta$ in the generative model \eqref{model}
is equivalent to the M-projection
\begin{equation}
\label{mle}
\theta^* = \arg \min_\theta \, \KL(\hat{p}(x) \| p_\theta(x)).
\end{equation}
As shown in the post on EM as KL minimization,
the KL in \eqref{mle} is upper-bounded by another KL between joint distributions
\begin{equation}
\label{m-proj-kl-bound}
\KL(\hat{p}(x) \| p_\theta(x)) \leq
\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)), \; \forall q(z|x),
\end{equation}
which in turn equals the expected negated ELBO (see Where is the lower bound?),
\begin{equation}
\label{m-elbo}
\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z))
= \E_{\hat{p}(x)} [-\ELBO_\theta[q|x]]
= -\ELBO_\theta[q].
\end{equation}
Therefore, maximum likelihood estimation of $\theta$
is equivalent to maximization of the evidence lower bound,
\begin{equation}
\label{mle_elbo}
\theta^* = \arg \max_\theta \left\{ \E_{\hat{p}(x)} [\log p_\theta(x)] \right\}
= \arg \max_\theta \left\{ \max_q \, \ELBO_\theta[q] \right\}.
\end{equation}
In other words, *minimization of the KL \eqref{mle}
is equivalent to maximization of the ELBO \eqref{mle_elbo}*.
The distribution $p_{\theta^*}(x)$ obtained by minimizing the KL \eqref{mle}
is called the M-projection,
where M stands for ‘moment’: for exponential-family models,
$p_{\theta^*}(x)$ matches the moments (expected sufficient statistics) of $\hat{p}(x)$.
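The bound \eqref{m-proj-kl-bound} and its tightness at the exact posterior are easy to verify in a discrete model. The sketch below uses illustrative numbers (not taken from the post):

```python
import numpy as np

# Discrete toy check of the bound
# KL(p_hat(x) || p_theta(x)) <= KL(p_hat(x) q(z|x) || p_theta(x, z)),
# with equality at q(z|x) = p_theta(z|x). All numbers are illustrative.
prior = np.array([0.4, 0.6])            # p_theta(z)
px_given_z = np.array([[0.8, 0.2],      # p_theta(x | z=0)
                       [0.3, 0.7]])     # p_theta(x | z=1), rows z, cols x
p_hat = np.array([0.25, 0.75])          # stand-in for the empirical distribution

p_joint = prior[:, None] * px_given_z   # p_theta(x, z), indexed [z, x]
p_x = p_joint.sum(axis=0)               # marginal p_theta(x)
posterior = p_joint / p_x               # p_theta(z | x), indexed [z, x]

def kl_marginal():
    return np.sum(p_hat * np.log(p_hat / p_x))

def kl_joint(q):                        # q(z|x) indexed [z, x]
    pq = p_hat * q                      # p_hat(x) q(z|x)
    return np.sum(pq * np.log(pq / p_joint))

q_arbitrary = np.array([[0.5, 0.1],
                        [0.5, 0.9]])
assert kl_joint(q_arbitrary) >= kl_marginal()          # the bound holds
assert np.isclose(kl_joint(posterior), kl_marginal())  # tight at the posterior
```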

## I-Projection

Having seen the relation between the ELBO and the M-projection
in the previous section, let us now find the analog of the ELBO for the I-projection.
First, the assumptions on $\hat{p}(x)$ have to be adjusted
to avoid problems in the definition when $\hat{p}(x)$ contains delta functions.
For concreteness, assume $\hat{p}(x)$ to be an unnormalized density,
such that direct sampling from $\hat{p}(x)$ is impossible
but evaluation at any point is straightforward.
This setting corresponds to posterior inference [1].
The I-projection, also called the information projection, is given by
\begin{equation}
\label{i-proj}
\theta^* = \arg \min_\theta \, \KL(p_\theta(x) \| \hat{p}(x)).
\end{equation}
The distribution $p_{\theta^*}(x)$ obtained through minimization of \eqref{i-proj}
is often characterized as mode-seeking because it tends to concentrate the probability mass
on a single mode of $\hat{p}(x)$.
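The mode-seeking behaviour can be illustrated with a deliberately crude numerical experiment (a toy sketch, not the method of [1]): minimize the reverse KL of a single Gaussian against an unnormalized bimodal target by grid search.

```python
import numpy as np

# Mode-seeking illustration: minimize KL(p_theta || p_hat) over a single
# Gaussian p_theta = N(mu, sigma^2), where p_hat is an unnormalized mixture
# with modes at -3 and +3. Grid search is a crude sketch, not the method of [1].
xs = np.linspace(-10, 10, 4001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_hat = 2.0 * (gauss(xs, -3, 0.7) + gauss(xs, 3, 0.7))  # unnormalized target

def reverse_kl(mu, sigma):
    p = gauss(xs, mu, sigma)
    m = p > 0                       # skip underflowed tails: 0 * log 0 = 0
    # KL(p_theta || p_hat) up to the constant log-normalizer of p_hat.
    return np.sum(p[m] * np.log(p[m] / p_hat[m])) * dx

best = min(((mu, sigma) for mu in np.linspace(-5, 5, 41)
            for sigma in np.linspace(0.3, 3, 28)),
           key=lambda ms: reverse_kl(*ms))

# The minimizer locks onto one of the two modes with a narrow sigma,
# instead of spreading mass across both.
assert abs(abs(best[0]) - 3) < 0.5 and best[1] < 1.5
```

Note that the unknown normalizer of $\hat{p}(x)$ only shifts the objective by a constant, so it does not affect the minimizer.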
Conveniently, a straightforward analog of the bound \eqref{m-proj-kl-bound} exists
for the reverse KL \eqref{i-proj}, and it is given by
\begin{equation}
\label{i-proj-kl-bound}
\KL(p_\theta(x) \| \hat{p}(x)) \leq
\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)), \; \forall q(z|x).
\end{equation}
The proof follows the same logic as for the M-projection (see EM as KL minimization).
Since the KL in \eqref{i-proj-kl-bound} is reversed with respect to the KL in the definition
of the ELBO \eqref{m-elbo}, the negation of the right-hand side of \eqref{i-proj-kl-bound}
can be called the *reverse ELBO*.
For lack of a better name, let’s denote it $\OLEB$,
\begin{equation}
\label{oleb}
\OLEB_\theta[q] = -\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)).
\end{equation}
The OLEB is still a lower bound (it bounds $-\KL(p_\theta(x) \| \hat{p}(x))$ from below), but it yields the I-projection when maximized,
\begin{equation}
\theta^* = \arg \max_\theta \left\{ \max_q \, \OLEB_\theta[q] \right\}.
\end{equation}
Thus, once the ELBO is written as the KL divergence \eqref{m-elbo},
the ‘reverse ELBO’ \eqref{oleb} is obtained by simply swapping the arguments in the KL.
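As with the M-projection, the reversed bound \eqref{i-proj-kl-bound} and its tightness at $q(z|x) = p_\theta(z|x)$ can be checked on a small discrete example (illustrative numbers again):

```python
import numpy as np

# Discrete toy check of the reversed bound
# KL(p_theta(x) || p_hat(x)) <= KL(p_theta(x, z) || p_hat(x) q(z|x)),
# with equality at q(z|x) = p_theta(z|x). All numbers are illustrative.
prior = np.array([0.4, 0.6])            # p_theta(z)
px_given_z = np.array([[0.8, 0.2],      # p_theta(x | z=0)
                       [0.3, 0.7]])     # p_theta(x | z=1), rows z, cols x
p_hat = np.array([0.25, 0.75])          # target over x (normalized for simplicity)

p_joint = prior[:, None] * px_given_z   # p_theta(x, z), indexed [z, x]
p_x = p_joint.sum(axis=0)               # marginal p_theta(x)
posterior = p_joint / p_x               # p_theta(z | x)

kl_marginal = np.sum(p_x * np.log(p_x / p_hat))

def neg_oleb(q):                        # -OLEB_theta[q] = KL(p_joint || p_hat q)
    return np.sum(p_joint * np.log(p_joint / (p_hat * q)))

q_arbitrary = np.array([[0.5, 0.1],
                        [0.5, 0.9]])
assert neg_oleb(q_arbitrary) >= kl_marginal           # the bound holds
assert np.isclose(neg_oleb(posterior), kl_marginal)   # tight at the posterior
```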
However, *optimization* of the OLEB is not as straightforward.
In [1], it was shown that optimization of the OLEB
(denoted $U(\theta, q) = \OLEB_\theta[q]$ in the paper) with respect to $\theta$ can be framed
as an instance of black-box *relative entropy policy search* [2,3];
building on modern model-based stochastic search techniques [4],
the authors developed a practical algorithm for fitting a Gaussian mixture model $p_\theta(x)$
to a given unnormalized posterior $\hat{p}(x)$.

## Conclusion

As we have seen, there indeed exists an analog of the ELBO for the I-projection, which we denoted OLEB \eqref{oleb}. Although calling it the ‘reverse ELBO’ is perhaps misleading, because it is not a bound on the evidence, the reference to reversal makes the definition easy to remember. In conclusion, the relationships between the ELBOs and the KL projections can be summarized as follows
\begin{align}
\textrm{M:} \quad \KL(\hat{p}(x) \| p_\theta(x)), \quad
\ELBO_\theta[q] = -\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)), \\
\textrm{I:} \quad \KL(p_\theta(x) \| \hat{p}(x)), \quad
\OLEB_\theta[q] = -\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)).
\end{align}
Note that the assumptions on $\hat{p}(x)$ usually differ between the two settings. For the M-projection, it is important to be able to sample from $\hat{p}(x)$, whereas for the I-projection, it is important to be able to evaluate $\hat{p}(x)$. An interesting non-trivial application of the ‘reverse ELBO’ to multimodal posterior inference can be found in [1].

## References

1. Arenz, O., Neumann, G., Zhong, M. (2018). Efficient Gradient-Free Variational Inference using Policy Search. ICML, PMLR 80:234-243.
2. Peters, J., Mülling, K., Altun, Y. (2010). Relative Entropy Policy Search. AAAI, AAAI Press 10:1607-1612.
3. Neumann, G. (2011). Variational Inference for Policy Search in Changing Situations. ICML, Omnipress 11:817-824.
4. Abdolmaleki, A., Peters, J., Neumann, G. (2015). Model-Based Relative Entropy Stochastic Search. NIPS, Curran Associates Inc. 28:3537-3545.