Reverse ELBO

It is well known that for a latent variable model \begin{equation} \label{model} p_\theta(x) = \int p_\theta(x|z)p_\theta(z) dz, \end{equation} the posterior $p_\theta(z|x)$ admits a variational representation \begin{equation} \newcommand{\E}{\mathrm{E}} \newcommand{\KL}{\mathrm{KL}} \newcommand{\ELBO}{\mathrm{ELBO}} \newcommand{\OLEB}{\mathrm{OLEB}} p_\theta(z|x) = \arg \max_{q(z|x)} \left\{ \E_{q(z|x)}[\log p_\theta(x|z)] - \KL[q(z|x) \| p_\theta(z)] \right\} \end{equation} where the term in braces is the $\ELBO_\theta[q|x]$. Moreover, at the maximum, the ELBO equals the log-likelihood, \begin{equation} \label{log-lik} \log p_\theta(x) = \max_{q} \, \ELBO_\theta[q|x]. \end{equation} Thanks to \eqref{log-lik}, maximization of the ELBO is equivalent to maximization of the log-likelihood $\log p_\theta(x)$. At the same time, maximization of the log-likelihood is equivalent to KL minimization, as explained in the post MLE = min KL. Therefore, maximization of the ELBO \eqref{log-lik} is equivalent to KL minimization, and in particular to the M-projection. In this post, we find the quantity, called here the ‘reverse ELBO’, that corresponds to the I-projection in the same way as the ELBO corresponds to the M-projection.
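To see where \eqref{log-lik} comes from, rewrite the ELBO in terms of the posterior (a one-line rearrangement using Bayes’ rule $p_\theta(x|z)p_\theta(z) = p_\theta(z|x)p_\theta(x)$), \begin{equation} \ELBO_\theta[q|x] = \E_{q(z|x)}[\log p_\theta(x|z)] - \KL[q(z|x) \| p_\theta(z)] = \log p_\theta(x) - \KL[q(z|x) \| p_\theta(z|x)]. \end{equation} Since the KL on the right is non-negative and vanishes exactly when $q(z|x) = p_\theta(z|x)$, the ELBO is maximized by the posterior, and its maximum value is the log-likelihood.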

M-Projection

Let’s first refresh the connection between the M-projection and the ELBO. Given a collection of i.i.d. samples $x_i, i=1,\dots,N,$ that form the empirical distribution $\hat{p}(x) = N^{-1}\sum_{i=1}^N \delta(x - x_i)$, maximum likelihood estimation of the parameter $\theta$ in the generative model \eqref{model} is equivalent to the M-projection \begin{equation} \label{mle} \theta^* = \arg \min_\theta \, \KL(\hat{p}(x) \| p_\theta(x)). \end{equation} As shown in the post on EM as KL minimization, the KL in \eqref{mle} is upper-bounded by a KL between joint distributions \begin{equation} \label{m-proj-kl-bound} \KL(\hat{p}(x) \| p_\theta(x)) \leq \KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)), \; \forall q(z|x), \end{equation} which in turn equals the expected negative ELBO (modulo an additive constant that depends on neither $\theta$ nor $q$ and is dropped in what follows; see Where is the lower bound?), \begin{equation} \label{m-elbo} \KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)) = \E_{\hat{p}(x)} [-\ELBO_\theta[q|x]] = -\ELBO_\theta[q]. \end{equation} Therefore, maximum likelihood estimation of $\theta$ is equivalent to maximization of the evidence lower bound, \begin{equation} \label{mle_elbo} \theta^* = \arg \max_\theta \left\{ \E_{\hat{p}(x)} [\log p_\theta(x)] \right\} = \arg \max_\theta \left\{ \max_q \, \ELBO_\theta[q] \right\}. \end{equation} In other words, minimization of the KL \eqref{mle} is equivalent to maximization of the ELBO \eqref{mle_elbo}. The distribution $p_{\theta^*}(x)$ obtained by minimizing the KL \eqref{mle} is called the M-projection, where M stands for ‘moment’: for exponential-family models, this projection matches the expected sufficient statistics (moments) of $\hat{p}(x)$.
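As a small numerical sanity check of \eqref{log-lik}, here is a minimal sketch, assuming a hypothetical linear-Gaussian toy model $z \sim \mathcal{N}(0,1)$, $x|z \sim \mathcal{N}(\theta z, 1)$, for which both the marginal likelihood and the exact posterior are available in closed form; all names and values in the snippet are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical toy model: z ~ N(0, 1), x | z ~ N(theta * z, 1),
# so the marginal is p_theta(x) = N(0, theta^2 + 1).
theta = 1.5
x = 0.7  # a single observation

def elbo(mu, sigma, n=200_000):
    """Monte Carlo estimate of ELBO_theta[q|x] for q(z|x) = N(mu, sigma^2)."""
    z = rng.normal(mu, sigma, size=n)
    exp_log_lik = norm.logpdf(x, loc=theta * z, scale=1.0).mean()  # E_q[log p_theta(x|z)]
    kl = np.log(1.0 / sigma) + (sigma**2 + mu**2 - 1.0) / 2.0      # KL[N(mu, sigma^2) || N(0, 1)]
    return exp_log_lik - kl

# Exact log-likelihood and exact posterior q*(z|x) = N(theta*x/(theta^2+1), 1/(theta^2+1)).
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(theta**2 + 1.0))
post_var = 1.0 / (theta**2 + 1.0)
post_mu = theta * x * post_var

print(log_px)                             # log p_theta(x)
print(elbo(post_mu, np.sqrt(post_var)))   # ELBO at the exact posterior: matches log p_theta(x)
print(elbo(0.0, 1.0))                     # ELBO for a mismatched q: strictly smaller
```

At the exact posterior, the two numbers coincide up to Monte Carlo error; any other $q$ gives a strictly smaller ELBO, which is what makes the inner maximization in \eqref{mle_elbo} work.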

I-Projection

Having seen the relation between the ELBO and the M-projection in the previous section, let us now find the analog of the ELBO for the I-projection. First, the assumptions on $\hat{p}(x)$ have to be adjusted to avoid problems in the definition when $\hat{p}(x)$ contains delta functions. For concreteness, assume $\hat{p}(x)$ is an unnormalized density that cannot be sampled from directly but can be evaluated at any point. This setting corresponds to posterior inference [1]. The I-projection, also called the information projection, is given by \begin{equation} \label{i-proj} \theta^* = \arg \min_\theta \, \KL(p_\theta(x) \| \hat{p}(x)). \end{equation} The distribution $p_{\theta^*}(x)$ obtained through minimization of \eqref{i-proj} is often characterized as mode-seeking because it tends to concentrate its probability mass on a single mode of $\hat{p}(x)$. Conveniently, a straightforward analog of the bound \eqref{m-proj-kl-bound} exists for the reverse KL \eqref{i-proj}, and it is given by \begin{equation} \label{i-proj-kl-bound} \KL(p_\theta(x) \| \hat{p}(x)) \leq \KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)), \; \forall q(z|x). \end{equation} The proof follows the same logic as for the M-projection (see EM as KL minimization). Since the KL in \eqref{i-proj-kl-bound} is reversed with respect to the KL in the definition of the ELBO \eqref{m-elbo}, the negation of the right-hand side of \eqref{i-proj-kl-bound} can be called the reverse ELBO. For lack of a better name, let’s denote it $\OLEB$, \begin{equation} \label{oleb} \OLEB_\theta[q] = -\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)). \end{equation} The OLEB is also a lower bound, this time on the negative reverse KL $-\KL(p_\theta(x) \| \hat{p}(x))$, and it yields the I-projection when maximized, \begin{equation} \theta^* = \arg \max_\theta \left\{ \max_q \, \OLEB_\theta[q] \right\}. \end{equation} Thus, once the ELBO is written as the KL divergence \eqref{m-elbo}, the ‘reverse ELBO’ \eqref{oleb} is obtained by simply swapping the arguments of the KL. However, optimization of the OLEB is not as straightforward. In [1], it was shown that optimization of the OLEB (denoted $U(\theta, q) = \OLEB_\theta[q]$ in the paper) with respect to $\theta$ can be framed as an instance of black-box relative entropy policy search [2,3], and a practical algorithm for fitting a Gaussian mixture model $p_\theta(x)$ to a given unnormalized posterior $\hat{p}(x)$ was developed in [1] based on modern model-based stochastic search techniques [4].
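To make the mode-seeking behavior concrete, here is a minimal sketch that minimizes the reverse KL \eqref{i-proj} directly for a Gaussian $p_\theta(x) = \mathcal{N}(\mu, \sigma^2)$, with no latent variables and a crude grid search in place of the policy-search machinery of [1]; the bimodal unnormalized target and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unnormalized target: cannot be sampled directly, but can be evaluated anywhere.
# Two Gaussian bumps of unequal weight around x = +3 and x = -3.
def log_p_hat(x):
    return np.logaddexp(-0.5 * (x - 3.0) ** 2, np.log(0.5) - 0.5 * (x + 3.0) ** 2)

def reverse_kl(mu, sigma, n=20_000):
    """Monte Carlo estimate of KL(p_theta || p_hat), up to the unknown log-normalizer of p_hat
    (a constant in (mu, sigma), so it does not affect the minimizer)."""
    eps = rng.standard_normal(n)
    x = mu + sigma * eps                                          # samples from p_theta = N(mu, sigma^2)
    log_p_theta = -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * eps**2
    return np.mean(log_p_theta - log_p_hat(x))

# Crude grid search over (mu, sigma) in place of a gradient-based optimizer.
candidates = [(reverse_kl(m, s), m, s)
              for m in np.linspace(-5.0, 5.0, 41)
              for s in np.linspace(0.3, 3.0, 10)]
_, mu_star, sigma_star = min(candidates)
print(mu_star, sigma_star)  # the fitted Gaussian locks onto the dominant mode near mu = 3
```

Because the unknown normalizer of $\hat{p}(x)$ only shifts the objective by a constant, being able to evaluate the unnormalized density is enough; the fitted Gaussian settles on the dominant mode rather than spreading over both, in contrast to what the M-projection would do.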

Conclusion

As we have seen, there indeed exists an analog of the ELBO for the I-projection, which we denoted OLEB \eqref{oleb}. Although calling it the ‘reverse ELBO’ is somewhat misleading, because it is not a bound on the evidence, the reference to reversal makes the definition easy to remember. In conclusion, the relationships between the ELBOs and the KL projections can be summarized as follows \begin{align} \textrm{M:} \quad \KL(\hat{p}(x) \| p_\theta(x)) \quad \ELBO_\theta[q] = -\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)), \\ \textrm{I:} \quad \KL(p_\theta(x) \| \hat{p}(x)) \quad \OLEB_\theta[q] = -\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)). \end{align} Note that the two projections usually come with different assumptions on $\hat{p}(x)$: for the M-projection, it is important to be able to sample from $\hat{p}(x)$, whereas for the I-projection, it is important to be able to evaluate $\hat{p}(x)$. An interesting non-trivial application of the ‘reverse ELBO’ to multimodal posterior inference can be found in [1].

References

  1. Arenz, O., Neumann, G., Zhong, M. (2018). Efficient Gradient-Free Variational Inference using Policy Search, ICML, PMLR 80:234-243.
  2. Peters, J., Mülling, K., Altun, Y. (2010). Relative Entropy Policy Search, AAAI, AAAI Press 10:1607-1612.
  3. Neumann, G. (2011). Variational Inference for Policy Search in Changing Situations, ICML, Omnipress 11:817-824.
  4. Abdolmaleki, A., Peters, J., Neumann, G. (2015). Model-Based Relative Entropy Stochastic Search, NIPS, Curran Associates Inc. 28:3537-3545.