# Reverse ELBO


## M-Projection

Let’s first recall the connection between the M-projection and the ELBO. Given a collection of i.i.d. samples $x_i, i=1,\dots,N,$ that form the empirical distribution $\hat{p}(x) = N^{-1}\sum_{i=1}^N \delta(x - x_i)$, maximum likelihood estimation of the parameter $\theta$ in the generative model \eqref{model} is equivalent to the M-projection $$\label{mle} \theta^* = \arg \min_\theta \, \KL(\hat{p}(x) \| p_\theta(x)).$$ As shown in the post on EM as KL minimization, the KL in \eqref{mle} is upper-bounded by another KL between joint distributions $$\label{m-proj-kl-bound} \KL(\hat{p}(x) \| p_\theta(x)) \leq \KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)), \; \forall q(z|x),$$ with equality when $q(z|x)$ is the model posterior $p_\theta(z|x)$. The right-hand side in turn equals the expected negated ELBO (see Where is the lower bound?), $$\label{m-elbo} \KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)) = \E_{\hat{p}(x)} [-\ELBO_\theta[q|x]] = -\ELBO_\theta[q].$$ Therefore, maximum likelihood estimation of $\theta$ is equivalent to maximization of the evidence lower bound, $$\label{mle_elbo} \theta^* = \arg \max_\theta \left\{ \E_{\hat{p}(x)} [\log p_\theta(x)] \right\} = \arg \max_\theta \left\{ \max_q \, \ELBO_\theta[q] \right\}.$$ In other words, minimization of the KL \eqref{mle} is equivalent to maximization of the ELBO \eqref{mle_elbo}. The distribution $p_{\theta^*}(x)$ obtained by minimizing the KL \eqref{mle} is called the M-projection, where M stands for ‘moment’: when $p_\theta(x)$ belongs to an exponential family, $p_{\theta^*}(x)$ matches the moments (expected sufficient statistics) of $\hat{p}(x)$.
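To make the bound \eqref{m-proj-kl-bound} tangible, here is a small numerical check. It is only a sketch under simplifying assumptions: both $x$ and $z$ are discrete, $\hat{p}(x)$ is an arbitrary categorical distribution rather than an empirical one, and the model joint $p_\theta(x, z)$ is a random table. The check relies on the chain rule for the KL, $\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)) = \KL(\hat{p}(x) \| p_\theta(x)) + \E_{\hat{p}(x)}[\KL(q(z|x) \| p_\theta(z|x))]$, so the gap is non-negative and vanishes at $q(z|x) = p_\theta(z|x)$. All names in the code (`kl`, `p_hat`, `p_joint`, `q`) are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def kl(p, q):
    """KL(p || q) for two arrays of probabilities with matching shapes."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


K, M = 5, 3  # alphabet sizes for x and z (arbitrary choice)

# Toy target p_hat(x) and toy model joint p_theta(x, z).
p_hat = rng.dirichlet(np.ones(K))
p_joint = rng.dirichlet(np.ones(K * M)).reshape(K, M)
p_x = p_joint.sum(axis=1)            # model marginal p_theta(x)
posterior = p_joint / p_x[:, None]   # model posterior p_theta(z | x)

# An arbitrary variational conditional q(z | x); each row sums to one.
q = rng.dirichlet(np.ones(M), size=K)

marginal_kl = kl(p_hat, p_x)
joint_kl_any_q = kl(p_hat[:, None] * q, p_joint)
joint_kl_posterior = kl(p_hat[:, None] * posterior, p_joint)

# Gap predicted by the chain rule: E_{p_hat(x)} KL(q(z|x) || p_theta(z|x)).
gap = sum(p_hat[i] * kl(q[i], posterior[i]) for i in range(K))

print(f"KL(p_hat || p_theta(x))             = {marginal_kl:.6f}")
print(f"joint KL with an arbitrary q        = {joint_kl_any_q:.6f}")
print(f"joint KL with q = p_theta(z|x)      = {joint_kl_posterior:.6f}")
print(f"marginal KL + chain-rule gap        = {marginal_kl + gap:.6f}")
```

The second and fourth printed values should agree, and the third should coincide with the first, which is the tightness condition used in \eqref{mle_elbo}.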

## I-Projection

Having seen the relation between the ELBO and the M-projection in the previous section, let us now find the analog of the ELBO for the I-projection. First, the assumptions on $\hat{p}(x)$ have to be adjusted to avoid problems in the definition of the KL when $\hat{p}(x)$ contains delta functions. For concreteness, assume $\hat{p}(x)$ to be an unnormalized density, such that direct sampling from $\hat{p}(x)$ is impossible but evaluation at any point is straightforward. This setting corresponds to posterior inference [1]. The I-projection, also called the information projection, is given by $$\label{i-proj} \theta^* = \arg \min_\theta \, \KL(p_\theta(x) \| \hat{p}(x)).$$ The distribution $p_{\theta^*}(x)$ obtained through minimization of \eqref{i-proj} is often characterized as mode-seeking because it tends to concentrate the probability mass on a single mode of $\hat{p}(x)$.

Conveniently, a straightforward analog of the bound \eqref{m-proj-kl-bound} exists for the reverse KL \eqref{i-proj}, and it is given by $$\label{i-proj-kl-bound} \KL(p_\theta(x) \| \hat{p}(x)) \leq \KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)), \; \forall q(z|x).$$ The proof follows the same logic as for the M-projection (see EM as KL minimization), and the bound is again tight for $q(z|x) = p_\theta(z|x)$. Since the KL in \eqref{i-proj-kl-bound} is reversed with respect to the KL in the definition of the ELBO \eqref{m-elbo}, the negated right-hand side of \eqref{i-proj-kl-bound} can be called the reverse ELBO. For lack of a more imaginative name, let’s denote it $\OLEB$, $$\label{oleb} \OLEB_\theta[q] = -\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)).$$ The OLEB is also a lower bound, this time on the negated reverse KL \eqref{i-proj}, and it yields the I-projection when maximized, $$\theta^* = \arg \max_\theta \left\{ \max_q \, \OLEB_\theta[q] \right\}.$$ Thus, once the ELBO is written as the KL divergence \eqref{m-elbo}, the ‘reverse ELBO’ \eqref{oleb} is obtained by simply swapping the arguments in the KL.

However, optimization of the OLEB is not as straightforward. In [1], it was shown that optimization of the OLEB (denoted $U(\theta, q) = \OLEB_\theta[q]$ in the paper) with respect to $\theta$ can be framed as an instance of black-box relative entropy policy search [2,3]. Building on modern model-based stochastic search techniques [4], the same paper develops a practical algorithm for fitting a Gaussian mixture model $p_\theta(x)$ to a given unnormalized posterior $\hat{p}(x)$.
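The reverse bound \eqref{i-proj-kl-bound} can be checked numerically in the same spirit. The sketch below assumes discrete $x$ and $z$ and uses a random positive vector as a stand-in for the unnormalized $\hat{p}(x)$. It is a toy verification, not the algorithm of [1]. The relevant chain rule is $\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)) = \KL(p_\theta(x) \| \hat{p}(x)) + \E_{p_\theta(x)}[\KL(p_\theta(z|x) \| q(z|x))]$, which also holds when $\hat{p}(x)$ is unnormalized because the missing normalizer shifts both sides by the same constant. The helper `kl` and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)


def kl(p, q):
    """Sum of p * log(p / q); q is allowed to be unnormalized."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


K, M = 5, 3  # alphabet sizes for x and z (arbitrary choice)

# Unnormalized target: it can be evaluated pointwise but does not sum to one.
p_tilde = rng.uniform(0.5, 2.0, size=K)

# Toy model joint p_theta(x, z) with its marginal and posterior.
p_joint = rng.dirichlet(np.ones(K * M)).reshape(K, M)
p_x = p_joint.sum(axis=1)
posterior = p_joint / p_x[:, None]

# An arbitrary variational conditional q(z | x); each row sums to one.
q = rng.dirichlet(np.ones(M), size=K)

reverse_kl = kl(p_x, p_tilde)
oleb_any_q = -kl(p_joint, p_tilde[:, None] * q)
oleb_posterior = -kl(p_joint, p_tilde[:, None] * posterior)

# Gap predicted by the chain rule: E_{p_theta(x)} KL(p_theta(z|x) || q(z|x)).
gap = sum(p_x[i] * kl(posterior[i], q[i]) for i in range(K))

print(f"-KL(p_theta(x) || p_tilde(x))       = {-reverse_kl:.6f}")
print(f"OLEB with an arbitrary q            = {oleb_any_q:.6f}")
print(f"OLEB with q = p_theta(z|x)          = {oleb_posterior:.6f}")
print(f"-(reverse KL + chain-rule gap)      = {-(reverse_kl + gap):.6f}")
```

Note that the expectation in the gap is now taken under the model $p_\theta(x)$ rather than under $\hat{p}(x)$, which is why evaluating $\hat{p}(x)$ suffices and sampling from it is never needed.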

## Conclusion

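As we have seen, there indeed exists an analog of the ELBO for the I-projection, which we denoted OLEB \eqref{oleb}. Although calling it ‘reverse ELBO’ is probably misleading because it is not a bound on the evidence, the reference to reversal makes the definition easy to remember. In conclusion, the relationships between the ELBOs and KL projections can be summarized as follows: $$\begin{aligned} \text{M-projection:} \quad & \arg \min_\theta \, \KL(\hat{p}(x) \| p_\theta(x)) = \arg \max_\theta \max_q \, \ELBO_\theta[q], \quad & \ELBO_\theta[q] &= -\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)), \\ \text{I-projection:} \quad & \arg \min_\theta \, \KL(p_\theta(x) \| \hat{p}(x)) = \arg \max_\theta \max_q \, \OLEB_\theta[q], \quad & \OLEB_\theta[q] &= -\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)). \end{aligned}$$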

Note that the two projections usually come with different assumptions on $\hat{p}(x)$. For the M-projection, it is important to be able to sample from $\hat{p}(x)$, whereas for the I-projection, it is important to be able to evaluate $\hat{p}(x)$, possibly only up to a normalizing constant. An interesting non-trivial application of the ‘reverse ELBO’ to multimodal posterior inference can be found in [1].
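The difference in assumptions, and the mode-seeking behavior of the I-projection, can be seen in a toy one-dimensional experiment. The sketch below is an illustration under assumptions of my own choosing: the target is a bimodal mixture with weights 0.3 and 0.7 at $x = \mp 4$, the model is a single Gaussian without latent variables, the M-projection is computed by moment matching on samples, and the I-projection is found by a brute-force grid search over the mean and standard deviation with the reverse KL approximated by a Riemann sum. None of this is the method of [1], and all function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def log_p_tilde(x):
    """Log of an unnormalized bimodal target (toy choice for this sketch)."""
    return np.logaddexp(np.log(0.3) - 0.5 * (x + 4.0) ** 2,
                        np.log(0.7) - 0.5 * (x - 4.0) ** 2)


# M-projection onto a Gaussian: uses samples only (maximum likelihood,
# i.e. matching the sample mean and standard deviation).
n = 20_000
centers = np.where(rng.random(n) < 0.3, -4.0, 4.0)
samples = centers + rng.standard_normal(n)
mu_m, sigma_m = samples.mean(), samples.std()

# I-projection onto a Gaussian: uses pointwise evaluations only.
x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]
log_target = log_p_tilde(x)


def reverse_kl(mu, sigma):
    """Riemann-sum approximation of KL(N(mu, sigma^2) || p_tilde)."""
    log_q = (-0.5 * ((x - mu) / sigma) ** 2
             - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
    return float(np.sum(np.exp(log_q) * (log_q - log_target)) * dx)


# Brute-force search over a coarse grid of candidate parameters.
mus = np.linspace(-6.0, 6.0, 49)
sigmas = np.linspace(0.25, 4.0, 16)
values = np.array([[reverse_kl(m, s) for s in sigmas] for m in mus])
i, j = np.unravel_index(np.argmin(values), values.shape)
mu_i, sigma_i = mus[i], sigmas[j]

print(f"M-projection: mean = {mu_m:.2f}, std = {sigma_m:.2f}")
print(f"I-projection: mean = {mu_i:.2f}, std = {sigma_i:.2f}")
```

The moment-matched Gaussian spreads over both modes, while the grid search settles on the heavier mode. A grid search is used here only because it is easy to verify; it does not scale beyond a handful of parameters.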

## References

1. Arenz, O., Neumann, G., Zhong, M. (2018). Efficient Gradient-Free Variational Inference using Policy Search, ICML, PMLR 80:234-243.
2. Peters, J., Mülling, K., Altun, Y. (2010). Relative Entropy Policy Search, AAAI, AAAI Press 10:1607-1612.
3. Neumann, G. (2011). Variational Inference for Policy Search in Changing Situations, ICML, Omnipress 11:817-824.
4. Abdolmaleki, A., Peters, J., Neumann, G. (2015). Model-Based Relative Entropy Stochastic Search, NIPS, Curran Associates Inc. 28:3537-3545.