# Reverse ELBO

24 Jul 2019

It is well known that for a latent variable model
\begin{equation}
\label{model}
p_\theta(x) = \int p_\theta(x|z)p_\theta(z) dz,
\end{equation}
the posterior $p_\theta(z|x)$ admits a *variational representation*
\begin{equation}
\newcommand{\E}{\mathrm{E}}
\newcommand{\KL}{\mathrm{KL}}
\newcommand{\ELBO}{\mathrm{ELBO}}
\newcommand{\OLEB}{\mathrm{OLEB}}
p_\theta(z|x) = \arg \max_{q(z|x)}
\left\{ \E_{q(z|x)}[\log p_\theta(x|z)] - \KL[q(z|x) \| p_\theta(z)] \right\}
\end{equation}
where the term in braces is the evidence lower bound $\ELBO_\theta[q|x]$.
Moreover, at maximum, the ELBO yields the log-likelihood,
\begin{equation}
\label{log-lik}
\log p_\theta(x) = \max_{q} \, \ELBO_\theta[q|x].
\end{equation}
Thanks to \eqref{log-lik}, maximization of the ELBO
is equivalent to maximization of the log-likelihood $\log p_\theta(x)$.
At the same time, maximization of the log-likelihood
is equivalent to KL minimization, as explained in the post MLE = min KL.
Therefore, maximization of the ELBO \eqref{log-lik} is equivalent to KL minimization,
and in particular to the M-projection.
In this post, we find the quantity that corresponds to the I-projection
in the same way as the ELBO corresponds to the M-projection;
we call it the ‘reverse ELBO’.
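
Both the variational representation and \eqref{log-lik} are easy to verify numerically for a model with a two-state latent variable. A minimal sketch (the prior and likelihood values below are illustrative assumptions, not from the post):

```python
import numpy as np

# Two-state latent z and a fixed observation x; illustrative numbers.
prior = np.array([0.3, 0.7])        # p(z)
lik = np.array([0.9, 0.2])          # p(x | z) evaluated at the fixed x

def elbo(q):
    """ELBO[q | x] = E_q[log p(x|z)] - KL(q || p(z))."""
    return np.sum(q * np.log(lik)) - np.sum(q * np.log(q / prior))

evidence = np.sum(lik * prior)      # p(x) = sum_z p(x|z) p(z)
posterior = lik * prior / evidence  # p(z|x) by Bayes' rule

# At q = posterior the ELBO attains log p(x); any other q gives a lower value.
assert np.isclose(elbo(posterior), np.log(evidence))
assert elbo(np.array([0.5, 0.5])) < np.log(evidence)
```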

## M-Projection

Let’s first recall the connection between the M-projection and the ELBO.
Given a collection of i.i.d. samples $x_i, i=1,\dots,N,$
that form the empirical distribution $\hat{p}(x) = N^{-1}\sum_{i=1}^N \delta(x - x_i)$,
maximum likelihood estimation of the parameter $\theta$ in the generative model \eqref{model}
is equivalent to the M-projection
\begin{equation}
\label{mle}
\theta^* = \arg \min_\theta \, \KL(\hat{p}(x) \| p_\theta(x)).
\end{equation}
As shown in the post on EM as KL minimization,
the KL in \eqref{mle} is upper-bounded by another KL between joint distributions
\begin{equation}
\label{m-proj-kl-bound}
\KL(\hat{p}(x) \| p_\theta(x)) \leq
\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)), \; \forall q(z|x),
\end{equation}
which in turn equals the expected negated ELBO (see Where is the lower bound?),
\begin{equation}
\label{m-elbo}
\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z))
= \E_{\hat{p}(x)} [-\ELBO_\theta[q|x]]
= -\ELBO_\theta[q].
\end{equation}
Therefore, maximum likelihood estimation of $\theta$
is equivalent to maximization of the evidence lower bound,
\begin{equation}
\label{mle_elbo}
\theta^* = \arg \max_\theta \left\{ \E_{\hat{p}(x)} [\log p_\theta(x)] \right\}
= \arg \max_\theta \left\{ \max_q \, \ELBO_\theta[q] \right\}.
\end{equation}
In other words, *minimization of the KL \eqref{mle}
is equivalent to maximization of the ELBO \eqref{mle_elbo}*.
The distribution $p_{\theta^*}(x)$ obtained by minimizing the KL \eqref{mle}
is called the M-Projection,
where M stands for ‘moment’, highlighting that for exponential-family models
$p_{\theta^*}(x)$ matches the expected sufficient statistics (moments) of $\hat{p}(x)$.
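
The bound \eqref{m-proj-kl-bound} and its tightness at $q(z|x) = p_\theta(z|x)$ can be checked on a fully discrete toy model. A minimal sketch (all numbers are illustrative assumptions):

```python
import numpy as np

# Toy discrete model: x and z each take two values; illustrative numbers.
p_hat = np.array([0.6, 0.4])             # empirical p̂(x)
prior = np.array([0.3, 0.7])             # p_θ(z)
lik = np.array([[0.9, 0.2],              # p_θ(x|z): rows index x, columns z
                [0.1, 0.8]])
joint = lik * prior                      # p_θ(x, z)
p_x = joint.sum(axis=1)                  # marginal p_θ(x)
posterior = joint / p_x[:, None]         # p_θ(z|x)

def kl(a, b):
    """KL divergence, summed over all entries of the arrays."""
    return np.sum(a * np.log(a / b))

kl_marg = kl(p_hat, p_x)                 # left-hand side of the bound
# The bound holds for an arbitrary q(z|x) and for the exact posterior.
for q in [np.array([[0.5, 0.5], [0.2, 0.8]]), posterior]:
    assert kl_marg <= kl(p_hat[:, None] * q, joint) + 1e-12
# Equality holds exactly at q(z|x) = p_θ(z|x).
assert np.isclose(kl(p_hat[:, None] * posterior, joint), kl_marg)
```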

## I-Projection

Having seen the relation between the ELBO and the M-projection
in the previous section, let us now find the analog of the ELBO for the I-projection.
First, the assumptions on $\hat{p}(x)$ have to be adjusted,
because the reverse KL is ill-defined when $\hat{p}(x)$ contains delta functions.
For concreteness, assume $\hat{p}(x)$ to be an unnormalized density,
such that direct sampling from $\hat{p}(x)$ is impossible
but evaluation at any point is straightforward.
This setting corresponds to posterior inference [1].
The I-projection, also called the information projection, is given by
\begin{equation}
\label{i-proj}
\theta^* = \arg \min_\theta \, \KL(p_\theta(x) \| \hat{p}(x)).
\end{equation}
The distribution $p_{\theta^*}(x)$ obtained through minimization of \eqref{i-proj}
is often characterized as mode-seeking because it typically concentrates the probability mass
on a single mode of $\hat{p}(x)$.
Conveniently, a straightforward analog of the bound \eqref{m-proj-kl-bound} exists
for the reverse KL \eqref{i-proj}, and it is given by
\begin{equation}
\label{i-proj-kl-bound}
\KL(p_\theta(x) \| \hat{p}(x)) \leq
\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)), \; \forall q(z|x).
\end{equation}
The proof follows the same logic as for the M-projection (see EM as KL minimization).
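
Since \eqref{i-proj-kl-bound} mirrors \eqref{m-proj-kl-bound}, it can be checked numerically the same way. A minimal sketch on a toy discrete model (all numbers are illustrative assumptions; note the bound is again tight at $q(z|x) = p_\theta(z|x)$):

```python
import numpy as np

# Toy discrete check of the reverse bound; illustrative numbers.
p_hat = np.array([2.0, 1.5])             # unnormalized p̂(x), evaluation only
prior = np.array([0.3, 0.7])             # p_θ(z)
lik = np.array([[0.9, 0.2],              # p_θ(x|z): rows index x, columns z
                [0.1, 0.8]])
joint = lik * prior                      # p_θ(x, z)
p_x = joint.sum(axis=1)                  # marginal p_θ(x)
posterior = joint / p_x[:, None]         # p_θ(z|x)

def kl(a, b):
    """KL-type divergence, summed over all entries of the arrays."""
    return np.sum(a * np.log(a / b))

kl_marg = kl(p_x, p_hat)                 # KL(p_θ(x) ‖ p̂(x))
# The bound holds for an arbitrary q(z|x) and for the exact posterior.
for q in [np.array([[0.5, 0.5], [0.2, 0.8]]), posterior]:
    assert kl_marg <= kl(joint, p_hat[:, None] * q) + 1e-12
# As in the M-projection case, equality holds at q(z|x) = p_θ(z|x).
assert np.isclose(kl(joint, p_hat[:, None] * posterior), kl_marg)
```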
Since the KL in \eqref{i-proj-kl-bound} is reversed with respect to the KL defining
the ELBO in \eqref{m-elbo}, the negation of the right-hand side of \eqref{i-proj-kl-bound}
can be called the *reverse ELBO*.
For lack of a better name, let’s denote it $\OLEB$,
\begin{equation}
\label{oleb}
\OLEB_\theta[q] = -\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)).
\end{equation}
The OLEB is a lower bound as well, namely on the negated reverse KL $-\KL(p_\theta(x) \| \hat{p}(x))$, and it yields the I-projection when maximized,
\begin{equation}
\theta^* = \arg \max_\theta \left\{ \max_q \, \OLEB_\theta[q] \right\}.
\end{equation}
Thus, once the ELBO is written as the KL divergence \eqref{m-elbo},
the ‘reverse ELBO’ \eqref{oleb} is obtained by simply swapping the arguments in the KL.
However, *optimization* of the OLEB is not as straightforward.
In [1], it was shown that optimization of the OLEB
(denoted $U(\theta, q) = \OLEB_\theta[q]$ in the paper) with respect to $\theta$ can be framed
as an instance of black-box *relative entropy policy search* [2,3].
Building on modern model-based stochastic search techniques [4],
a practical algorithm was developed in [1] for fitting a Gaussian mixture model $p_\theta(x)$
to a given unnormalized posterior $\hat{p}(x)$.
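
For intuition, the mode-seeking behavior of the I-projection can be illustrated without any of the machinery of [1]: a brute-force grid search over a single Gaussian fitted to a discretized bimodal target already shows the contrast with the M-projection. The target and the parameter grids below are illustrative assumptions:

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 1201)
dx = x[1] - x[0]

def normal(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target playing the role of the posterior p̂(x), modes at ±2.
target = 0.5 * normal(x, -2.0, 0.5) + 0.5 * normal(x, 2.0, 0.5)

best_rev = best_fwd = (np.inf, None, None)
for mu in np.linspace(-3.0, 3.0, 61):
    for sigma in np.linspace(0.5, 3.0, 26):
        p = normal(x, mu, sigma)
        rev = np.sum(p * np.log(p / target)) * dx       # KL(p_θ ‖ p̂): I-projection
        fwd = np.sum(target * np.log(target / p)) * dx  # KL(p̂ ‖ p_θ): M-projection
        if rev < best_rev[0]:
            best_rev = (rev, mu, sigma)
        if fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)

# The I-projection locks onto a single mode (mean near ±2, narrow),
# while the M-projection covers both modes with one wide Gaussian (mean near 0).
assert abs(abs(best_rev[1]) - 2.0) < 0.3 and best_rev[2] < 1.0
assert abs(best_fwd[1]) < 0.3 and best_fwd[2] > 1.5
```

The forward-KL winner is simply the moment-matched Gaussian (mean 0, variance equal to that of the mixture), in line with the M-for-moment naming above.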

## Conclusion

As we have seen, there indeed exists an analog of the ELBO for the I-projection, which we denoted OLEB \eqref{oleb}. Although calling it the ‘reverse ELBO’ is somewhat misleading, because it is not a bound on the evidence, the reference to reversal makes the definition easy to remember. In conclusion, the relationships between the ELBOs and the KL projections can be summarized as follows:

- $\ELBO_\theta[q] = -\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z))$: maximization over $\theta$ and $q$ yields the M-projection \eqref{mle};
- $\OLEB_\theta[q] = -\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x))$: maximization over $\theta$ and $q$ yields the I-projection \eqref{i-proj}.

Note that the natural assumptions on $\hat{p}(x)$ differ between the two projections. For the M-projection, it is important to be able to sample from $\hat{p}(x)$, whereas for the I-projection, it is important to be able to evaluate $\hat{p}(x)$ pointwise. An interesting non-trivial application of the ‘reverse ELBO’ to multimodal posterior inference can be found in [1].

## References

1. Arenz, O., Neumann, G., Zhong, M. (2018). Efficient Gradient-Free Variational Inference using Policy Search, ICML, PMLR 80:234-243.
2. Peters, J., Mülling, K., Altun, Y. (2010). Relative Entropy Policy Search, AAAI, AAAI Press 10:1607-1612.
3. Neumann, G. (2011). Variational Inference for Policy Search in Changing Situations, ICML, Omnipress 11:817-824.
4. Abdolmaleki, A., Peters, J., Neumann, G. (2015). Model-Based Relative Entropy Stochastic Search, NIPS, Curran Associates Inc. 28:3537-3545.