# Reverse ELBO

24 Jul 2019

It is well known that for a latent variable model
\begin{equation}
\label{model}
p_\theta(x) = \int p_\theta(x|z)p_\theta(z) dz,
\end{equation}
the posterior $p_\theta(z|x)$ admits a *variational representation*
\begin{equation}
\newcommand{\E}{\mathrm{E}}
\newcommand{\KL}{\mathrm{KL}}
\newcommand{\ELBO}{\mathrm{ELBO}}
\newcommand{\OLEB}{\mathrm{OLEB}}
p_\theta(z|x) = \arg \max_{q(z|x)}
\left\{ \E_{q(z|x)}[\log p_\theta(x|z)] - \KL[q(z|x) \| p_\theta(z)] \right\}
\end{equation}
where the term in braces is the evidence lower bound, $\ELBO_\theta[q|x]$.
Moreover, at maximum, the ELBO yields the log-likelihood,
\begin{equation}
\label{log-lik}
\log p_\theta(x) = \max_{q} \, \ELBO_\theta[q|x].
\end{equation}
Thanks to \eqref{log-lik}, maximization of the ELBO
is equivalent to maximization of the log-likelihood $\log p_\theta(x)$.
At the same time, maximization of the log-likelihood
is equivalent to KL minimization, as explained in the post MLE = min KL.
Therefore, maximization of the ELBO \eqref{log-lik} is equivalent to KL minimization,
and in particular to the M-projection.
In this post, we find the quantity that plays the same role for the I-projection
as the ELBO plays for the M-projection; we call it the ‘reverse ELBO’.
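Before moving on, the variational representation above can be sanity-checked numerically. Below is a small sketch for a two-state discrete model (all numbers are illustrative, chosen only for this check):

```python
import numpy as np

# Toy two-state latent variable model (all numbers are illustrative):
# z in {0, 1} with prior p(z); we observe x = 1 with likelihood p(x=1|z).
prior = np.array([0.3, 0.7])   # p_theta(z)
lik = np.array([0.9, 0.2])     # p_theta(x=1 | z)

# Evidence (marginal likelihood) of the observation x = 1.
log_px = np.log(np.sum(lik * prior))

def elbo(q):
    """ELBO_theta[q|x] = E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    return np.sum(q * np.log(lik)) - np.sum(q * np.log(q / prior))

# The exact posterior maximizes the ELBO and attains log p_theta(x).
post = lik * prior / np.sum(lik * prior)
assert np.isclose(elbo(post), log_px)

# Any other q stays strictly below the evidence (hence 'lower bound').
assert elbo(np.array([0.5, 0.5])) < log_px
```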

## M-Projection

Let’s first refresh the connection between the M-projection and the ELBO.
Given a collection of i.i.d. samples $x_i, i=1,\dots,N,$
that form the empirical distribution $\hat{p}(x) = N^{-1}\sum_{i=1}^N \delta(x - x_i)$,
maximum likelihood estimation of the parameter $\theta$ in the generative model \eqref{model}
is equivalent to the M-projection
\begin{equation}
\label{mle}
\theta^* = \arg \min_\theta \, \KL(\hat{p}(x) \| p_\theta(x)).
\end{equation}
As shown in the post on EM as KL minimization,
the KL in \eqref{mle} is upper-bounded by another KL between joint distributions
\begin{equation}
\label{m-proj-kl-bound}
\KL(\hat{p}(x) \| p_\theta(x)) \leq
\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)), \; \forall q(z|x),
\end{equation}
which in turn equals the expected negated ELBO (see Where is the lower bound?),
\begin{equation}
\label{m-elbo}
\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z))
= \E_{\hat{p}(x)} [-\ELBO_\theta[q|x]]
= -\ELBO_\theta[q].
\end{equation}
Therefore, maximum likelihood estimation of $\theta$
is equivalent to maximization of the evidence lower bound,
\begin{equation}
\label{mle_elbo}
\theta^* = \arg \max_\theta \left\{ \E_{\hat{p}(x)} [\log p_\theta(x)] \right\}
= \arg \max_\theta \left\{ \max_q \, \ELBO_\theta[q] \right\}.
\end{equation}
In other words, *minimization of the KL \eqref{mle}
is equivalent to maximization of the ELBO \eqref{mle_elbo}*.
The distribution $p_{\theta^*}(x)$ obtained by minimizing the KL \eqref{mle}
is called the M-projection,
where M stands for ‘moment’: for exponential-family models,
$p_{\theta^*}(x)$ matches the moments (expected sufficient statistics) of $\hat{p}(x)$.
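The bound \eqref{m-proj-kl-bound} and its tightness at the exact posterior are easy to verify in a discrete model. The sketch below uses illustrative numbers (not taken from the post):

```python
import numpy as np

# Discrete toy check of the bound
# KL(p_hat(x) || p_theta(x)) <= KL(p_hat(x) q(z|x) || p_theta(x, z)),
# with equality at q(z|x) = p_theta(z|x). All numbers are illustrative.
prior = np.array([0.4, 0.6])            # p_theta(z)
px_given_z = np.array([[0.8, 0.2],      # p_theta(x | z=0)
                       [0.3, 0.7]])     # p_theta(x | z=1), rows z, cols x
p_hat = np.array([0.25, 0.75])          # stand-in for the empirical distribution

p_joint = prior[:, None] * px_given_z   # p_theta(x, z), indexed [z, x]
p_x = p_joint.sum(axis=0)               # marginal p_theta(x)
posterior = p_joint / p_x               # p_theta(z | x), indexed [z, x]

def kl_marginal():
    return np.sum(p_hat * np.log(p_hat / p_x))

def kl_joint(q):                        # q(z|x) indexed [z, x]
    pq = p_hat * q                      # p_hat(x) q(z|x)
    return np.sum(pq * np.log(pq / p_joint))

q_arbitrary = np.array([[0.5, 0.1],
                        [0.5, 0.9]])
assert kl_joint(q_arbitrary) >= kl_marginal()          # the bound holds
assert np.isclose(kl_joint(posterior), kl_marginal())  # tight at the posterior
```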

## I-Projection

Having seen the relation between the ELBO and the M-projection
in the previous section, let us now find the analog of the ELBO for the I-projection.
First, the assumptions on $\hat{p}(x)$ have to be adjusted
to avoid problems in the definition when $\hat{p}(x)$ contains delta functions.
For concreteness, assume $\hat{p}(x)$ to be an unnormalized density,
such that direct sampling from $\hat{p}(x)$ is impossible
but evaluation at any point is straightforward.
This setting corresponds to posterior inference [1].
The I-projection, also called the information projection, is given by
\begin{equation}
\label{i-proj}
\theta^* = \arg \min_\theta \, \KL(p_\theta(x) \| \hat{p}(x)).
\end{equation}
The distribution $p_{\theta^*}(x)$ obtained through minimization of \eqref{i-proj}
is often characterized as mode-seeking because it tends to concentrate the probability mass
on a single mode of $\hat{p}(x)$.
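The mode-seeking behaviour can be illustrated with a deliberately crude numerical experiment (a toy sketch, not the method of [1]): minimize the reverse KL of a single Gaussian against an unnormalized bimodal target by grid search.

```python
import numpy as np

# Mode-seeking illustration: minimize KL(p_theta || p_hat) over a single
# Gaussian p_theta = N(mu, sigma^2), where p_hat is an unnormalized mixture
# with modes at -3 and +3. Grid search is a crude sketch, not the method of [1].
xs = np.linspace(-10, 10, 4001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_hat = 2.0 * (gauss(xs, -3, 0.7) + gauss(xs, 3, 0.7))  # unnormalized target

def reverse_kl(mu, sigma):
    p = gauss(xs, mu, sigma)
    m = p > 0                       # skip underflowed tails: 0 * log 0 = 0
    # KL(p_theta || p_hat) up to the constant log-normalizer of p_hat.
    return np.sum(p[m] * np.log(p[m] / p_hat[m])) * dx

best = min(((mu, sigma) for mu in np.linspace(-5, 5, 41)
            for sigma in np.linspace(0.3, 3, 28)),
           key=lambda ms: reverse_kl(*ms))

# The minimizer locks onto one of the two modes with a narrow sigma,
# instead of spreading mass across both.
assert abs(abs(best[0]) - 3) < 0.5 and best[1] < 1.5
```

Note that the unknown normalizer of $\hat{p}(x)$ only shifts the objective by a constant, so it does not affect the minimizer.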
Conveniently, a straightforward analog of the bound \eqref{m-proj-kl-bound} exists
for the reverse KL \eqref{i-proj}, and it is given by
\begin{equation}
\label{i-proj-kl-bound}
\KL(p_\theta(x) \| \hat{p}(x)) \leq
\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)), \; \forall q(z|x).
\end{equation}
The proof follows the same logic as for the M-projection (see EM as KL minimization).
Since the KL in \eqref{i-proj-kl-bound} is reversed with respect to the KL in the definition
of the ELBO \eqref{m-elbo}, the negation of the right-hand side of \eqref{i-proj-kl-bound}
can be called the *reverse ELBO*.
For lack of a better name, let’s denote it $\OLEB$,
\begin{equation}
\label{oleb}
\OLEB_\theta[q] = -\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)).
\end{equation}
The OLEB is still a lower bound (it bounds $-\KL(p_\theta(x) \| \hat{p}(x))$ from below), but it yields the I-projection when maximized,
\begin{equation}
\theta^* = \arg \max_\theta \left\{ \max_q \, \OLEB_\theta[q] \right\}.
\end{equation}
Thus, once the ELBO is written as the KL divergence \eqref{m-elbo},
the ‘reverse ELBO’ \eqref{oleb} is obtained by simply swapping the arguments in the KL.
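As with the M-projection, the reversed bound \eqref{i-proj-kl-bound} and its tightness at $q(z|x) = p_\theta(z|x)$ can be checked on a small discrete example (illustrative numbers again):

```python
import numpy as np

# Discrete toy check of the reversed bound
# KL(p_theta(x) || p_hat(x)) <= KL(p_theta(x, z) || p_hat(x) q(z|x)),
# with equality at q(z|x) = p_theta(z|x). All numbers are illustrative.
prior = np.array([0.4, 0.6])            # p_theta(z)
px_given_z = np.array([[0.8, 0.2],      # p_theta(x | z=0)
                       [0.3, 0.7]])     # p_theta(x | z=1), rows z, cols x
p_hat = np.array([0.25, 0.75])          # target over x (normalized for simplicity)

p_joint = prior[:, None] * px_given_z   # p_theta(x, z), indexed [z, x]
p_x = p_joint.sum(axis=0)               # marginal p_theta(x)
posterior = p_joint / p_x               # p_theta(z | x)

kl_marginal = np.sum(p_x * np.log(p_x / p_hat))

def neg_oleb(q):                        # -OLEB_theta[q] = KL(p_joint || p_hat q)
    return np.sum(p_joint * np.log(p_joint / (p_hat * q)))

q_arbitrary = np.array([[0.5, 0.1],
                        [0.5, 0.9]])
assert neg_oleb(q_arbitrary) >= kl_marginal           # the bound holds
assert np.isclose(neg_oleb(posterior), kl_marginal)   # tight at the posterior
```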
However, *optimization* of the OLEB is not as straightforward.
In [1], it was shown that optimization of the OLEB
(denoted $U(\theta, q) = \OLEB_\theta[q]$ in the paper) with respect to $\theta$ can be framed
as an instance of black-box *relative entropy policy search* [2,3];
building on modern model-based stochastic search techniques [4],
the authors developed a practical algorithm for fitting a Gaussian mixture model $p_\theta(x)$
to a given unnormalized posterior $\hat{p}(x)$.

## Conclusion

As we have seen, there indeed exists an analog of the ELBO for the I-projection, which we denoted OLEB \eqref{oleb}. Although calling it the ‘reverse ELBO’ is perhaps misleading, because it is not a bound on the evidence, the reference to reversal makes the definition easy to remember. In conclusion, the relationships between the ELBOs and the KL projections can be summarized as follows
\begin{align}
\textrm{M:} \quad \KL(\hat{p}(x) \| p_\theta(x)), \quad
\ELBO_\theta[q] = -\KL(\hat{p}(x)q(z|x) \| p_\theta(x, z)), \\
\textrm{I:} \quad \KL(p_\theta(x) \| \hat{p}(x)), \quad
\OLEB_\theta[q] = -\KL(p_\theta(x, z) \| \hat{p}(x)q(z|x)).
\end{align}
Note that the assumptions on $\hat{p}(x)$ usually differ between the two settings. For the M-projection, it is important to be able to sample from $\hat{p}(x)$, whereas for the I-projection, it is important to be able to evaluate $\hat{p}(x)$. An interesting non-trivial application of the ‘reverse ELBO’ to multimodal posterior inference can be found in [1].

## References

1. Arenz, O., Neumann, G., Zhong, M. (2018). Efficient Gradient-Free Variational Inference using Policy Search. ICML, PMLR 80:234-243.
2. Peters, J., Mülling, K., Altun, Y. (2010). Relative Entropy Policy Search. AAAI, AAAI Press 10:1607-1612.
3. Neumann, G. (2011). Variational Inference for Policy Search in Changing Situations. ICML, Omnipress 11:817-824.
4. Abdolmaleki, A., Peters, J., Neumann, G. (2015). Model-Based Relative Entropy Stochastic Search. NIPS, Curran Associates Inc. 28:3537-3545.