# EVaR vs MaxEnt vs BI vs EM for PS

Black box policy search (PS) is an optimization problem in which one seeks a policy $\pi(a)$, i.e., a distribution over actions $a \in A$, that minimizes the expected cost $J(\pi) := E_{a\sim\pi} \left[c(a)\right]$. Searching over distributions rather than actions is advantageous when the cost function $c(a)$ is noisy or non-smooth. A key to efficient solution of this problem lies in a good exploration strategy. Several successful algorithms turn out to implicitly introduce optimism that helps to explore the objective landscape more expeditiously. We will look at those algorithms one by one to see how exactly they frame the problem and how optimism arises in each of them.

## Entropic value at risk (EVaR)

The first approach is based on introducing an auxiliary objective function called the entropic value at risk which allows for penalizing higher-order moments of the cost function. This approach is known as risk-sensitive optimization and it has a long history; see Risk sensitivity, a strangely pervasive concept by Peter Whittle for a concise introduction. The risk-sensitive objective reads $$\underset{\theta}{\text{minimize}} \quad J_{\gamma}(\theta):=-\frac{1}{\gamma}\log E_{a\sim\pi_{\theta}} \left[e^{-\gamma c(a)}\right] \label{risk_objective}$$ where $\gamma \in \mathbb{R}$ and $c(a)\geq0,\;\forall a\in A.$ The gradient with respect to $\theta$ is given by $$\nabla J_{\gamma}=-\frac{1}{\gamma} \frac{\nabla E_{\pi}\left[e^{-\gamma c}\right]} {E_{\pi}\left[e^{-\gamma c}\right]} =-\frac{1}{\gamma} \frac{E_{\pi}\left[\left(\nabla\log\pi\right)\cdot e^{-\gamma c}\right]} {E_{\pi}\left[e^{-\gamma c}\right]}. \label{true_risk_gradient}$$ Replacing expectations with sample averages $\hat{E}_{a\sim\pi} \left[ f(a) \right] := N^{-1}\sum_{i=1}^N f(a_i)$, we obtain an estimate of the gradient in the form $$\widehat{\nabla J_{\gamma}(\theta)}:=\hat{E}\left[ \nabla\log\pi_{\theta}\cdot \left(-\frac{1}{\gamma}e^{-\gamma c -\log\hat{E}\left[e^{-\gamma c}\right]}\right) \right]. \label{estimated_risk_gradient}$$ This formula provides a practical way to estimate the gradient of $J_\gamma$ at a given parameter value $\theta$. In general, we cannot pull the nabla out of the expectation because the expectation itself depends on $\theta$. However, we can introduce an empirical objective function $$\hat{J}_{\gamma}(\theta):=\hat{E}\left[\log\pi_{\theta}\cdot \left(-\frac{1}{\gamma}e^{-\gamma c -\log\hat{E}\left[e^{-\gamma c}\right]}\right)\right] \label{empirical_risk_objective}$$ whose gradient coincides with \eqref{estimated_risk_gradient} if the averages in \eqref{empirical_risk_objective} are taken over samples from $\pi_\theta$. The difference between $\nabla \hat{J}_\gamma(\theta)$ and $\widehat{\nabla J_{\gamma}(\theta)}$ lies in the fact that the sampling distribution in \eqref{empirical_risk_objective}, let’s denote it by $q$ such that $\hat{E} := \hat{E}_q$, is assumed to be independent of $\theta$, whereas in \eqref{estimated_risk_gradient}, the sampling distribution $q$ must be equal to the policy, $q = \pi_\theta$. The empirical objective \eqref{empirical_risk_objective} is convenient because it can be directly plugged into any algorithmic differentiation toolbox to compute $\nabla \hat{J}_\gamma(\theta)$.

Depending on $\gamma$, the objective function \eqref{risk_objective} is either risk-seeking ($\gamma > 0$) or risk-averse ($\gamma < 0$). When $\gamma = 0$, it is risk-neutral because $\lim_{\gamma \to 0} J_\gamma(\theta) = J(\theta)$ where $J(\theta) := J(\pi_\theta)$. Since we assumed the cost function to be non-negative, $c(a) \geq 0$, it feels more natural to consider the optimistic case $\gamma > 0$ because then the exponential in \eqref{risk_objective} is bounded between $0$ and $1$. And this is indeed the setting most often encountered in reinforcement learning; see the brilliant Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review by Sergey Levine for an overview. What concerns the pessimistic case corresponding to $\gamma < 0$, a curious observation was made by Peter Whittle that in the quadratic-cost Gaussian-policy setting, there exists $\gamma_{\min} < 0$ below which the expectation in \eqref{risk_objective} diverges and therefore the objective explodes. This situation is called neurotic breakdown because the agent becomes so pessimistic that it will decide to not move at all.

## Maximum entropy (MaxEnt)

Another approach to policy search is based on the maximum entropy principle. In reinforcement learning, it is known as relative entropy policy search, or REPS for short. Assume that we have a current policy $q$ which we would like to improve with respect to the original objective function $J,$ but under the condition that it does not deviate from the current policy by too much. This problem can be stated as follows \begin{aligned} \underset{\pi}{\text{minimize}} &\quad E_{\pi}\left[c\right]+\frac{1}{\gamma}KL(\pi\|q) \\ \text{subject to} &\quad E_{\pi}\left[1\right]=1. \end{aligned} \label{reps} In contrast to \eqref{risk_objective}, this is a variational optimization problem because no parametric form of $\pi$ is assumed. The solution to this problem is well-known; see the deeply insightful Graphical Models, Exponential Families, and Variational Inference by Martin J. Wainwright and Michael I. Jordan for details on the connection of Problem \eqref{reps} with exponential families. The solution of \eqref{reps} is $$\pi(a) = q(a) e^{-\gamma c(a) - \log E_q \left[ e^{-\gamma c} \right]}. \label{reps_pol}$$ Since the functional form of $c(a)$ is assumed unknown, we cannot obtain $\pi(a)$ as an explicit function of $a$ but can only sample from it by sampling from $q$ and subsequently weighting the samples with the exponential weights from \eqref{reps_pol}. In practice, it is often convenient to represent a policy by a parametric distribution from which it is easy to sample. Knowing that \eqref{reps_pol} is the optimal solution, we can fit a parametric policy $\pi_\theta$ to the samples from $\pi$. The objective for fitting $\pi_\theta$ is the KL minimization objective $$\underset{\theta}{\text{minimize}}\;KL(\pi\|\pi_{\theta})\propto \underset{\theta}{\text{maximize}}\;\hat{E} \left[ \log\pi_{\theta}\cdot e^{-\gamma c-\log\hat{E}\left[e^{-\gamma c}\right]} \right] \label{reps_ml}$$ where $\hat{E} := \hat{E}_q,$ as before. Note that the optimization objective in \eqref{reps_ml} is exactly the empirical objective $\hat{J}_{\gamma}(\theta)$ introduced in \eqref{empirical_risk_objective} up to scaling by $-\gamma^{-1}$. However, whereas \eqref{empirical_risk_objective} was used to compute just one gradient, the same objective in \eqref{reps_ml} must be optimized till convergence. For sufficiently small $\gamma$, the new policy $\pi_\theta$ is close to the current one $q$ because a lot of weight is put on the KL penalty in \eqref{reps}. Therefore, even though we do not just evaluate the gradient of \eqref{reps_ml} but completely optimize it, sampling distribution $q$ remains sufficiently close to $\pi_\theta$ to justify neglecting its influence on the gradient.

It is important to note that Problem \eqref{reps} only makes sense for $\gamma > 0$, whereas Problem \eqref{risk_objective} also allows $\gamma < 0$. However, as we discussed above, the case of interest is usually $\gamma > 0$, so this distinction, although significant, is not so relevant in practice.

## Bayesian inference (BI)

Viewing control as inference has an even longer history than risk-sensitive control, as vividly summarized in Robot Trajectory Optimization using Approximate Inference by Marc Toussaint. Treating action $a$ as a random variable with prior distribution $q(a)$, we can introduce a binary random variable $r$ called reward with the likelihood $p(r = 1 | a) := e^{-\gamma c(a)}.$ Assuming event $r = 1$ was observed, posterior $\pi(a) := p(a|r=1)$ can be computed using Bayes’ rule $$\pi(a)=\frac{q(a)p(r=1|a)}{p(r=1)}=\frac{q(a)e^{-\gamma c(a)}} {E_{q}\left[e^{-\gamma c}\right]} =q(a)e^{-\gamma c(a)-\log E_{q}\left[e^{-\gamma c}\right]}. \label{BI}$$ Despite it might look like black magic that we again and again end up with the same formula, the connection between Bayesian inference and the maximum entropy principle, manifested in the equality of Formulas \eqref{reps_pol} and \eqref{BI} in our case, is a well-studied subject with a lot of beautiful results; see the very enjoyable thesis Maximum Entropy: The Universal Method for Inference by Adom Giffin to learn more about these connections.

## Expectation maximization (EM)

Yet another approach to policy search is to frame the policy optimization problem as an instance of the expectation maximization (EM) algorithm. Let $p_\theta(r, a) = p(r|a) \pi_\theta(a)$ denote our parametric model of the full-data distribution where the likelihood is given by $p(r = 1 | a) := e^{-\gamma c(a)},$ as before. The EM algorithm in general can be viewed as minimizing the KL divergence $$\underset{\theta, q}{\text{minimize}} \;KL_{r,a}\left(\hat{p}(r)q(a)\|p_{\theta}(r,a)\right) \label{EM}$$ using coordinate descent. Here, $\hat{p}(r)$ is the empirical reward distribution, which reduces to $\hat{p}(r) = \delta(r - 1)$ because by assumption we always observe $r = 1$. In the E-step, we fix $\theta$ and optimize with respect to $q$; in our setting, this yields the posterior $q(a) = p_\theta (a|r=1).$ In the M-step, we fix $q$ and optimize with respect to the policy parameters $\theta$. As in \eqref{BI}, let’s denote the prior corresponding to the current $\theta$ by $q(a) := \pi_\theta(a)$ and the posterior by $\pi(a) := p_\theta(a|r=1)$. Then surprise, surprise, the posterior can be writen as $$\pi(a)=\frac{q(a)p(r=1|a)}{p(r=1)} =q(a)e^{-\gamma c(a)-\log E_{q}\left[e^{-\gamma c}\right]}. \label{E-step}$$ In the M-step, we need to solve the problem $$\underset{\theta}{\text{minimize}} \;KL_{r,a}\left( \hat{p}(r)\pi(a) \| p_\theta(r, a) \right)$$ which can be trivially rewritten as $$\underset{\theta}{\text{maximize}} \;E_{a\sim q}\left[ \log\pi_{\theta}(a)\cdot e^{-\gamma c(a) - \log E_q \left[e^{-\gamma c}\right]} \right]. \label{M-step}$$ Needless to say that this M-step \eqref{M-step} is the same as the maximum likelihood optimization \eqref{reps_ml} from REPS, and the objective that we are optimizing here is precisely the empirical risk-sensitive objective $\hat{J}_\gamma(\theta)$ from \eqref{empirical_risk_objective}. For more background on this connection to the EM and links to original papers, see the marvelous Survey on Policy Search for Robotics by Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters.

## Conclusion

No matter from which principle we started, the final conclusion has always been to move in the direction of the gradient of the empirical risk-sensitive objective function $\hat{J}_\gamma(\theta)$ defined in \eqref{empirical_risk_objective}. Practically most important case of small policy updates corresponds to small positive $\gamma,$ as can be seen from \eqref{reps}. In this case, all approaches are equivalent. However, for larger $\gamma$ or for negative $\gamma$ one cannot neglect the fact that the sampling distribution $q$ and the variable distribution $\pi_\theta$ are not the same when optimizing the risk-sensitive objective \eqref{risk_objective}; thus, in this setting, instead of optimizing \eqref{empirical_risk_objective} till convergence, one should rather do just one gradient step and move on.