Bregman divergence of alpha-divergence

16 Apr 2017

Many commonly used divergences between probability measures are special cases of the $f$-divergence.

$f$-Divergence

Definition. Let $\pi$ be a finite measure and $\mu$ a $\sigma$-finite measure on a measurable space $(S, \mathcal{S})$ such that $\pi$ is absolutely continuous with respect to $\mu$, $\pi \ll \mu$. For a convex function $\newcommand{\R}{\mathbb{R}} f \colon (0, \infty) \to \R$ such that $f(1) = 0$, the $f$-divergence from $\mu$ to $\pi$ is defined as $$ D_f(\pi || \mu) \triangleq \int_S f \left( \frac{d\pi}{d\mu} \right) d\mu, $$ where $d\pi/d\mu$ is the Radon-Nikodym derivative of $\pi$ with respect to $\mu$.

The measure $\mu$ is allowed to be $\sigma$-finite to include the case $\mu = \lambda$, where $\lambda$ is the Lebesgue measure.

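Example. If $f(x) = x \log x - (x - 1)$, the corresponding $f$-divergence is called the KL divergence and denoted $D(\pi || \mu)$. The term $-(x - 1)$ integrates to zero when $\pi$ and $\mu$ have the same total mass, so for probability measures this reduces to the usual $\int_S \log (d\pi/d\mu) \,d\pi$; keeping the term makes the definition valid for unnormalized measures as well.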
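
As a quick sanity check of the example, the sketch below evaluates the $f$-divergence on a finite set, where measures are just lists of positive point masses and the Radon-Nikodym derivative is the elementwise ratio. The finite setting and the names `f_kl` and `f_divergence` are assumptions made for illustration only.

```python
import math

def f_kl(x):
    """Generator of the KL divergence: f(x) = x log x - (x - 1)."""
    return x * math.log(x) - (x - 1.0)

def f_divergence(f, pi, mu):
    """D_f(pi || mu) on a finite set, with measures given as lists of point masses.

    Assumes pi[i] > 0 and mu[i] > 0 for all i, so pi << mu and f is
    only evaluated on (0, inf).
    """
    return sum(f(p / m) * m for p, m in zip(pi, mu))

pi = [0.2, 0.5, 0.3]   # a probability vector
mu = [0.4, 0.4, 0.2]   # another probability vector

# Textbook KL divergence for normalized inputs.
kl_textbook = sum(p * math.log(p / m) for p, m in zip(pi, mu))

# For normalized inputs the two values agree up to rounding.
print(f_divergence(f_kl, pi, mu), kl_textbook)

# An unnormalized reference measure (total mass 2): the -(x - 1) term
# no longer integrates to zero, yet the result stays non-negative.
mu_unnorm = [0.8, 0.8, 0.4]
print(f_divergence(f_kl, pi, mu_unnorm))
```

Passing the generator `f` as an argument keeps the helper reusable for other $f$-divergences.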

$\alpha$-Divergence

In this section, we introduce a one-parameter family of functions $f_\alpha,\,\alpha \in \R$, that can be used in place of $f$ in the definition of the $f$-divergence.

Notice that $f^\prime = \log$ for the $f$ that generates the KL divergence. If we replace $\log$ by the $\alpha$-logarithm

$\log_\alpha x \triangleq \frac{x^{\alpha-1} - 1}{\alpha - 1},$

and integrate, we obtain the $\alpha$-function

$f_\alpha(x) \triangleq \frac{(x^\alpha - 1) - \alpha(x - 1)}{\alpha(\alpha-1)},$

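where the constant of integration is chosen so that $f_\alpha(1) = 0$. For $\alpha = 0$ and $\alpha = 1$ the formula is understood as a limit; in particular, $f_1(x) = x \log x - (x - 1)$ is the generator of the KL divergence.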
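
The integration step and the two limiting cases can be written out explicitly. This is a worked sketch that uses only the definitions above; the first line assumes $\alpha \notin \{0, 1\}$, and the limits follow from L'Hôpital's rule applied in $\alpha$.

$$ \begin{aligned} f_\alpha(x) &= \int_1^x \log_\alpha t \,dt = \left[ \frac{t^\alpha}{\alpha(\alpha-1)} - \frac{t}{\alpha-1} \right]_1^x = \frac{(x^\alpha - 1) - \alpha(x - 1)}{\alpha(\alpha-1)}, \\ \lim_{\alpha \to 1} f_\alpha(x) &= x \log x - (x - 1), \\ \lim_{\alpha \to 0} f_\alpha(x) &= -\log x + (x - 1). \end{aligned} $$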

Definition. For any $\alpha \in \R$, the $f$-divergence with $f = f_\alpha$ is called the $\alpha$-divergence and denoted $D_\alpha$.

Bregman divergence

Definition. Let $f \colon G \to \R$ be a continuously differentiable, strictly convex function defined on a closed convex set $G \subseteq \R^n$. The Bregman divergence generated by $f$ is the function $d_f \colon G \times G \to \R$ given by $$ d_f(y, x) \triangleq f(y) - f(x) - \nabla f(x)^T (y-x). $$

Example. The KL generator $f_1(x) = x \log x - (x - 1)$ generates $d_{f_1} (y, x) = f_1(\frac{y}{x}) x$ for $x, y > 0$.

Example. More generally, $f_\alpha(x)$ generates $d_{f_\alpha} (y, x) = f_\alpha(\frac{y}{x}) x^\alpha$ for $x, y > 0$.
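
The closed form in the examples can be checked numerically for scalars. The following is a minimal sketch, not a proof: it compares the definition $f(y) - f(x) - f'(x)(y - x)$ against $f_\alpha(y/x)\,x^\alpha$ at a few points. The helper names are hypothetical, and the cases $\alpha = 1$ and $\alpha = 0$ are hard-coded as the limits of $f_\alpha$.

```python
import math

def f_alpha(x, alpha):
    """alpha-function; alpha = 1 and alpha = 0 are taken as limits."""
    if alpha == 1.0:
        return x * math.log(x) - (x - 1.0)
    if alpha == 0.0:
        return -math.log(x) + (x - 1.0)
    return ((x**alpha - 1.0) - alpha * (x - 1.0)) / (alpha * (alpha - 1.0))

def log_alpha(x, alpha):
    """alpha-logarithm, the derivative of f_alpha in x (plain log at alpha = 1)."""
    if alpha == 1.0:
        return math.log(x)
    return (x**(alpha - 1.0) - 1.0) / (alpha - 1.0)

def bregman(y, x, alpha):
    """Scalar Bregman divergence d_f(y, x) = f(y) - f(x) - f'(x) (y - x), f = f_alpha."""
    return f_alpha(y, alpha) - f_alpha(x, alpha) - log_alpha(x, alpha) * (y - x)

def closed_form(y, x, alpha):
    """Claimed closed form f_alpha(y / x) * x**alpha."""
    return f_alpha(y / x, alpha) * x**alpha

# Largest absolute gap between definition and closed form over a small grid.
max_gap = 0.0
for alpha in (-1.0, 0.0, 0.5, 1.0, 2.0, 3.0):
    for y, x in ((0.3, 1.7), (2.0, 0.5), (1.0, 1.0)):
        gap = abs(bregman(y, x, alpha) - closed_form(y, x, alpha))
        max_gap = max(max_gap, gap)

print(max_gap)  # expected to be at the level of floating-point rounding
```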

Bregman divergence generated by $f$-divergence

For fixed $\mu$ and $f$, the $f$-divergence $D_f(\cdot || \mu)$ is a function of $\pi$,

$\pi \mapsto D_f(\pi || \mu) = \int_S f(\pi_\mu)\,d\mu,$

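called the $f$-divergence from $\mu$, where $\pi_\mu \triangleq d\pi/d\mu$ is shorthand for the Radon-Nikodym derivative. Since $f$ is convex, $D_f(\cdot || \mu)$ is convex in $\pi$; if $f$ is in addition differentiable and strictly convex, $D_f(\cdot || \mu)$ is a valid generator of a Bregman divergence.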

Proposition. For a finite measure $\rho$ and $\sigma$-finite measures $\nu$ and $\mu$ such that $\rho \ll \nu \ll \mu$, the Bregman divergence from $\nu$ to $\rho$ generated by $D_f(\cdot || \mu)$ is given by $$ \begin{equation} \label{eq:bregman_f} d_{D_f(\cdot || \mu)} (\rho, \nu) = \int_S d_f(\rho_\mu, \nu_\mu) \,d\mu. \end{equation} $$
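
A formal sketch of why the proposition holds, assuming that $f$ is differentiable, that all integrals are finite, and that differentiation under the integral sign is justified (none of these assumptions is verified here). With $\rho_\mu = d\rho/d\mu$ and $\nu_\mu = d\nu/d\mu$, the directional derivative of the generator at $\nu$ in the direction $\rho - \nu$ is

$$ \frac{d}{dt} D_f(\nu + t(\rho - \nu) || \mu) \Big|_{t=0} = \int_S f^\prime(\nu_\mu) (\rho_\mu - \nu_\mu) \,d\mu, $$

so that

$$ d_{D_f(\cdot || \mu)} (\rho, \nu) = \int_S \left[ f(\rho_\mu) - f(\nu_\mu) - f^\prime(\nu_\mu)(\rho_\mu - \nu_\mu) \right] d\mu = \int_S d_f(\rho_\mu, \nu_\mu) \,d\mu. $$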

KL generates KL

Theorem. The Bregman divergence $d_{D(\cdot || \mu)}(\rho, \nu)$ from $\nu$ to $\rho$ generated by the KL divergence $D(\cdot || \mu)$ from $\mu$ does not depend on the measure $\mu$ and equals the KL divergence from $\nu$ to $\rho$, $$ d_{D(\cdot || \mu)}(\rho, \nu) = D(\rho || \nu). $$

Proof. Substituting $d_{f_1}$ from the example above into \eqref{eq:bregman_f} and using the chain rule $\rho_\mu = \rho_\nu \nu_\mu$ together with $\nu_\mu \,d\mu = d\nu$, we obtain

$d_{D(\cdot || \mu)}(\rho, \nu) = \int_S f_1 \left( \frac{\rho_\mu}{\nu_\mu} \right) \nu_\mu \,d\mu = \int_S f_1 (\rho_\nu)\,d\nu = D(\rho || \nu).$

The theorem shows that the KL divergence from $\mu$ is a special generator of the Bregman divergence: it is a fixed point of the Bregman transform. To make this more formal, consider the function $D_f \colon (\mu, \pi) \mapsto D_f(\pi || \mu)$ and the Bregman transform $d \colon D_f(\cdot || \mu) \mapsto d_{D_f(\cdot || \mu)}$. The theorem asserts that $d(D(\cdot || \mu)) = D$ for any $\sigma$-finite measure $\mu$, where $D \triangleq D_{f_1}$ is the KL divergence.

$\alpha$ generates $\beta$

Definition. For any $\beta \in \R$, for any finite $\pi$ and $\sigma$-finite $\mu$ and $\lambda$ such that $\pi \ll \mu \ll \lambda$, the $\beta$-divergence from $\mu$ to $\pi$ under $\lambda$ is defined as $$ D_\lambda^\beta (\pi || \mu) \triangleq \int_S f_\beta(\pi_\mu) \mu_\lambda^\beta \,d\lambda, $$ where $f_\beta$ is the $\alpha$-function with $\alpha = \beta$ and $\mu_\lambda = d\mu/d\lambda$.

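Usually $\lambda$ is taken to be the Lebesgue measure. Because of the weight $\mu_\lambda^\beta$, the $\beta$-divergence is in general not an $f$-divergence; for $\beta = 1$ it reduces to the KL divergence, since $\mu_\lambda \,d\lambda = d\mu$.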

Theorem. The Bregman divergence $d_{D_\alpha(\cdot || \mu)} (\rho, \nu)$ from $\nu$ to $\rho$ generated by the $\alpha$-divergence $D_\alpha(\cdot || \mu)$ from $\mu$ is the $\beta$-divergence from $\nu$ to $\rho$ under $\mu$ with $\beta = \alpha$, $$ d_{D_\alpha(\cdot || \mu)} (\rho, \nu) = D_\mu^\alpha(\rho || \nu). $$

Proof. Substituting $d_{f_\alpha}$ from the example above into \eqref{eq:bregman_f} and using the chain rule $\rho_\mu = \rho_\nu \nu_\mu$, we obtain

$d_{D_\alpha(\cdot || \mu)} (\rho, \nu) = \int_S f_\alpha \left( \frac{\rho_\mu}{\nu_\mu} \right) \nu_\mu^\alpha\,d\mu = \int_S f_\alpha(\rho_\nu) \nu_\mu^\alpha \,d\mu = D_\mu^\alpha(\rho || \nu).$

The theorem asserts that the $\beta$-divergence is the Bregman divergence generated by the $\alpha$-divergence. Note that the ground measure $\mu$ plays a role here, in contrast to the case of the KL divergence, for which the Bregman divergence was independent of $\mu$.
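
The contrast between the two cases can be illustrated numerically on a finite set. The sketch below is an illustration under assumptions, not part of the proofs: measures are lists of positive point masses, the Bregman divergence is computed directly from the generator $\pi \mapsto D_\alpha(\pi || \mu)$ and its gradient, and all function names are hypothetical. The case $\alpha = 0$ is not handled.

```python
import math

def f_alpha(x, alpha):
    """alpha-function; alpha = 1 is the KL limit (alpha = 0 is not handled here)."""
    if alpha == 1.0:
        return x * math.log(x) - (x - 1.0)
    return ((x**alpha - 1.0) - alpha * (x - 1.0)) / (alpha * (alpha - 1.0))

def log_alpha(x, alpha):
    """alpha-logarithm, the derivative of f_alpha in x."""
    if alpha == 1.0:
        return math.log(x)
    return (x**(alpha - 1.0) - 1.0) / (alpha - 1.0)

def alpha_divergence(pi, mu, alpha):
    """D_alpha(pi || mu) = sum_i f_alpha(pi_i / mu_i) mu_i on a finite set."""
    return sum(f_alpha(p / m, alpha) * m for p, m in zip(pi, mu))

def bregman_of_alpha_divergence(rho, nu, mu, alpha):
    """Bregman divergence of pi -> D_alpha(pi || mu), evaluated at (rho, nu).

    The gradient of the generator at nu has components log_alpha(nu_i / mu_i).
    """
    linear = sum(log_alpha(n / m, alpha) * (r - n) for r, n, m in zip(rho, nu, mu))
    return alpha_divergence(rho, mu, alpha) - alpha_divergence(nu, mu, alpha) - linear

def beta_divergence(pi, mu, lam, beta):
    """D_lam^beta(pi || mu) = sum_i f_beta(pi_i / mu_i) (mu_i / lam_i)^beta lam_i."""
    return sum(f_alpha(p / m, beta) * (m / l)**beta * l for p, m, l in zip(pi, mu, lam))

rho = [0.2, 0.5, 0.3]
nu = [0.4, 0.4, 0.2]
mu_a = [1.0, 1.0, 1.0]   # two different ground measures
mu_b = [0.5, 2.0, 1.0]

for alpha in (1.0, 2.0):
    for name, mu in (("mu_a", mu_a), ("mu_b", mu_b)):
        lhs = bregman_of_alpha_divergence(rho, nu, mu, alpha)
        rhs = beta_divergence(rho, nu, mu, alpha)
        print(alpha, name, lhs, rhs)
```

For $\alpha = 1$ the printed values should coincide for both ground measures, while for $\alpha = 2$ the two sides agree for each ground measure but change when the ground measure changes.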

$\beta$ generates $\beta$

Theorem. The Bregman divergence $d_{D_\mu^\beta(\cdot || \nu)} (\eta, \xi)$ from $\xi$ to $\eta$ generated by the $\beta$-divergence $D_\mu^\beta(\cdot || \nu)$ from $\nu$ under $\mu$ does not depend on the measure $\nu$ and equals the $\beta$-divergence from $\xi$ to $\eta$, $$ d_{D_\mu^\beta(\cdot || \nu)} (\eta, \xi) = D_\mu^\beta(\eta || \xi). $$

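Proof. The weight $\nu_\mu^\beta$ does not depend on the argument of the generator, so the Bregman divergence again acts pointwise on $f_\beta$. Using the chain rule $\eta_\nu = \eta_\xi \xi_\nu$ and $\xi_\mu = \xi_\nu \nu_\mu$ (assuming $\eta \ll \xi \ll \nu \ll \mu$), we obtain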

$d_{D_\mu^\beta(\cdot || \nu)} (\eta, \xi) = \int_S f_\beta \left( \frac{\eta_\nu}{\xi_\nu} \right) \xi_\nu^\beta \nu_\mu^\beta \,d\mu = \int_S f_\beta(\eta_\xi)\xi_\mu^\beta\,d\mu = D_\mu^\beta(\eta || \xi).$

Thus, the $\beta$-divergence is stable under the Bregman transform, just like the KL divergence.
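
The stability claim can also be probed on a finite set. The following sketch assumes measures given as lists of positive point masses and computes the Bregman divergence directly from the generator $\pi \mapsto D_\mu^\beta(\pi || \nu)$ and its gradient; the function names are hypothetical and $\beta = 0$ is not handled. The check is that the result does not change with $\nu$ and matches $D_\mu^\beta(\eta || \xi)$.

```python
import math

def f_beta(x, beta):
    """alpha-function with alpha = beta; beta = 1 is the KL limit (beta = 0 not handled)."""
    if beta == 1.0:
        return x * math.log(x) - (x - 1.0)
    return ((x**beta - 1.0) - beta * (x - 1.0)) / (beta * (beta - 1.0))

def log_beta(x, beta):
    """Derivative of f_beta in x."""
    if beta == 1.0:
        return math.log(x)
    return (x**(beta - 1.0) - 1.0) / (beta - 1.0)

def beta_divergence(pi, nu, mu, beta):
    """D_mu^beta(pi || nu) = sum_i f_beta(pi_i / nu_i) (nu_i / mu_i)^beta mu_i."""
    return sum(f_beta(p / n, beta) * (n / m)**beta * m for p, n, m in zip(pi, nu, mu))

def beta_divergence_grad(pi, nu, mu, beta):
    """Gradient in pi: log_beta(pi_i / nu_i) * nu_i^(beta - 1) * mu_i^(1 - beta)."""
    return [log_beta(p / n, beta) * n**(beta - 1.0) * m**(1.0 - beta)
            for p, n, m in zip(pi, nu, mu)]

def bregman(eta, xi, nu, mu, beta):
    """Bregman divergence of pi -> D_mu^beta(pi || nu), evaluated at (eta, xi)."""
    grad = beta_divergence_grad(xi, nu, mu, beta)
    linear = sum(g * (e - x) for g, e, x in zip(grad, eta, xi))
    return beta_divergence(eta, nu, mu, beta) - beta_divergence(xi, nu, mu, beta) - linear

eta = [0.2, 0.5, 0.3]
xi = [0.4, 0.4, 0.2]
mu = [0.5, 2.0, 1.0]
nu_a = [1.0, 1.0, 1.0]   # two different reference measures
nu_b = [0.3, 0.9, 1.5]

for beta in (0.5, 1.0, 2.0, 3.0):
    target = beta_divergence(eta, xi, mu, beta)
    print(beta, bregman(eta, xi, nu_a, mu, beta), bregman(eta, xi, nu_b, mu, beta), target)
```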

Boris Belousov