Logistic regression as KL minimization

We have seen in a previous post that maximum likelihood estimation is nothing but KL minimization. Therefore, it should come as no surprise that maximum likelihood estimation of logistic regression model parameters is equivalent to KL minimization. Below, the relation is shown explicitly.

Problem setting

Let $p(x, t)$ be the true data distribution (density), where $x \in \mathbb{R}^n$ is the input and $t \in \{0, 1\}$ is the target (also called label or output). If we knew $p(x, t)$, we would be absolutely happy because then we could answer any question about $x$ and $t$; for example, “What is the most likely value of $t$ for a given $x$?” However, we can access $p(x, t)$ only through a dataset $D = \{ (x_i, t_i) \}_{i=1}^N$ of input-output pairs $(x_i, t_i)$ drawn from $p(x, t)$. In order to still be able to answer meaningful questions about $(x, t)$, we fit a parametric model $p_w(x, t)$ to the data and then treat that parametric model as the true data distribution $p(x, t)$.

KL minimization

Clearly, one especially aesthetically pleasing way of fitting a parametrized density $p_w(x, t)$ to an empirical density

\begin{equation*} \hat{p}(x, t) = \frac{1}{N} \sum_{i = 1}^N \delta_{t_i, t} \delta(x - x_i) \end{equation*}

is through minimizing the KL between them. (There are other ways; see the awesome book Computational Optimal Transport for alternative approaches to measuring ‘distance’ between clouds of points and continuous distributions). The KL is given by

\begin{equation*} KL(\hat{p}(x, t) \| p_w(x, t)) = \int_{\mathbb{R}^n} \sum_{t \in \{0, 1\}} \log \left( \frac{\hat{p}(x, t)}{p_w(x, t)} \right) \hat{p}(x, t) dx. \end{equation*}

Maximum likelihood estimation

Since we are interested in minimizing the KL with respect to the parameters $w$, we can drop the terms that do not depend on $w$, \begin{equation*} J(w) = - \int_{\mathbb{R}^n} \sum_{t \in \{0, 1\}} \log p_w(x, t) \hat{p}(x, t) dx. \end{equation*}

Upon substitution of the empirical distribution, we recover the negative log-likelihood objective

\begin{equation*} J(w) = - \frac{1}{N} \sum_{i = 1}^N \log p_w(x_i, t_i). \end{equation*}

Logistic regression

Great! But so far we haven’t said anything about logistic regression; the derivation was completely general and did not depend on how exactly $p_w(x, t)$ was defined. The logistic regression model proposes to split the joint density

\begin{equation*} p_w(x, t) = p_w(t | x) p(x) \end{equation*}

and model the posterior $p_w(t | x)$ explicitly

\begin{equation*} p_w(t | x) = \sigma_w(x)^t (1 - \sigma_w(x))^{1 - t} \end{equation*}

using the logistic function

\begin{equation*} \sigma_w(x) = \frac{1}{1 + e^{-w^T x}} \end{equation*}
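As a minimal sketch, the logistic function above can be implemented in NumPy. A direct translation of the formula overflows in `exp` for large negative $w^T x$, so a common trick (assumed here, not part of the derivation) is to branch on the sign of the argument:

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function sigma(z) = 1 / (1 + exp(-z)).
    # exp(-|z|) never overflows; the two branches are algebraically
    # equivalent to the textbook formula for z >= 0 and z < 0.
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
```

For an input $x$ and weights $w$, one would call `sigmoid(w @ x)`.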

while assuming the evidence $p(x)$ to be a fixed distribution whose support is wide enough to cover all regions of space where we have samples. Substituting this model of the joint distribution into the cost function, we obtain

\begin{equation*} J(w) = -\frac{1}{N} \sum_{i = 1}^N \left\{ t_i \log \sigma_w(x_i) + (1 - t_i) \log (1 - \sigma_w(x_i)) \right\}. \end{equation*}

After this point, it’s a matter of optimization to find the minimizer of this cost function.
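A minimal sketch of that optimization step, assuming plain gradient descent (the gradient of the cost is $\nabla J(w) = \frac{1}{N} \sum_i (\sigma_w(x_i) - t_i) x_i$, which follows from differentiating the expression above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, t, lr=0.1, n_steps=1000):
    """Minimize J(w) = -(1/N) sum_i [t_i log s_i + (1 - t_i) log(1 - s_i)]
    by gradient descent, where s_i = sigmoid(w^T x_i).

    X : (N, n) array of inputs, t : (N,) array of 0/1 targets.
    """
    N, n = X.shape
    w = np.zeros(n)
    for _ in range(n_steps):
        # Gradient of J: (1/N) * X^T (sigmoid(Xw) - t)
        grad = X.T @ (sigmoid(X @ w) - t) / N
        w -= lr * grad
    return w
```

Any other gradient-based optimizer (or a second-order method such as Newton's, since $J$ is convex in $w$) would do equally well; the learning rate and step count here are illustrative, not tuned.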

Additive form of the posterior

Since $t$ can only take values $t = 0$ or $t = 1$, we could have equivalently defined the posterior as

\begin{equation*} p_w(t|x) = t \sigma_w(x) + (1 - t) (1 - \sigma_w(x)). \end{equation*}

Such a parametrization was chosen, for example, in the autograd tutorial.
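The equivalence of the two parametrizations for $t \in \{0, 1\}$ is easy to verify numerically; a small sketch:

```python
import numpy as np

def posterior_product(s, t):
    # Exponential form: p_w(t|x) = sigma^t * (1 - sigma)^(1 - t)
    return s**t * (1.0 - s)**(1 - t)

def posterior_additive(s, t):
    # Additive form: p_w(t|x) = t * sigma + (1 - t) * (1 - sigma)
    return t * s + (1 - t) * (1.0 - s)
```

Both return $\sigma_w(x)$ when $t = 1$ and $1 - \sigma_w(x)$ when $t = 0$; they differ only for fractional $t$, which never occurs for binary targets.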


The logistic regression model only estimates the posterior $p(t|x)$, leaving the evidence $p(x)$ unspecified, which limits the number of questions one can answer using this model. Nevertheless, logistic regression covers the very important special case of classification, which makes it an essential tool in the machine learning toolbox.