Argmax vs Softmax vs Sparsemax
Published:
A summary inspired by the SparseMAP paper.
1. Argmax
Argmax is the backbone of softmax and sparsemax. Suppose we want to obtain a probability distribution from a set of unnormalized scores $\theta$. The optimization problem is:

$$\hat{y} = \arg\max_{y \in \bigtriangleup^d} \theta^T y$$

where the simplex $\bigtriangleup^d$ requires $\sum_i y_i = 1$ and $\forall_i \ y_i \geq 0$, i.e. it makes $y$ look like a distribution.
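Because the objective $\theta^T y$ is linear, its maximum over the simplex is attained at a vertex, i.e. a one-hot vector selecting the largest score. A minimal NumPy sketch (function name is my own):

```python
import numpy as np

def hard_argmax(theta):
    """Solve max_{y in simplex} theta^T y.

    A linear objective over the simplex peaks at a vertex,
    so the solution is the one-hot vector of the top score.
    """
    y = np.zeros_like(theta, dtype=float)
    y[np.argmax(theta)] = 1.0
    return y

theta = np.array([1.0, 3.0, 2.0])
print(hard_argmax(theta))  # one-hot at index 1: [0. 1. 0.]
```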
2. Softmax
Softmax, on the other hand, can be formulated on top of argmax:

$$\text{softmax}(\theta) = \arg\max_{y \in \bigtriangleup^d} \theta^T y - y^T \ln(y)$$

where the added term $-y^T \ln(y)$ is the entropy of $y$, acting as a regularizer/normalizer. (This form is exactly as it appears in the SparseMAP paper.) The immediate question is: how is this equation softmax?
To see why, we first need to solve for $y_i$. Rewrite the above optimization as a minimization:

$$\hat{y} = \arg\min_{y \in \bigtriangleup^d} y^T \ln(y) - \theta^T y$$
From this form we can see the objective is strictly convex. Thus we can take its Lagrangian:

$$\mathcal{L}(y, \lambda_1, \lambda_2) = y^T \ln(y) - \theta^T y - \lambda_1 (\mathbf{1}^T y - 1) - \lambda_2^T y$$
With the KKT conditions and complementary slackness, we have the following:

$$\frac{\partial \mathcal{L}}{\partial y_i} = \ln(y_i) + 1 - \theta_i - \lambda_1 - \lambda_{2,i} = 0, \qquad \lambda_{2,i} y_i = 0, \qquad \lambda_{2,i} \geq 0$$

Since the $\ln(y_i)$ term forces $y_i > 0$, slackness gives $\lambda_{2,i} = 0$.
Then we have $y_i = \exp(\theta_i + \lambda_1 - 1)$. To solve for $\lambda_1$, we need the simplex constraint:

$$\sum_i y_i = \sum_i \exp(\theta_i + \lambda_1 - 1) = 1 \quad \Rightarrow \quad e^{\lambda_1 - 1} = \frac{1}{\sum_i e^{\theta_i}}$$
Plugging this back gives $y_i = \frac{e^{\theta_i}}{\sum_j e^{\theta_j}}$, i.e. softmax itself.
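The closed form is easy to sanity-check numerically: the softmax output should score at least as high as any other point on the simplex under the entropy-regularized objective. A small sketch (function names are my own):

```python
import numpy as np

def softmax(theta):
    # Shift by the max for numerical stability; the solution is unchanged.
    e = np.exp(theta - theta.max())
    return e / e.sum()

def objective(theta, y):
    # theta^T y - y^T ln(y), with 0 * ln(0) treated as 0.
    entropy = -np.sum(np.where(y > 0, y * np.log(y), 0.0))
    return theta @ y + entropy

theta = np.array([0.5, 1.5, -0.2])
y_star = softmax(theta)

# Compare against random points on the simplex (Dirichlet draws):
# the closed-form solution should never lose.
rng = np.random.default_rng(0)
others = [rng.dirichlet(np.ones(3)) for _ in range(100)]
assert all(objective(theta, y_star) >= objective(theta, y) for y in others)
```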
3. Sparsemax
Sparsemax uses an $L_2$ regularizer instead of the entropy term:

$$\text{sparsemax}(\theta) = \arg\max_{y \in \bigtriangleup^d} \theta^T y - \frac{1}{2} \|y\|_2^2$$
Again, by rewriting it as an $\arg\min$ we can see the objective is strictly convex; in fact, it is equivalent (up to a constant) to the Euclidean projection of $\theta$ onto the simplex:

$$\text{sparsemax}(\theta) = \arg\min_{y \in \bigtriangleup^d} \|y - \theta\|_2^2$$

so the problem is a QP. Further, because it does not have the $\ln(y)$ term of softmax, nothing prohibits $y_i = 0$, which is exactly what produces sparse outputs but also makes the problem harder to solve analytically. Simply put, one can use ADMM or an off-the-shelf solver for the QP optimization.
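That said, projection onto the simplex admits a well-known closed-form procedure (sort the scores, find the support, threshold), so in practice a general QP solver is not required. A minimal sketch:

```python
import numpy as np

def sparsemax(theta):
    """Euclidean projection of theta onto the probability simplex."""
    z = np.sort(theta)[::-1]             # scores sorted in descending order
    cssv = np.cumsum(z)                  # cumulative sums of sorted scores
    k = np.arange(1, len(z) + 1)
    support = z * k > (cssv - 1)         # which sorted entries stay positive
    k_z = k[support][-1]                 # size of the support
    tau = (cssv[support][-1] - 1) / k_z  # threshold subtracted from scores
    return np.maximum(theta - tau, 0.0)

theta = np.array([2.0, 1.0, -1.0])
print(sparsemax(theta))  # [1. 0. 0.] -- mass concentrates, zeros appear
```

Note how, unlike softmax, low-scoring coordinates receive exactly zero probability, which is the "sparse" in sparsemax.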
But either way, we still need the gradient in the backward pass for end-to-end learning.
Well, gradient TODO.