Argmax vs Softmax vs Sparsemax

1 minute read


A summary inspired by the SparseMAP paper.


Argmax is the backbone of softmax and sparsemax. Suppose we want to get a probability distribution given a set of unnormalized scores $\theta$, the optimization problem is:

where the simplex $\bigtriangleup^d$ says $\sum_i y_i = 1$ and $\forall_i y_i \geq 0$, i.e. to make $y$ looks like a distribution.


Softmax, on the other hand, can be formulated on top of argmax.

where $-y^T ln(y)$ is a negative entropy prior/normalizer. (This form is exactly as appeared in the SparseMAP paper.) The immediate question is that: how this equation is softmax?

To see why, first we need to solve $y_i$. By rewriting the above optimization as:

we can see the objective is strictly convex. Thus we can take its Lagrangian:

With KKT conditions and slackness, we have the followings:

Then we have $y_i = exp(\theta_i + \lambda_1 - 1)$. To solve $\lambda_1$, we will need the simplex constraint:

Plugging this back gives $y_i = \frac{e^i}{\sum_i e^{\theta_i}}$, i.e. softmax itself.


Sparsemax uses L-2 normalizer instead of using a negative entropy prior.

Again, by rewriting in $\arg\min$, we can see the objective is strictly convex. But the problem is in fact a QP problem. Further, because it does not have the $ln(y)$ term like softmax, nothing prohibits $y_i=0$. So the solving becomes challenging. Simply put, one can use ADMM or off-the-shelf solver for the QP optimization.

But eitherway, we still need the gradient in backward pass for end-to-end learning.

Well, gradient TODO.