Argmax vs Softmax vs Sparsemax
Published:
A summary inspired by the SparseMAP paper.
1. Argmax
Argmax is the backbone of softmax and sparsemax. Suppose we want to obtain a probability distribution from a set of unnormalized scores $\theta$. The optimization problem is:

$$\hat{y} = \arg\max_{y \in \bigtriangleup^d} \theta^T y$$

where the simplex $\bigtriangleup^d$ requires $\sum_i y_i = 1$ and $\forall_i \ y_i \geq 0$, i.e. it makes $y$ look like a distribution.
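Because the objective $\theta^T y$ is linear, its maximum over the simplex is attained at a vertex, i.e. a one-hot vector selecting the largest score. A minimal NumPy sketch (function name is my own):

```python
import numpy as np

def hard_argmax(theta):
    """Solve max_{y in simplex} theta^T y.

    A linear objective over the simplex peaks at a vertex,
    so the solution is the one-hot vector of the top score.
    """
    y = np.zeros_like(theta, dtype=float)
    y[np.argmax(theta)] = 1.0
    return y

theta = np.array([1.0, 3.0, 2.0])
print(hard_argmax(theta))  # one-hot at index 1: [0. 1. 0.]
```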
2. Softmax
Softmax, on the other hand, can be formulated on top of argmax:

$$\text{softmax}(\theta) = \arg\max_{y \in \bigtriangleup^d} \theta^T y - y^T \ln(y)$$

where the added term $-y^T \ln(y)$ is the entropy of $y$, acting as a regularizer/normalizer. (This form is exactly as it appears in the SparseMAP paper.) The immediate question is: how is this equation softmax?
To see why, we first need to solve for $y_i$. Rewrite the above optimization as a minimization:

$$\hat{y} = \arg\min_{y \in \bigtriangleup^d} y^T \ln(y) - \theta^T y$$
From this form we can see the objective is strictly convex. Thus we can take its Lagrangian:

$$\mathcal{L}(y, \lambda_1, \lambda_2) = y^T \ln(y) - \theta^T y - \lambda_1 (\mathbf{1}^T y - 1) - \lambda_2^T y$$
With the KKT conditions and complementary slackness, we have the following:

$$\frac{\partial \mathcal{L}}{\partial y_i} = \ln(y_i) + 1 - \theta_i - \lambda_1 - \lambda_{2,i} = 0, \qquad \lambda_{2,i} y_i = 0, \qquad \lambda_{2,i} \geq 0$$

Since the $\ln(y_i)$ term forces $y_i > 0$, slackness gives $\lambda_{2,i} = 0$.
Then we have $y_i = \exp(\theta_i + \lambda_1 - 1)$. To solve for $\lambda_1$, we need the simplex constraint:

$$\sum_i y_i = \sum_i \exp(\theta_i + \lambda_1 - 1) = 1 \quad \Rightarrow \quad e^{\lambda_1 - 1} = \frac{1}{\sum_i e^{\theta_i}}$$
Plugging this back gives $y_i = \frac{e^{\theta_i}}{\sum_j e^{\theta_j}}$, i.e. softmax itself.
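The closed form is easy to sanity-check numerically: the softmax output should score at least as high as any other point on the simplex under the entropy-regularized objective. A small sketch (function names are my own):

```python
import numpy as np

def softmax(theta):
    # Shift by the max for numerical stability; the solution is unchanged.
    e = np.exp(theta - theta.max())
    return e / e.sum()

def objective(theta, y):
    # theta^T y - y^T ln(y), with 0 * ln(0) treated as 0.
    entropy = -np.sum(np.where(y > 0, y * np.log(y), 0.0))
    return theta @ y + entropy

theta = np.array([0.5, 1.5, -0.2])
y_star = softmax(theta)

# Compare against random points on the simplex (Dirichlet draws):
# the closed-form solution should never lose.
rng = np.random.default_rng(0)
others = [rng.dirichlet(np.ones(3)) for _ in range(100)]
assert all(objective(theta, y_star) >= objective(theta, y) for y in others)
```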
3. Sparsemax
Sparsemax uses an $L_2$ regularizer instead of the entropy term:

$$\text{sparsemax}(\theta) = \arg\max_{y \in \bigtriangleup^d} \theta^T y - \frac{1}{2} \|y\|_2^2$$
Again, by rewriting it as an $\arg\min$ we can see the objective is strictly convex; in fact, it is equivalent (up to a constant) to the Euclidean projection of $\theta$ onto the simplex:

$$\text{sparsemax}(\theta) = \arg\min_{y \in \bigtriangleup^d} \|y - \theta\|_2^2$$

so the problem is a QP. Further, because it does not have the $\ln(y)$ term of softmax, nothing prohibits $y_i = 0$, which is exactly what produces sparse outputs but also makes the problem harder to solve analytically. Simply put, one can use ADMM or an off-the-shelf solver for the QP optimization.
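That said, projection onto the simplex admits a well-known closed-form procedure (sort the scores, find the support, threshold), so in practice a general QP solver is not required. A minimal sketch:

```python
import numpy as np

def sparsemax(theta):
    """Euclidean projection of theta onto the probability simplex."""
    z = np.sort(theta)[::-1]             # scores sorted in descending order
    cssv = np.cumsum(z)                  # cumulative sums of sorted scores
    k = np.arange(1, len(z) + 1)
    support = z * k > (cssv - 1)         # which sorted entries stay positive
    k_z = k[support][-1]                 # size of the support
    tau = (cssv[support][-1] - 1) / k_z  # threshold subtracted from scores
    return np.maximum(theta - tau, 0.0)

theta = np.array([2.0, 1.0, -1.0])
print(sparsemax(theta))  # [1. 0. 0.] -- mass concentrates, zeros appear
```

Note how, unlike softmax, low-scoring coordinates receive exactly zero probability, which is the "sparse" in sparsemax.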
But either way, we still need the gradient in the backward pass for end-to-end learning.
Well, gradient TODO.