Zero gradient when using argmax

The derivative of argmax is zero almost everywhere, because argmax is piecewise constant in its input (and it is undefined, or a delta distribution, at the points where the maximizing index switches).
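
A minimal sketch of what that means in practice, assuming JAX (the names `pick`, `scores`, and `values` are just illustrative): an objective that depends on its input only through the argmax index gets an identically zero gradient.

```python
import jax
import jax.numpy as jnp

def pick(scores, values):
    # scores enter only through the integer argmax index, which is
    # piecewise constant in scores, so d(output)/d(scores) is zero
    return values[jnp.argmax(scores)]

scores = jnp.array([0.1, 2.0, 0.5])
values = jnp.array([10.0, 20.0, 30.0])

print(jax.grad(pick, argnums=0)(scores, values))  # [0. 0. 0.]
```

A gradient-based optimizer handed this objective sees a zero gradient everywhere and never moves.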

This is why people use differentiable approximations like softmax for optimization, or reformulations like the epigraph trick to turn discrete minimax problems into differentiable NLPs: min_x max_i f_i(x) becomes min_{x,t} t subject to f_i(x) <= t for all i, which is smooth whenever the f_i are.
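
For the softmax route, here is a sketch of the usual smoothing, again assuming JAX (`soft_argmax` and the temperature `tau` are illustrative names, not a library function): replace the hard index by the expected index under a softmax, which is differentiable and approaches argmax as the temperature goes to zero.

```python
import jax
import jax.numpy as jnp

def soft_argmax(x, tau=0.5):
    # softmax turns the scores into weights; the expected index under those
    # weights is a smooth surrogate for argmax (sharper as tau -> 0)
    p = jax.nn.softmax(x / tau)
    return jnp.dot(p, jnp.arange(x.shape[0], dtype=x.dtype))

x = jnp.array([0.1, 2.0, 0.5])
print(soft_argmax(x))            # close to 1.0, the true argmax index
print(jax.grad(soft_argmax)(x))  # nonzero gradient an optimizer can follow
```

The trade-off is the usual one: a smaller tau approximates argmax more closely but makes the gradients steeper and less well conditioned.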
