This seems mathematically questionable to me in the case where there are multiple Inf entries, because it assumes that all Inf values are the same (i.e. it acts as though Inf/Inf == 1).
For example, it gives softmax([1,2,Inf,Inf]) == [0,0,0.5,0.5], whereas I would tend to say that a more formally correct answer would be [0,0,NaN,NaN].
That being said, perhaps it is more useful in ML applications to give 0.5 than NaN, i.e. to split the softmax result equally between all Inf entries.
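For illustration, here is a minimal sketch (in Python, since the arithmetic is language-agnostic; the function name is hypothetical and this is not NNlib's actual code) of the equal-split convention described above: when the maximum is +Inf, each infinite entry gets an equal share instead of the NaN that `Inf - Inf` would otherwise produce.

```python
import math

def softmax_split_inf(xs):
    # hypothetical sketch: when the max is +inf, give every inf entry an
    # equal share of the mass instead of propagating inf - inf == nan
    m = max(xs)
    if math.isinf(m) and m > 0:
        n_inf = sum(1 for x in xs if x == m)
        return [1.0 / n_inf if x == m else 0.0 for x in xs]
    # ordinary shifted softmax for finite inputs
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

inf = float("inf")
print(softmax_split_inf([1.0, 2.0, inf, inf]))  # [0.0, 0.0, 0.5, 0.5]
```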
For softmax there’s a difference between “potentially large” and infinite. In the latter case, as discussed in other replies, you have to special-case the implementation, and with multiple infinities you have to resort to a convention.
For “potentially large”, on the other hand, you will run into trouble with a naive implementation. E.g.
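Something like the following (sketched in Python for brevity; in Julia `exp(1000.0)` silently overflows to `Inf` and the naive version returns `NaN`s rather than raising an error):

```python
import math

def softmax_naive(xs):
    # direct translation of softmax: exp(x_i) / sum_j exp(x_j);
    # breaks as soon as any entry exceeds ~709 in Float64
    es = [math.exp(x) for x in xs]  # math.exp raises OverflowError here
    s = sum(es)
    return [e / s for e in es]

def softmax_stable(xs):
    # subtract the maximum before exponentiating: the largest term becomes
    # exp(0) == 1, so neither the terms nor their sum can overflow;
    # very small entries may underflow to 0.0, which is harmless
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

print(softmax_stable([998.0, 999.0, 1000.0]))
try:
    softmax_naive([998.0, 999.0, 1000.0])
except OverflowError:
    print("naive version overflowed")
```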
This way all the exponentiated values are scaled proportionally so that the largest value is one and the overflow problems are gone. You might underflow the smaller elements but that has no practical consequence.
Splitting the mass equally can be a reasonable graceful degradation, and if it’s part of a model that is trained end-to-end it may plausibly learn to take the convention into account. However, most of the time something has gone wrong if you run into infinities at all, and a NaN output makes the problem more apparent.
As far as I understand, following this rewrite cuDNN’s softmax should use “accurate” arithmetic by default (via the CUDNN_SOFTMAX_ACCURATE flag), which should properly handle infinities. So it may also be a matter of which CUDA and/or NNlib and/or Julia version you are using.