For comparison, both tensorflow and pytorch return [nan, nan, nan, nan] for this on cpu and gpu. Is this a bug? What is the "correct" implementation?
This seems mathematically questionable to me in the case where there are multiple Inf entries, because it assumes that all Inf values are the same (i.e. it acts as though Inf/Inf == 1).
For example, it gives softmax([1,2,Inf,Inf]) == [0,0,0.5,0.5], whereas I would tend to say that a more formally correct answer would be [0,0,NaN,NaN].
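For illustration (in NumPy, since the comparison above is with tensorflow/pytorch), the all-NaN result falls straight out of the standard max-shifted softmax, because `Inf - Inf` is NaN:

```python
import numpy as np

def softmax_shifted(x):
    # Standard numerically stable softmax: subtract the max before exp.
    # With multiple +Inf entries, Inf - Inf = NaN, which then propagates
    # through the sum and the division, giving NaN everywhere.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

out = softmax_shifted(np.array([1.0, 2.0, np.inf, np.inf]))
# out is [nan, nan, nan, nan]
```

So [0, 0, 0.5, 0.5] can only come from deliberately special-casing infinities, not from the plain formula.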
That being said, perhaps it is more useful in ML applications to give 0.5 than NaN, i.e. to split the softmax result equally between all Inf entries.
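A minimal sketch of that convention (a hypothetical helper, not NNlib's actual implementation): treat every +Inf entry as an equal winner and give everything else probability zero:

```python
import numpy as np

def softmax_split_inf(x):
    # Hypothetical convention: if any entries are +Inf, split the
    # probability mass equally among them; everything else gets 0.
    inf_mask = np.isposinf(x)
    if inf_mask.any():
        return inf_mask / inf_mask.sum()
    # Otherwise fall back to the usual max-shifted softmax.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

out = softmax_split_inf(np.array([1.0, 2.0, np.inf, np.inf]))
# out is [0, 0, 0.5, 0.5]
```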
For softmax there's a difference between "potentially large" and infinite. In the latter case, as discussed in other replies, you have to special-case the implementation and, with multiple infinities, resort to conventions.
For "potentially large", on the other hand, you will run into trouble with a naive implementation. E.g.:
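The original snippet isn't preserved here; a minimal NumPy illustration of the problem and the usual max-subtraction fix:

```python
import numpy as np

x = np.array([10.0, 1000.0])  # large but finite inputs

# Naive softmax: np.exp(1000.0) overflows float64 to Inf,
# so the result is [0, Inf/Inf] = [0, NaN].
naive = np.exp(x) / np.sum(np.exp(x))

# Stable softmax: subtract the max before exponentiating,
# so the largest exponentiated value is exactly 1.
z = np.exp(x - np.max(x))
stable = z / np.sum(z)
# stable is [0, 1]: exp(-990) underflows to 0, which is harmless.
```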
This way all the exponentiated values are scaled proportionally so that the largest value is one and the overflow problems are gone. You might underflow the smaller elements but that has no practical consequence.
Splitting it can be a reasonable graceful degradation, and if it's part of a model that is trained end-to-end, it may plausibly learn to take the convention into account. However, most of the time something has gone wrong if you run into infinities at all, and a NaN output will make the problem more apparent.
As far as I understand, following this rewrite CUDA's softmax should default to "accurate" arithmetic (via the CUDNN_SOFTMAX_ACCURATE flag), which should properly handle infinities. So it may also be a matter of which CUDA and/or NNlib and/or Julia version you are using.