 # Softmax and large numbers

I need to apply a softmax layer to values that are potentially large and I am getting NaN values. Playing around a bit more I found the following:

julia> using CUDA, Flux

julia> softmax(gpu(Float32[1, 2, 3, Inf]))
4-element CuArray{Float32,1}:
0.0
0.0
0.0
NaN

julia> softmax(Float32[1, 2, 3, Inf])
4-element Array{Float32,1}:
0.0
0.0
0.0
1.0


For comparison, both tensorflow and pytorch return [nan, nan, nan, nan] for this on cpu and gpu. Is this a bug? What is the “correct” implementation?

It seems like the implementations for CUDA and CPU arrays are different. The cpu version explicitly handles infinities, while cuda does not:

vs

If you look at line 57 you see that Inf is handled specially.

1 Like

[0,0,0,1] is unambiguously the correct output, as you can easily see if you compute

\lim_{x\to\infty} \frac{[e^1, e^2, e^3, e^x]}{e^1 + e^2 + e^3 + e^x}
2 Likes

This seems mathematically questionable to me in the case where there are multiple Inf entries, because it assumes that all Inf values are the same (i.e. it acts as though Inf/Inf == 1).

For example, it gives softmax([1,2,Inf,Inf]) == [0,0,0.5,0.5], whereas I would tend to say that a more formally correct answer would be [0,0,NaN,NaN].

That being said, perhaps it is more useful in ML applications to give 0.5 than NaN, i.e. to split the softmax result equally between all Inf entries.

2 Likes

For softmax there’s a difference between “potentially large” and infinite. In the latter case, as discussed in other replies, you have to special case the implementation and in the case of multiple infinities resort to conventions.

For “potentially large”, on the other hand, you will run into trouble with a naive implementation. E.g.

julia> x=Float32.([87, 88, 89, 90])
4-element Vector{Float32}:
87.0
88.0
89.0
90.0

julia> exp.(x) ./ sum(exp.(x))
4-element Vector{Float32}:
0.0
0.0
NaN
NaN


since

julia> exp.(x)
4-element Vector{Float32}:
6.0760303f37
1.6516363f38
Inf
Inf


The canonical solution to this is to first subtract the largest value from all elements, as you can also see in the pasted code in another reply.

julia> y = x .- maximum(x)
4-element Vector{Float32}:
-3.0
-2.0
-1.0
0.0

julia> exp.(y)
4-element Vector{Float32}:
0.049787067
0.13533528
0.36787945
1.0

julia> exp.(y) ./ sum(exp.(y))
4-element Vector{Float32}:
0.032058604
0.08714432
0.23688284
0.6439143


This way all the exponentiated values are scaled proportionally so that the largest value is one and the overflow problems are gone. You might underflow the smaller elements but that has no practical consequence.

2 Likes

Splitting it can be a reasonable graceful degradation and if it’s part of a model that is trained end-to-end it may plausibly learn to take the convention into account. However, most of the time something has gone wrong if you run into infinities at all, and a NaN output will make the problem more apparent.

2 Likes

Thanks everyone, this discussion was really insightful!

This issue came up in a recent major rewriting of the CUDA internals by @denizyuret: https://github.com/JuliaGPU/CUDA.jl/pull/523#issuecomment-753416384

As far as I understand, following this rewrite CUDA’s softmax should default to “accurate” arithmetic by default (via the CUDNN_SOFTMAX_ACCURATE flag), which should properly handle infinities. So it may also be an issue of which CUDA and/or NNlib and/or Julia version you are using.