Softmax and large numbers

gdkrmr · April 6, 2021, 10:50am

I need to apply a softmax layer to values that are potentially large, and I am getting NaN values. Playing around a bit more, I found the following:

julia> using CUDA, Flux

julia> softmax(gpu(Float32[1, 2, 3, Inf]))
4-element CuArray{Float32,1}:
   0.0
   0.0
   0.0
 NaN

julia> softmax(Float32[1, 2, 3, Inf])
4-element Array{Float32,1}:
 0.0
 0.0
 0.0
 1.0

For comparison, both TensorFlow and PyTorch return [nan, nan, nan, nan] for this on CPU and GPU. Is this a bug? What is the “correct” implementation?

DNF · April 6, 2021, 12:43pm

It seems like the implementations for CUDA and CPU arrays are different. The CPU version explicitly handles infinities, while the CUDA version does not:
https://github.com/FluxML/NNlib.jl/blob/master/lib/NNlibCUDA/src/cudnn/softmax.jl
vs

https://github.com/FluxML/NNlib.jl/blob/2c8af3051cb6b41f8b07f3357edaaed1b5db2f8e/src/softmax.jl#L57

softmax(x; dims = 1) = softmax!(similar(x, (float ∘ eltype)(x)), x; dims = dims)

softmax!(x; dims = 1) = softmax!(x, x; dims = dims)

function softmax!(out::AbstractArray{T}, x::AbstractArray; dims = 1) where {T}
    max_ = maximum(x; dims = dims)
    if all(isfinite, max_)
        out .= exp.(x .- max_)
    else
        @. out = ifelse(isequal(max_,Inf), ifelse(isequal(x,Inf), 1, 0), exp(x - max_))
    end
    out ./= sum(out; dims = dims)  # could re-use max_ when dims != (:) and eltype(x) == T.
end

∇softmax(Δ::AbstractArray{T}, x::AbstractArray, y::AbstractArray{S}; dims = 1) where {T,S} =
    ∇softmax!(similar(y, promote_type(T, S)), Δ, x, y; dims = dims)

## Can introduce at the end of deprecation cycle of ∇softmax!(out, Δ, x; dims = 1)
#∇softmax!(Δ, x, y; dims = 1) = ∇softmax!(Δ, Δ, x, y; dims = dims)

If you look at line 57 you see that Inf is handled specially.

stevengj · April 6, 2021, 1:12pm

[0,0,0,1] is unambiguously the correct output, as you can easily see if you compute

\lim_{x\to\infty} \frac{[e^1, e^2, e^3, e^x]}{e^1 + e^2 + e^3 + e^x}
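
As a short worked step (added for illustration, not part of the original post), the limit can be made explicit by dividing the numerator and the denominator by e^x, so that every term except the last one vanishes:

```latex
\lim_{x\to\infty} \frac{[e^1, e^2, e^3, e^x]}{e^1 + e^2 + e^3 + e^x}
= \lim_{x\to\infty} \frac{[e^{1-x}, e^{2-x}, e^{3-x}, 1]}{e^{1-x} + e^{2-x} + e^{3-x} + 1}
= \frac{[0, 0, 0, 1]}{0 + 0 + 0 + 1}
= [0, 0, 0, 1]
```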

stevengj · April 6, 2021, 1:17pm

DNF:

function softmax!(out::AbstractArray{T}, x::AbstractArray; dims = 1) where {T}
    max_ = maximum(x; dims = dims)
    if all(isfinite, max_)
        out .= exp.(x .- max_)
    else
        @. out = ifelse(isequal(max_,Inf), ifelse(isequal(x,Inf), 1, 0), exp(x - max_))
    end
    out ./= sum(out; dims = dims)  # could re-use max_ when dims != (:) and eltype(x) == T.
end

This seems mathematically questionable to me in the case where there are multiple Inf entries, because it assumes that all Inf values are the same (i.e. it acts as though Inf/Inf == 1).

For example, it gives softmax([1,2,Inf,Inf]) == [0,0,0.5,0.5], whereas I would tend to say that a more formally correct answer would be [0,0,NaN,NaN].

That being said, perhaps it is more useful in ML applications to give 0.5 than NaN, i.e. to split the softmax result equally between all Inf entries.
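
To see the convention in isolation, here is a minimal standalone sketch of the same branch logic for a plain vector. The name `softmax_inf_convention` is hypothetical and not part of NNlib; the function only mirrors the quoted code so that the behaviour with one and with several Inf entries can be compared.

```julia
# Standalone sketch of the Inf-handling branch from the code quoted above.
# `softmax_inf_convention` is a hypothetical name, not an NNlib function.
function softmax_inf_convention(x::AbstractVector{<:Real})
    m = maximum(x)
    out = if isfinite(m)
        # Ordinary case: shift by the maximum, then exponentiate.
        exp.(x .- m)
    else
        # If the maximum is Inf, every Inf entry gets weight 1 and all others 0.
        [m == Inf ? (xi == Inf ? 1.0 : 0.0) : exp(xi - m) for xi in x]
    end
    return out ./ sum(out)
end

softmax_inf_convention([1, 2, 3, Inf])    # [0.0, 0.0, 0.0, 1.0]
softmax_inf_convention([1, 2, Inf, Inf])  # [0.0, 0.0, 0.5, 0.5]
```

The second call is where the convention matters: the two Inf entries are treated as equal, so the result is split evenly between them rather than reported as NaN.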

GunnarFarneback · April 6, 2021, 1:40pm

For softmax there’s a difference between “potentially large” and infinite. In the latter case, as discussed in other replies, you have to special-case the implementation and, in the case of multiple infinities, resort to conventions.

For “potentially large”, on the other hand, you will run into trouble with a naive implementation. E.g.

julia> x=Float32.([87, 88, 89, 90])
4-element Vector{Float32}:
 87.0
 88.0
 89.0
 90.0

julia> exp.(x) ./ sum(exp.(x))
4-element Vector{Float32}:
   0.0
   0.0
 NaN
 NaN

since

julia> exp.(x)
4-element Vector{Float32}:
  6.0760303f37
  1.6516363f38
 Inf
 Inf

The canonical solution to this is to first subtract the largest value from all elements, as you can also see in the pasted code in another reply.

julia> y = x .- maximum(x)
4-element Vector{Float32}:
 -3.0
 -2.0
 -1.0
  0.0

julia> exp.(y)
4-element Vector{Float32}:
 0.049787067
 0.13533528
 0.36787945
 1.0

julia> exp.(y) ./ sum(exp.(y))
4-element Vector{Float32}:
 0.032058604
 0.08714432
 0.23688284
 0.6439143

This way all the exponentiated values are scaled proportionally so that the largest value is one and the overflow problems are gone. You might underflow the smaller elements but that has no practical consequence.
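
The steps above can be wrapped into a small function. This is a minimal sketch for a vector of finite values, assuming the hypothetical helper name `stable_softmax`; it is not the NNlib implementation and does not handle Inf inputs.

```julia
# Minimal sketch of the max-subtraction trick for a vector of finite values.
# `stable_softmax` is a hypothetical helper name, not the NNlib implementation.
function stable_softmax(x::AbstractVector{<:AbstractFloat})
    shifted = x .- maximum(x)   # largest entry becomes 0, all others are <= 0
    e = exp.(shifted)           # every value is at most 1, so exp cannot overflow
    return e ./ sum(e)          # sum(e) >= 1 because the largest term is exactly 1
end

x = Float32[87, 88, 89, 90]
p = stable_softmax(x)

all(isfinite, p)                  # true, unlike exp.(x) ./ sum(exp.(x))
p ≈ stable_softmax(x .- 1000f0)   # true: softmax is unchanged by a constant shift
```

The last line illustrates why the trick is valid: subtracting any constant from all inputs leaves the softmax output unchanged, so the maximum is simply the most convenient constant to subtract.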

GunnarFarneback · April 6, 2021, 1:57pm

Splitting it can be a reasonable graceful degradation and if it’s part of a model that is trained end-to-end it may plausibly learn to take the convention into account. However, most of the time something has gone wrong if you run into infinities at all, and a NaN output will make the problem more apparent.
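
If surfacing the problem early is preferred over any convention, one option is to check the input before applying softmax. The sketch below assumes a hypothetical wrapper named `checked_softmax`; neither the name nor the error behaviour comes from NNlib or Flux.

```julia
# Hypothetical wrapper that fails fast instead of applying a convention for Inf.
# Neither the name nor the behaviour comes from NNlib; it only illustrates the idea.
function checked_softmax(x::AbstractVector{<:AbstractFloat})
    all(isfinite, x) || throw(DomainError(x, "softmax input contains Inf or NaN"))
    e = exp.(x .- maximum(x))   # max-subtraction keeps exp from overflowing
    return e ./ sum(e)
end

checked_softmax(Float32[1, 2, 3])           # works as usual
# checked_softmax(Float32[1, 2, 3, Inf])    # would throw a DomainError
```

Throwing at the point where the non-finite value first appears makes the failure easier to trace than a NaN that shows up later in the loss.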

gdkrmr · April 6, 2021, 7:35pm

Thanks everyone, this discussion was really insightful!

jondeuce · April 6, 2021, 9:32pm

This issue came up in a recent major rewriting of the CUDA internals by @denizyuret: https://github.com/JuliaGPU/CUDA.jl/pull/523#issuecomment-753416384

As far as I understand, following this rewrite CUDA’s softmax should use “accurate” arithmetic by default (via the CUDNN_SOFTMAX_ACCURATE flag), which should properly handle infinities. So it may also be a matter of which CUDA, NNlib, and/or Julia version you are using.
