For example, missing values still have a significant performance impact for arrays of Float64 elements, which are essential for numeric computing.
function sum_nonmissing(X::AbstractArray)
    s = zero(eltype(X))
    @inbounds @simd for x in X
        if x !== missing
            s += x
        end
    end
    s
end
julia> using BenchmarkTools
julia> Y1 = rand(10_000_000);
julia> Y2 = Vector{Union{Missing, Float64}}(Y1);
julia> Y3 = ifelse.(rand(length(Y2)) .< 0.9, Y2, missing);
julia> @btime sum_nonmissing(Y1);
5.733 ms (1 allocation: 16 bytes)
julia> @btime sum_nonmissing(Y2);
13.854 ms (1 allocation: 16 bytes)
julia> @btime sum_nonmissing(Y3);
17.780 ms (1 allocation: 16 bytes)
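For reference, Base also provides skipmissing, which expresses the same reduction without a hand-written loop; a minimal sketch (not benchmarked here, and the result may differ from the loop in the last few bits depending on summation order):
julia> sum(skipmissing(Y3));  # sums only the non-missing entries of Y3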
I still see the performance drop in Julia v1.0.1, but I found no GitHub issue tracking it. Is there one open already?
At least missing is better than NaN for this purpose.
(timings under julia-1.6.1)
using BenchmarkTools
function sum_non_nan(X::AbstractArray)
    s = zero(eltype(X))
    @inbounds @simd for x in X
        # it is even slower with isnan()
        if x !== NaN
            s += x
        end
    end
    s
end
julia> Y1 = rand(10_000_000);
julia> @btime sum_nonmissing($Y1);
2.685 ms (0 allocations: 0 bytes)
julia> @btime sum_non_nan($Y1);
6.807 ms (0 allocations: 0 bytes)
julia> Y2 = Vector{Union{Missing, Float64}}(Y1);
julia> @btime sum_nonmissing($Y2);
7.112 ms (0 allocations: 0 bytes)
# Y2_nan would be identical to Y1, so see the sum_non_nan($Y1) timing above (6.807 ms)
julia> Y3 = ifelse.(rand(length(Y2)) .< 0.9, Y2, missing);
julia> Y3_nan = Array{Float64}(replace(x->ismissing(x) ? NaN : x, Y3));
julia> @btime sum_nonmissing($Y3);
12.180 ms (0 allocations: 0 bytes)
julia> @btime sum_non_nan($Y3_nan);
13.216 ms (0 allocations: 0 bytes)
julia> x = rand(10_000_000);
julia> function sum_non_nan(X::AbstractArray)
           s = zero(eltype(X))
           @inbounds @simd for x in X
               # simplify the branch so it can SIMD.
               s += isnan(x) ? zero(x) : x
           end
           s
       end
sum_non_nan (generic function with 1 method)
julia> @btime sum_non_nan($x)
3.941 ms (0 allocations: 0 bytes)
5.00030913257381e6
julia> @btime sum($x)
4.268 ms (0 allocations: 0 bytes)
5.0003091325738225e6
julia> @btime sum_non_nan($Y3);
16.211 ms (0 allocations: 0 bytes)
Here Y3 should be Y3_nan? Also, x !== NaN is not the same thing as isnan(x). (I originally justified that with "since NaN != NaN", which conflates != with !==: in fact NaN === NaN, while NaN != NaN. Thanks @fph for pointing this out.)
Note that this uses === comparison. It is not the same thing as isnan, but it should work as long as one does not use NaNs with payloads, signaling NaNs, and the like.
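To illustrate (a quick sketch; the bit pattern 0x7ff8000000000001 is just an arbitrary quiet NaN with a non-zero payload):
julia> payload_nan = reinterpret(Float64, 0x7ff8000000000001);  # a NaN whose bits differ from the canonical NaN
julia> isnan(payload_nan)
true
julia> payload_nan === NaN  # different bit pattern, so the bitwise (egal) comparison is false
false
julia> payload_nan !== NaN  # the x !== NaN loop above would therefore add this NaN to the sum
true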
julia> function sum_non_nan(X::AbstractArray)
           s = zero(eltype(X))
           @inbounds @simd for x in X
               # simplify the branch so it can SIMD.
               s += isnan(x) ? zero(x) : x
           end
           s
       end
julia> function sum_nonmissing(X::AbstractArray)
           s = zero(eltype(X))
           @inbounds @simd for x in X
               s += ismissing(x) ? zero(x) : x
           end
           s
       end
julia> Y1 = rand(10_000_000);
julia> Y2 = Vector{Union{Missing, Float64}}(Y1);
julia> Y3 = ifelse.(rand(length(Y2)) .< 0.9, Y2, missing);
julia> Y3_nan = Array{Float64}(replace(x->ismissing(x) ? NaN : x, Y3));
julia> @btime sum_nonmissing($Y1)
9.132 ms (0 allocations: 0 bytes)
4.999213478955774e6
julia> @btime sum_non_nan($Y1)
10.114 ms (0 allocations: 0 bytes)
4.999213478955774e6
julia> @btime sum_nonmissing($Y2);
17.643 ms (0 allocations: 0 bytes)
julia> @btime sum_nonmissing($Y3);
13.534 ms (0 allocations: 0 bytes)
julia> @btime sum_non_nan($Y3_nan);
10.156 ms (0 allocations: 0 bytes)
The only time sum_nonmissing seems more efficient to me is with @btime sum_nonmissing($Y1). That's because the ismissing check can be elided entirely when the function is applied to a Vector{Float64}: for a concrete Float64 element, ismissing is statically known to be false.
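A quick way to see why (a sketch, reusing the Y1 and Y2 defined above):
julia> ismissing(1.0)  # statically false for any Float64, so the check folds away
false
julia> eltype(Y1)      # concrete element type: no run-time missing check needed
Float64
julia> eltype(Y2)      # here each element really must be checked at run time
Union{Missing, Float64}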
For example, the bit pattern of an IEEE 754 single-precision (32-bit) NaN is s111 1111 1xxx xxxx xxxx xxxx xxxx xxxx, where s is the sign bit (most often ignored in applications) and the x bits form a non-zero significand (a significand of zero encodes the infinities instead).
Thus, the bit sequences of two NaNs are not necessarily the same, and my guess is that === for two floats compares their bit sequences.
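That guess can be sanity-checked with signed zeros, which show the same behaviour from the other direction (numerically equal, bitwise different):
julia> 0.0 == -0.0   # numerically equal
true
julia> 0.0 === -0.0  # the sign bit differs, so the bitwise comparison disagrees
false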
What @Tamas_Papp means is that there's no "negative NaN" in the sense that negativity doesn't make sense for NaN: NaN-ness is a property irrespective of the sign bit.
Note how the significand of the second NaN is not 0, but the number is still NaN. It's the same with the sign bit: what makes a value NaN is an exponent of all ones together with a non-zero significand (which is exactly how IEEE 754 specifies NaNs); the sign bit plays no role.
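This is easy to poke at from Julia with reinterpret, which builds a float directly from its bit pattern (shown here for single precision; the literals spell out sign, exponent, significand):
julia> reinterpret(Float32, 0b0_11111111_10000000000000000000000)  # exponent all ones, non-zero significand
NaN32
julia> reinterpret(Float32, 0b0_11111111_00000000000000000000000)  # same exponent, zero significand
Inf32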
The === comparison compares bit patterns, not numerical equality: it asks whether the two values are exactly the same, not whether they are semantically equal. ==, which tests semantic (numerical) equality, indeed gives false:
julia> -NaN |> bitstring
"1111111111111000000000000000000000000000000000000000000000000000"
julia> NaN |> bitstring
"0111111111111000000000000000000000000000000000000000000000000000"
julia> (-NaN) == NaN
false
# according to IEEE 754, ==, <, <=, >, >= involving NaN are all false (only != is true)
julia> NaN == NaN
false
The Wikipedia article on NaN has a lot of useful info about how NaNs can be compared and what the results should be, as well as what NaNs are sometimes used for when no other means of signaling or error checking is available.
My understanding of IEEE 754 is that -NaN flipping the sign bit is implementation-dependent, and the sign bit of NaN results may be accidental anyway in conforming implementations when both the input and the output are NaN (it is not part of the payload, and is generally ignored, except for a few special cases enumerated in the standard). See Section 6.3 of IEEE 754-2008.
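For what it's worth, in the session above (see the bitstrings earlier) negation did flip the sign bit of NaN, while isnan ignores the sign entirely; a quick check:
julia> signbit(-NaN), signbit(NaN)  # negation flipped the sign bit here
(true, false)
julia> isnan(-NaN), isnan(NaN)      # NaN-ness is unaffected by the sign bit
(true, true)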