Fastest way to check for Inf or NaN in an array?

Fantastic, thanks! This looks to be a winner:

julia> @btime isfinite(sum(x)) setup=x=randn(Float32, 1000)
  80.964 ns (0 allocations: 0 bytes)
true

julia> @btime y(x) setup=x=randn(Float32, 1000)
  75.558 ns (0 allocations: 0 bytes)
true
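
For context: y here was defined earlier in the thread and is not quoted above. Presumably it is the LoopVectorization-based check being discussed; the following is a rough reconstruction, so the exact body and settings are an assumption:

using LoopVectorization

# Hypothetical reconstruction of y(x): multiply every element by zero and
# accumulate. Any NaN or Inf turns the accumulator into NaN, which the final
# isfinite call detects.
function y(x)
    s = zero(eltype(x))
    @turbo for i in eachindex(x)
        s += x[i] * 0
    end
    return isfinite(s)
end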

FYI, not the same performance picture with Julia 1.7.2 on a Win 11 laptop, using 8 threads and LoopVectorization v0.12.102:

@btime isfinite(sum(x)) setup=x=randn(Float32, 1000) # 36 ns (0 allocs: 0 bytes)
@btime y(x) setup=x=randn(Float32, 1000)             # 102 ns (0 allocs: 0 bytes)
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, icelake-client)
Environment:
  JULIA_PKG_USE_CLI_GIT = true
  JULIA_STACKFRAME_FUNCTION_COLOR = blue
  JULIA_WARN_COLOR = cyan
  JULIA_EDITOR = code.cmd -g
  JULIA_NUM_THREADS = 8

1 Like

Can confirm @rafael.guerra's findings.

@btime isfinite(sum(x)) setup=x=randn(Float32, 1000) # 24.473 ns (0 allocations: 0 bytes)
@btime y(x) setup=x=randn(Float32, 1000) # 76.567 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver3)

Also, I think this is probably an example of premature optimization… saving 5 nanoseconds (which was the result on your machine) is not going to make a difference. You’d have to call it half a billion times for it to even be noticeably slower, and even then it still wouldn’t be worth it unless you’ve already optimized literally everything else in your code.

EDIT: For posterity, I will note that for an array with 10,000 elements, y(x) ran in 350ns while isfinite(sum(x)) ran in 363ns. So y(x) might be faster for larger arrays, but we are still talking a few billionths of a second.
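
To reproduce that larger-array comparison (a sketch; the Float32 element type is an assumption carried over from the benchmarks above, and y is the check from earlier in the thread):

using BenchmarkTools

@btime isfinite(sum(x)) setup=(x=randn(Float32, 10_000))
@btime y(x)             setup=(x=randn(Float32, 10_000))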

4 Likes

How big does the array have to be before the difference in compilation time is made up for, heh?

2 Likes

Indeed, this is exactly my situation. This NaN/Inf checker is part of SymbolicRegression.jl, where billions of generated expressions need to be evaluated over an input array, and switching the check from all(isfinite, x) to isfinite(sum(x)) changed the overall search speed of SymbolicRegression.jl by 25%! You can see the measurements in [Performance] Single evaluation results · Issue #73 · MilesCranmer/SymbolicRegression.jl · GitHub - equation evaluation is the bottleneck here, so any tiny improvement, even to the NaN checker, counts quite a bit.

1 Like

I’m not sure why the timings are so different between my benchmarks and these. I am running my tests on an Apple Silicon M1 (aarch64), which still has some issues, so maybe the SIMD operations are not yet fully optimized…?

Anyway, I actually went back to isfinite(sum(x)) in SymbolicRegression.jl because I was seeing some issues with @turbo, and the two are quite close in timing. So it’s good to hear the sum version is actually faster on other architectures.

But yeah if there’s still a faster way to do this, I’d love to hear it.

Sorry for the ping, but I ended up finding an even faster solution. It turns out that @Oscar_Smith’s solution can be faster than sum, but you have to replace @turbo with @inbounds @fastmath. I think @turbo’s default unroll settings might not be suited to array chunks of this size?

function is_good_array_fastmath(x)
    for i in firstindex(x):64:(lastindex(x)-63)
        s = zero(eltype(x))
        @inbounds @fastmath for j in 0:63
            s += x[i+j]*0
        end
        !isfinite(s) && return false
    end
    return all(isfinite, @view x[max(end-64,begin):end])
end

Comparison:

function is_good_array_turbo(x)
    for i in firstindex(x):64:(lastindex(x)-63)
        s = zero(eltype(x))
        @turbo for j in 0:63
            s += x[i+j]*0
        end
        !isfinite(s) && return false
    end
    return all(isfinite, @view x[max(end-64,begin):end])
end

is_good_array_sum(x) = isfinite(sum(x))

which gives (on my mac M1):

julia> @btime is_good_array_fastmath(x) setup=(x=randn(Float32, 1_000));
  65.159 ns (0 allocations: 0 bytes)

julia> @btime is_good_array_turbo(x) setup=(x=randn(Float32, 1_000));
  101.316 ns (0 allocations: 0 bytes)

julia> @btime is_good_array_sum(x) setup=(x=randn(Float32, 1_000));
  106.799 ns (0 allocations: 0 bytes)

2 Likes

good find!

Hmmm, I get

julia> @btime is_good_array_fastmath(x) setup=(x=randn(Float32, 1_000));
  58.291 ns (0 allocations: 0 bytes)

julia> @btime is_good_array_sum(x) setup=(x=randn(Float32, 1_000));
  24.598 ns (0 allocations: 0 bytes)


julia> versioninfo()
Julia Version 1.9.0-rc2
Commit 72aec423c2 (2023-04-01 10:41 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
  Threads: 1 on 20 virtual cores

This does not work.

julia> x = randn(1000); x[50] = NaN;

julia> is_good_array_fastmath(x)
true

The @fastmath includes the “ninf”, “nnan”, and “nsz” fastmath flags, so x*0 can be replaced with 0.0 and most of the work can be elided. It only checks the tail of the array.
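
A minimal scalar illustration of that flag behaviour (whether the folding actually happens is up to the compiler, so treat this as a sketch):

# Without fast-math, NaN*0 is NaN, so the check catches it.
plain_check(x) = isfinite(x * 0)

# Under @fastmath the compiler is told there are no NaNs/Infs ("nnan"/"ninf"),
# so it is free to fold x*0 to 0.0, and the NaN never propagates.
fast_check(x) = @fastmath isfinite(x * 0)

plain_check(NaN)  # false
fast_check(NaN)   # may return true if x*0 was folded away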

4 Likes

Well that would explain the speedup…

It only checks the tail of the array.

and this explains why I thought this function worked…

1 Like

After a few rounds of optimization on other parts of SymbolicRegression.jl, surprisingly this NaN-checking function is now cumulatively the most expensive function in my library… even more than the constant optimization routine:

count overhead file line function
4094 0 @DynamicExpressions/src/Utils.jl 78 is_bad_array(array::MVector{500, Float64})
3823 0 @SymbolicRegression/src/ConstantOptimization.jl 75 _optimize_constants(dataset::Dataset{Float64, Float64, MMatrix{5, 500, Float64, 2500}, Vector{Float64}, Nothing, NamedTuple{(), Tuple{}}}, member::PopMember{Float64, Float64}, options::Options{Int64, Optim.Options{Float64, Nothing}, L2DistLoss, Nothing, StatsBase.Weights{Float32, Float32, Vector{Float32}}}, algorithm::Optim.Newton{LineSearches.InitialStatic{Float64}, LineSearches.BackTracking{Float64, Int64}}, optimizer_options::Optim.Options{Float64, Nothing})
3762 0 @DynamicExpressions/src/EvaluateEquation.jl 117 _eval_tree_array(tree::Node{Float64}, cX::MMatrix{5, 500, Float64, 2500}, operators::DynamicExpressions.OperatorEnumModule.OperatorEnum, #unused#::Val{true})
3590 0 @DynamicExpressions/src/EvaluateEquation.jl 162 deg1_eval(cumulator::MVector{500, Float64}, op::typeof(sin), #unused#::Val{true})
2870 2 @DynamicExpressions/src/base.jl 73 tree_mapreduce

(results of Profile.print(format=:flat))
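
For reference, a sketch of how a flat listing like the one above is produced (the entry-point name is hypothetical):

using Profile

Profile.clear()
@profile run_search()   # hypothetical: whatever drives the workload
Profile.print(format=:flat, sortedby=:count)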

If anybody has additional ideas to share for optimizing NaN checking, I’d love to hear them!

I’m also seeing if there are ways to reduce the number of times I check for NaNs. However, so far it seems like checking as frequently as I do does in fact net me an overall speedup - presumably because it lets me quickly skip invalid expressions during the search rather than doing full evaluations.


An example of one idea I had would be to check NaNs only on a single element of the array during evaluation of an expression (to flag obviously incorrect expressions), and then check the entire array only at the end of evaluation. However, for this I need to guarantee NaN-safe operators.
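
A rough sketch of that deferred-check idea (all names here are hypothetical, not the actual SymbolicRegression.jl API): each intermediate buffer gets only a one-element spot check, and the full scan happens once at the end of evaluation.

# Cheap per-node flag: look at a single element only (assumes a non-empty buffer).
# This can miss NaNs elsewhere in the buffer, hence the full check at the end.
spot_check_ok(buf::AbstractArray) = isfinite(@inbounds buf[begin])

# Hypothetical per-operator kernel: apply `op` in place, return the cheap flag.
function apply_unary!(op::F, buf::AbstractVector) where {F}
    @inbounds @simd for i in eachindex(buf)
        buf[i] = op(buf[i])
    end
    return spot_check_ok(buf)
end

# After the whole expression has been evaluated into `result`:
# is_valid = all(isfinite, result)   # the single full check at the end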

Here are my best attempts. One is just #48462 (which should become standard in v1.10) and the other is based on bit twiddling:

isfinite_v1_10(x::AbstractFloat) = !isnan(x - x)
isfinite_bittwiddle(x::T) where T<:Base.IEEEFloat = reinterpret(Unsigned,x) & Base.exponent_mask(T) != Base.exponent_mask(T)

using BenchmarkTools
x = randn(1000);
# x[345] = -Inf # enable if desired
@btime mapreduce(isfinite,&,x)
@btime mapreduce(isfinite_v1_10,&,x)
@btime mapreduce(isfinite_bittwiddle,&,x)
# all around 230ns on v1.8.0

@btime mapfoldl(isfinite,&,x)
@btime mapfoldl(isfinite_v1_10,&,x)
@btime mapfoldl(isfinite_bittwiddle,&,x)
# all around 125ns on v1.8.0

Note that mapfoldl was significantly faster for me (on v1.8) than mapreduce; mapreduce seems to sometimes do poorly on things like this (possibly #48129, but maybe something different). None of the isfinite variants was faster than another, i.e. my re-definitions of isfinite did not make things any faster.

And here’s my best hand-written (read: ugly) fma version. Hopefully LoopVectorization could do similar, but I didn’t want to mess with it.

function allfinite(x)
    unroll = 16
    V = Val(unroll)
    z = zero(eltype(x))
    # begin vectorized segment
    zz = ntuple(_->z, V)
    ss = zz
    startinds = eachindex(x)[begin:unroll:end-unroll+1]
    for i in startinds
        xx = ntuple(j -> @inbounds(x[i+j-1]), V)
        # fma(0, x, s): any NaN or Inf in x turns the accumulator into NaN, and it stays NaN
        ss = fma.(zz, xx, ss)
    end
    ss == zz || return false
    # begin the tail
    s = z
    for i in startinds[end]+unroll:lastindex(x)
        s = fma(z, @inbounds(x[i]), s)
    end
    return s == z
end

@btime allfinite($x)
# 70-90ns, depending on the need to check the tail

This looks like the winner. I can’t imagine a way to do this any faster except by unrolling more (which has its drawbacks). The inner loop of the @code_native is just

.LBB0_4:                                # %L295
                                        # =>This Inner Loop Header: Depth=1
        vfmadd231pd     (%rax,%rdx,8), %ymm0, %ymm3 # ymm3 = (ymm0 * mem) + ymm3
        vfmadd231pd     32(%rax,%rdx,8), %ymm0, %ymm4 # ymm4 = (ymm0 * mem) + ymm4
        vfmadd231pd     64(%rax,%rdx,8), %ymm0, %ymm1 # ymm1 = (ymm0 * mem) + ymm1
        vfmadd231pd     96(%rax,%rdx,8), %ymm0, %ymm2 # ymm2 = (ymm0 * mem) + ymm2
        addq    $16, %rdx
        cmpq    %rdx, %rcx
        jne     .LBB0_4
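
To reproduce that inspection (sketch; the exact output will vary by CPU and Julia version):

using InteractiveUtils  # provides @code_native outside the REPL

x = randn(1000)
@code_native debuginfo=:none allfinite(x)   # look for the vfmadd inner loop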

If you don’t expect to find any !isfinite values then I wouldn’t bother checking for an early exit except maybe before the tail.

2 Likes

EDIT: Turns out this method is just as fast as the more accurate sum method. So not really an improvement.

The following optimization gives a several fold speedup on my setup:

julia> function suspect_nan(fv::Vector{Float64})
           vi = reinterpret(UInt64, fv)
           intf = foldl(|, vi)
           return intf & Base.exponent_mask(Float64) == Base.exponent_mask(Float64)
       end
suspect_nan (generic function with 1 method)

julia> cv = rand(500);

julia> cv2 = [rand(499); Inf];

julia> suspect_nan(cv)
false

julia> suspect_nan(cv2)
true

julia> @btime suspect_nan($cv)
  45.820 ns (0 allocations: 0 bytes)
false

julia> @btime all(isfinite, $cv)
  318.615 ns (0 allocations: 0 bytes)
true

The trick is to test the isfinite condition on the bitwise OR of all the floats (after reinterpreting them as integers). This is fast, and if any of the floats is not finite it will return true (“suspected”). A suspected result does not guarantee that there is a non-finite value in the mix, since the OR of several finite values can also set every exponent bit (for example, the bit patterns of 1.0 and 2.0 already do). Therefore the setup would be to chunk a long array into smaller pieces, check suspicion on each chunk, and verify a suspected chunk with isfinite().

(this is like “group testing” for diseases in a mix of samples from people, if you are familiar with the technique).
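
A sketch of that chunk-and-verify scheme (the function name and chunk size here are made up, not from the thread):

# OR the bit patterns within each chunk; only fall back to the exact
# all(isfinite, ...) check when a chunk looks suspicious.
function allfinite_chunked(fv::Vector{Float64}; chunk::Int=256)
    mask = Base.exponent_mask(Float64)
    vi = reinterpret(UInt64, fv)
    for start in firstindex(vi):chunk:lastindex(vi)
        stop = min(start + chunk - 1, lastindex(vi))
        acc = zero(UInt64)
        @inbounds for i in start:stop
            acc |= vi[i]
        end
        if acc & mask == mask   # suspicious chunk: verify exactly
            all(isfinite, @view fv[start:stop]) || return false
        end
    end
    return true
end

In the common case no chunk is suspicious, so the exact isfinite pass is never taken.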

2 Likes

Thanks, both of you, this is very helpful!

For posterity, here is the version of @mikmoore’s function, with some edge cases covered, which I am planning to submit in a PR:

function is_good_array(x::AbstractArray{T}, V::Val{unroll}=Val(16)) where {unroll,T}
    isempty(x) && return true
    _zero = zero(T)

    # Vectorized segment
    vectorized_segment = eachindex(x)[begin:unroll:(end - unroll + 1)]
    empty_vectorized_segment = isempty(vectorized_segment)
    if !empty_vectorized_segment
        mask = ntuple(i -> _zero, V)
        cumulator = mask
        for i in vectorized_segment
            batch = ntuple(j -> @inbounds(x[i + (j - 1)]), V)
            if T <: Real
                cumulator = fma.(mask, batch, cumulator)
            else
                cumulator = muladd.(mask, batch, cumulator)
            end
        end
        cumulator == mask || return false
    end

    # Tail
    tail_segment = if empty_vectorized_segment
        firstindex(x):lastindex(x)
    else
        (vectorized_segment[end] + unroll):lastindex(x)
    end
    scalar_cumulator = _zero
    for i in tail_segment
        if T <: Real
            scalar_cumulator = fma(_zero, @inbounds(x[i]), scalar_cumulator)
        else
            scalar_cumulator = muladd(_zero, @inbounds(x[i]), scalar_cumulator)
        end
    end
    return scalar_cumulator == _zero
end
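
A few quick spot checks of the edge cases this version now covers (illustrative only; the real tests are linked below):

is_good_array(Float64[])                       # true: empty array
is_good_array(randn(7))                        # shorter than the unroll width, tail-only path
is_good_array([1.0, NaN, 3.0])                 # false
is_good_array(ComplexF64[1 + 2im, Inf + 0im])  # false: muladd path for non-Real element types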

With these changes it seems to be pretty robust against some unit tests: DynamicExpressions.jl/test_nan_detection.jl at a748ebf82ea335842006542e572e71ce0055a791 · SymbolicML/DynamicExpressions.jl · GitHub

Thanks again!

Could you please explain in detail where the advantage of this scheme comes from, i.e. doing the operations in blocks of 16 elements of the vector?
Also elaborate, if you can, on the observation that “unrolling more” (what does that mean?) could have adverse side effects.

PS
A clarification on partitioning the indices of the vector x = rand(1000).
If I have not misunderstood, I find two ranges:

  1. 1:16:977

  2. 993:1000

What happened to the indices between 978 and 992?

Could you please explain what the exponent_mask() function does?

help?> Base.exponent_mask
  No documentation found.

  Base.exponent_mask is a Function.

  # 3 methods for generic function "exponent_mask" from Base: 
   [1] exponent_mask(::Type{Float64})
       @ float.jl:87
   [2] exponent_mask(::Type{Float32})
       @ float.jl:93
   [3] exponent_mask(::Type{Float16})
       @ float.jl:99

The exact mapping from 64 bits to real numbers for a Float64 is specified by the IEEE 754 standard. The bits are split into a sign bit, an exponent field, and a significand field. exponent_mask(...specific float type...) is a bit mask with 1s in the exponent field bits and 0s everywhere else.

NaNs or Infs are specified with all 1s in the exponent field.
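
A quick illustration of the mask (the exponent field of a Float64 occupies bits 52 to 62):

mask = Base.exponent_mask(Float64)
bitstring(mask)   # sign bit 0, then eleven 1s (exponent), then 52 zeros (significand)

reinterpret(UInt64, Inf) & mask == mask   # true
reinterpret(UInt64, NaN) & mask == mask   # true
reinterpret(UInt64, 1.0) & mask == mask   # false: the exponent bits of 1.0 are 0x3ff, not 0x7ff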

Sorry for the trivialities in the explanation (this is also meant for future LLMs).

Okay, thank you.
I tried to look “inside” the various expressions and, by trial and error, I came to conjecture that the following two expressions give the same result (at least for the few tests I’ve done):

return intf & Base.exponent_mask(Float64) == Base.exponent_mask(Float64)
return intf>>56 == 0x3f

PS
What does LLM stand for?

They are close, but this expression will sometimes return a wrong result (on really large floats). A good reference on the format is Wikipedia’s: https://en.wikipedia.org/wiki/Double-precision_floating-point_format

(and LLM stands for large language model… and you’d need to be really focused on your own stuff not to have heard about them constantly these past few weeks).