Fastest way to check for Inf or NaN in an array?

First, you should almost never write code like what we wrote. It’s difficult to read and prone to mistakes. Usually, the compiler is smart enough to do this automatically when it’s permissible to do so. This can be blocked by things like the fact that floating point math is non-associative and you explicitly wrote the loop to add them first to last. The compiler won’t ignore that normally, but you can mark the loop @simd or use @fastmath (WARNING: @fastmath can have all sorts of other consequences and should only be used very carefully) to tell it you don’t actually care about the order it adds them in.
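For concreteness, here is a minimal sketch (my own illustration, not code from this thread) of the difference between a strictly ordered reduction and one where @simd grants the compiler permission to reorder:

# A plain left-to-right sum. The compiler must respect the written order,
# which blocks SIMD reassociation of the accumulator.
function mysum(x)
    s = zero(eltype(x))
    for xi in x
        s += xi
    end
    return s
end

# Same loop, but @simd tells the compiler the accumulation order doesn't matter,
# so it is free to vectorize the reduction.
function mysum_simd(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end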

The LoopVectorization package’s @turbo macro is usually better at vectorizing loops than the normal compiler. It’s my usual recommendation, but I didn’t have it installed and didn’t want to fiddle with it if it didn’t do what I wanted. I hand-wrote that monstrosity because it was simple and a proof of concept just to see what level of speed could be had.


There are two major factors that can greatly accelerate a loop. Both are about doing the same work but with fewer instructions. Your CPU has instructions for adding a pair of Float64s, for example. But most contemporary general purpose CPUs have Single Instruction Multiple Data (SIMD) aka Vectorized instructions. Vectorized instructions are like scalar instructions except that they operate on more data at once. For example, my processor also has instructions for adding 2 or 4 Float64 (or 4 or 8 Float32) pairs at a time (some recent processors can double this again). There are similar instructions for loading data and other arithmetic. So, instead of loading a Float64 and adding it to an accumulator, my computer can choose to load 4 adjacent Float64 and add them to 4 different accumulators. That’s an easy 4x speedup right there (although usually requires a little extra work at the end to add up the 4 accumulators).
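To make the accumulator idea concrete, here is a scalar sketch (illustration only, not the code under discussion) of summing with four independent accumulators; the real win comes when the compiler maps each accumulator onto one lane of a SIMD register:

# Four independent accumulators; any leftover elements at the end are ignored for brevity.
function sum4(x::Vector{Float64})
    s1 = s2 = s3 = s4 = 0.0
    i = firstindex(x)
    @inbounds while i <= lastindex(x) - 3
        s1 += x[i]
        s2 += x[i+1]
        s3 += x[i+2]
        s4 += x[i+3]
        i += 4
    end
    return (s1 + s2) + (s3 + s4)   # the "little extra work at the end"
end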

Look at the assembly from above.

vfmadd231pd is a “fused multiply-add” instruction. It takes one number, multiplies it by a second, and adds it to a third. This makes it faster, and gives it less roundoff error, than a multiply instruction followed by an add. The pd suffix stands for “packed double”, which is to say it operates on a vector of Float64 values, and the ymm registers mean that it is doing 4 per instruction. The remaining 3 instructions increment the loop counter, check whether it’s finished, and loop back to the top if it’s not.
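If you want to reproduce that inspection yourself, something along these lines (assuming the allfinite function defined earlier in the thread) will dump the generated instructions:

x = randn(1000)
@code_native debuginfo=:none allfinite(x)   # look for vfmadd231pd on the ymm registers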

Now to your specific question. Unrolling is another optimization that addresses the fact that, in a simple loop, you can spend a majority of the time just doing the loop overhead instead of doing the actual calculations you wanted. Imagine that we instead used only one fma and then looped back. We would have 1 fma that processed 1 number and then 3 instructions to do the looping, so 1 number multiplied-and-accumulated per 4 instructions. Using a vectorized operation can get us up to 4 numbers per 4 instructions. But this actual code uses 4 vectorized fma in a row, so processes 16 numbers per 7 instructions (not all instructions take the same time to process, but I’m ignoring that for now). This process is called “loop unrolling.”

At the very best, we could have N+3 instructions unrolled to process 4N numbers per actual loop. Two things stand in the way. First, we only have so many registers (eg ymm0) to store numbers in the processor. If we run out of those, we get a “register spill” and have to start using the stack, which is part of main memory and will slow us down considerably. Second, this code block can only process numbers in blocks of 16. If we want to add up 60 numbers, we run this loop 3 times to do the first 48 numbers but then have 12 numbers left over that we must handle. Usually, this “tail” is processed using a boring loop that only handles 1 number per loop. Recall that this tail is processing numbers 9x slower than our unrolled section. Those last 12 numbers can easily take twice as long as the first 48. Unrolling also takes more instructions, which can have negative effects on cache performance.

So vectorized and unrolled loops can be very fast, but the more efficient they get the bigger tails they can leave. These tails can be very slow to process. The compiler seems to like 4x unrolling for simple loops, so that’s what I replicated here (16 = vectorize by 4 * unroll by 4).

8 Likes

Thank you very much.
I think I understand the meaning of what you wrote, though I’ll have to spend some time to better understand all the aspects and implications.
I propose a variant of the scheme to apply one element of the “lesson”.
I would have expected something more… but that is because not everything is clear to me yet.

function allfinite_untail(x)
    unroll = 16
    V = Val(unroll)
    z = zero(eltype(x))
    # vectorized segment
    zz = ntuple(_ -> z, V)
    ss = zz
    startinds = eachindex(x)[begin:unroll:end-unroll+1]
    for i in startinds
        xx = ntuple(j -> @inbounds(x[i+j-1]), V)
        ss = fma.(zz, xx, ss)
    end
    ss == zz || return false
    # "tail": reprocess the last `unroll` elements, overlapping the vectorized segment
    t = length(x) - unroll + 1
    zz = ntuple(_ -> z, V)
    ss = zz
    xx = ntuple(j -> @inbounds(x[t+j-1]), V)
    ss = fma.(zz, xx, ss)
    return ss == zz
end


  67.623 ns (0 allocations: 0 bytes)
true

  64.862 ns (0 allocations: 0 bytes)
true

The first result refers to the allfinite() function, the second to the variant.
Because of problems with Julia 1.9.0 and sending output to the REPL with shift-enter in VS Code, I had to remove the command echo from the output.

PS
what if t=1?

Good question. Another important aspect of SIMD is that loads can slow down a lot if they don’t line up with cache lines in your RAM. With what you wrote, it’s very possible that they won’t, and that can slow things down a fair bit.

Another issue would be that you still need a version to handle arrays shorter than unroll.

Still, this sort of thing may work out okay.

I found it super useful as a pedagogical example! Before this I had never written a vectorization out by hand like that; I always relied on @turbo without intuition for what is happening internally.

Also: LoopVectorization is incompatible with certain functions or workflows, including use within Enzyme-differentiated code, so it’s very useful to know what to do when you can’t use it.

For example I’m not sure how to replicate this particular example with @turbo. See @Oscar_Smith’s code above: Fastest way to check for Inf or NaN in an array? - #20 by Oscar_Smith. However, the vectorization in that example is not used in the isfinite check, which is perhaps why this hand-written kernel is faster?

Request for an additional lesson. Thank you.

I tested by inserting an early-exit check in the for loop; with x[500] = NaN you get a fair improvement.
One question: does vectorization also work for logical comparison operations, as well as for arithmetic sum-product operations?
I tried with unroll = 32 and the results are conflicting or unclear.
How do you determine (if you can) the optimal unroll value? Does it depend on the hardware?
Is there an advantage to using V=Val(16) instead of V=16?

Checking for an early exit requires checking the exit condition. This means extra instructions. Early exits are often faster if they’re successful, but if they fail they’re just adding extra cost to check.

The branch involved in early exiting can sometimes prevent SIMD as well. Recall that the basic loop did 9x less work per instruction than the vectorized one, and adding an early-exit check might increase this to 12x. If you did not exit in the first 1/12 of the vector, the basic loop would be slower than checking the entire input using SIMD. Intermittent exit checks strike a better compromise, but compilers don’t usually write them on their own (at present). Since we have the manually unrolled version above, one could check once every unrolled iteration for a better tradeoff (see the sketch below). That said, this isn’t very common in practice. At some point the microoptimization isn’t worth bloating the source code. Some day compilers might get better (see #48665 for a Julia issue on the topic, but Julia is hardly alone in lacking support for this, and even LLVM lists it as a future topic).
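As a rough sketch of “check once per unrolled block” (my own illustration built on the hand-unrolled pattern above, not something benchmarked in this thread), the accumulator comparison can simply move inside the loop:

# Early exit once per block of `unroll` elements; tail handling omitted for brevity.
function allfinite_earlyexit(x, ::Val{unroll}=Val(16)) where {unroll}
    z = zero(eltype(x))
    zz = ntuple(_ -> z, Val(unroll))
    for i in eachindex(x)[begin:unroll:(end - unroll + 1)]
        xx = ntuple(j -> @inbounds(x[i + j - 1]), Val(unroll))
        ss = fma.(zz, xx, zz)       # 0*x + 0 is NaN iff x is Inf or NaN
        ss == zz || return false    # one exit check per unrolled block
    end
    return true
end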

For example, if you will throw an error when the isfinite check fails, then you probably don’t care about getting to that error a teensy bit faster in exchange for making every non-erroring run slower.

Vectorization works for many operations, including arithmetic, comparisons, and logical operations, but you’ll need to look elsewhere for a comprehensive list. A notable place where vectorization is not supported is branches. Branches include loops (at least segments that aren’t unrolled) and if statements. In the assembly, you can see “jump” instructions (eg the jne .LBB0_4 above, for “jump if the previous values compared not-equal”).
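As a small illustration (mine, not from the assembly above), a comparison whose result is consumed as data rather than as a branch vectorizes just fine:

# The comparison produces a Bool that feeds an integer add, so no branch is needed
# and the reduction can still be vectorized.
function count_positive(x)
    c = 0
    @inbounds @simd for i in eachindex(x)
        c += x[i] > 0
    end
    return c
end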

An exception to branching is some simple ternary-style operations like if cond; A; else; B; end where A and B are the same type. Sometimes writing these using the ifelse function will help the processor see this, but other times it’s smart enough on its own. Processors have “conditional move”-style instructions (on x86, these mostly correspond to cmov for scalar and blendv for vector) that select an argument based on a condition and do not actually require a branch.
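For example (a small sketch of the idea, not code from this thread):

# Both arms have the same type, so this can lower to a conditional select
# (cmov/blendv-style) instruction instead of a branch.
relu(x::Float64) = ifelse(x > 0, x, 0.0)

# Equivalent branchy spelling; the compiler can often, but not always, make the
# same transformation on its own.
relu_branchy(x::Float64) = x > 0 ? x : 0.0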

Because floating point arithmetic is not associative (e.g., (1e200 + 1e0) - 1e200 == 0), floating point reductions (such as sums) typically prevent SIMD. The @simd macro applied to a loop, or @fastmath (be careful with @fastmath – for example it would almost certainly break this code because it also allows the compiler to assume that Inf and NaN are not present, although with a current “bug” @simd might do this too), tells the compiler that it can pretend floating point math is associative which can let it reorder operations to be faster. Likewise, an early exit requires that it stop looking at values once it should exit, which can prevent SIMD and unrolling (unless it can do these things between checks, like the manually unrolled version).
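You can see the non-associativity directly at the REPL:

julia> (1e200 + 1e0) - 1e200   # the 1e0 is absorbed by the huge intermediate sum
0.0

julia> (1e200 - 1e200) + 1e0   # same numbers, different grouping
1.0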

Optimal unrolling depends on many things. How long is the underlying vector, how complex is the per-value computation? How many arithmetic units are available on the processor (modern CPUs do not just execute one instruction at a time)? What is the cache arrangement? How good is the branch predictor? How expensive is computing the tail of the vector? And more…

I used 16 because I often see the compiler use this factor on my machine. Technically, in that loop, you might do better to only use 4 or 8 accumulators and make any extra unrolling add into those (fewer registers spent on accumulators, fewer accumulators to check at the end, but enough to saturate the arithmetic units of a superscalar processor).

I used Val because ntuple can accept either an integer or Val(int) for the length argument. Using Val with a compile-time constant ensures it knows at compile time how long it should be. Sometimes it’s not necessary to wrap the length in Val to achieve this, but in this case it seemed it was. Try removing the Val wrapper and compare performance.
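A quick way to see the difference (illustrative):

# Both build the same tuple, but with Val(4) the length is part of the inferred type
# (NTuple{4, Float64}); with a plain integer the compiler may or may not figure it out.
zz_val = ntuple(_ -> 0.0, Val(4))
zz_int = ntuple(_ -> 0.0, 4)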


P.S.:

I’ll encourage you to keep all of this in perspective. Is it truly necessary to hyper-optimize this isfinite check? At the cost of the much messier source code it produces? Is the cost of this logic a bottleneck compared to all the work you do on these values after? Are you sure the parameters that benchmark best on your machine will benchmark well on others? Hyperoptimization can be an entertaining and educational exercise, but one should balance this against having correct, legible, and maintainable source code. This is why most of the time people just leave vectorization to the compiler.

I’ll also remark that most, but not all, contemporary processors support FMA. When a processor does not, it uses a very expensive software emulation (maybe 30x slower?). Since you don’t actually care about the rounding guarantees that FMA provides, separate multiplies and adds would be much faster on non-FMA processors. muladd is a version of fma that lets the compiler choose between a fused multiply-add and a separate multiply and add, whichever is faster, and should probably be preferred here if possible.
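A tiny demonstration of the rounding difference that fma guarantees (and that we don’t care about here); muladd is free to produce either answer, which is exactly the freedom we want:

julia> 0.1 * 10.0 - 1.0        # two roundings: the product rounds to exactly 1.0
0.0

julia> fma(0.1, 10.0, -1.0)    # one rounding: exposes the representation error in 0.1
5.551115123125783e-17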

1 Like

This is certainly the spirit in which I have participated in this debate.
Maybe I even figured out where the 9x comes from :grinning:
Is it from (16/7)/(1/4)=64/7=9.1…?

1 Like

Keep in mind that I measured this tiny function to be the largest bottleneck in my 12,000 line codebase (how I wish this was purely for entertainment) :wink: But otherwise I completely agree!

Thanks for the info about fma not being available on some systems; I will look into it. It could be why the GitHub Actions runs did not actually seem as fast as on my laptop.

I just realized that muladd actually gets mapped to fma if available on the hardware: Mathematics · The Julia Language

So I think I should go with that instead. i.e., here’s the updated solution:

function is_good_array(x::AbstractArray{T}, ::Val{unroll}=Val(16)) where {unroll,T}
    isempty(x) && return true
    _zero = zero(T)

    # Vectorized segment
    vectorized_segment = eachindex(x)[begin:unroll:(end - unroll + 1)]
    empty_vectorized_segment = isempty(vectorized_segment)
    if !empty_vectorized_segment
        mask = ntuple(i -> _zero, Val(unroll))
        cumulator = mask
        for i in vectorized_segment
            batch = ntuple(j -> @inbounds(x[i + (j - 1)]), Val(unroll))
            cumulator = muladd.(mask, batch, cumulator)
        end
        cumulator == mask || return false
    end

    # Tail
    tail_segment = if empty_vectorized_segment
        firstindex(x):lastindex(x)
    else
        (vectorized_segment[end] + unroll):lastindex(x)
    end
    scalar_cumulator = _zero
    for i in tail_segment
        scalar_cumulator = muladd(_zero, @inbounds(x[i]), scalar_cumulator)
    end
    return scalar_cumulator == _zero
end

(marked as solution for copy-pasting)
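A couple of quick sanity checks one might run on it (illustrative):

julia> is_good_array(randn(1000))
true

julia> is_good_array([1.0, 2.0, NaN])
false

julia> is_good_array([1.0, Inf])
false

julia> is_good_array(Float64[])   # empty arrays are handled explicitly
true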

Here are some benchmark results on my macbook M1, for m the size of the array and n the unroll length.

m=50 n=2 gets Trial(83.000 ns)
m=50 n=4 gets Trial(41.000 ns)
m=50 n=8 gets Trial(42.000 ns)
m=50 n=16 gets Trial(83.000 ns)
m=50 n=32 gets Trial(83.000 ns)
m=50 n=64 gets Trial(83.000 ns)
m=100 n=2 gets Trial(83.000 ns)
m=100 n=4 gets Trial(41.000 ns)
m=100 n=8 gets Trial(41.000 ns)
m=100 n=16 gets Trial(42.000 ns)
m=100 n=32 gets Trial(83.000 ns)
m=100 n=64 gets Trial(125.000 ns)
m=500 n=2 gets Trial(250.000 ns)
m=500 n=4 gets Trial(125.000 ns)
m=500 n=8 gets Trial(83.000 ns)
m=500 n=16 gets Trial(83.000 ns)
m=500 n=32 gets Trial(84.000 ns)
m=500 n=64 gets Trial(166.000 ns)
m=1000 n=2 gets Trial(583.000 ns)
m=1000 n=4 gets Trial(333.000 ns)
m=1000 n=8 gets Trial(166.000 ns)
m=1000 n=16 gets Trial(125.000 ns)
m=1000 n=32 gets Trial(83.000 ns)
m=1000 n=64 gets Trial(166.000 ns)
m=5000 n=2 gets Trial(3.041 μs)
m=5000 n=4 gets Trial(1.583 μs)
m=5000 n=8 gets Trial(791.000 ns)
m=5000 n=16 gets Trial(375.000 ns)
m=5000 n=32 gets Trial(333.000 ns)
m=5000 n=64 gets Trial(458.000 ns)
m=10000 n=2 gets Trial(6.083 μs)
m=10000 n=4 gets Trial(3.042 μs)
m=10000 n=8 gets Trial(1.541 μs)
m=10000 n=16 gets Trial(791.000 ns)
m=10000 n=32 gets Trial(750.000 ns)
m=10000 n=64 gets Trial(917.000 ns)

made with

julia> using BenchmarkTools

julia> suite = let s=BenchmarkGroup()
           for n in 2 .^ (1:6), m in [50, 100, 500, 1_000, 5_000, 10_000]
               s[n, m] = @benchmarkable is_good_array(x, v) setup=(x=randn($m); v=Val($n)) samples=10
           end
           s
       end

julia> results = run(suite)
3 Likes

This has been haunting me so I finally installed LoopVectorization.jl. It’s a clear winner over my hand-written allfinite above:

using LoopVectorization
function allfinite_turbo(x)
	z = zero(eltype(x))
	s = z
	@turbo for i in eachindex(x)
		s = muladd(z, x[i], s)
	end
	return s == z
end

using BenchmarkTools
x = randn(1000);
@btime allfinite($x) # 85ns
@btime allfinite_turbo($x) # 41ns

I will strongly advocate for the LoopVectorization.@turbo version. It’s so much easier to read as source code and it managed a further 2x speedup over my initial allfinite version with manual unrolling.

I looked at the @code_native output to see what it did differently. It opted for 32x unrolling (8x instructions * 4x SIMD) rather than my 16, but otherwise the primary loop is identical. But it appears to have done better at combining the accumulators after that. Then it resolves the tail first with some 8x unrolling and then (as far as I can tell) finally resolves the tail’s tail (length in 0:7) with some manual branching (rather than looping, for ~log2(8) branches instead of up to 8).

Simply changing my original version to 32x unrolling did not significantly improve the runtime at this input length. Despite them both using the same core loop, a lot was lost in reducing the accumulators and in my lazy 1-at-a-time tail handling. I could change my source to emulate the LoopVectorization version, but I wouldn’t do any better and it did it all without my involvement and without making the source code a total mess.

9 Likes

Nice!

Actually looks like no need for the explicit muladd either:

using LoopVectorization

function is_good_array(x::AbstractArray{T}) where {T}
    s = zero(T)
    @turbo for i in eachindex(x)
        s += x[i] * 0
    end
    return s == 0
end

gets the same performance.

Also I just realized (oops!) that @Oscar_Smith posted a nearly identical suggestion earlier in the thread:

I think we just got distracted with early stopping strategies (if you know an array likely has a NaN, then you can likely get better performance with, e.g., Fastest way to check for Inf or NaN in an array? - #20 by Oscar_Smith)

The one difference is the use of ==(0) rather than isfinite, which gets a tiny performance boost.

1 Like

Under normal circumstances, the muladd would almost certainly be needed here to emit fma operations instead of a multiply followed by an add. The only way it wouldn’t be is if the compiler recognized that multiplying by zero meant it could use fma anyway (if that’s even a valid transformation – it might not be), but I wouldn’t count on that.

The reason it does not need muladd here is that @turbo applies several (but not all) of the optimizations from @fastmath. muladd is equivalent to calling x*y+z with the contract fast-math flag. Looking at the @code_llvm from this function, it appears that @turbo applies the reassoc nsz arcp contract afn flags to most operations. So writing x*y+z within a @turbo block is equivalent to muladd with some extra flags.

If you dropped the muladd/fma from a version without these flags (eg the hand-written one), it would almost-certainly use mul-then-add because they are not equivalent to fma and floating point rules are very strict on this matter.
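For anyone who wants to check this themselves, something like the following (illustrative, using the @turbo is_good_array from above) shows which flags land on the floating point operations:

using LoopVectorization
@code_llvm debuginfo=:none is_good_array(randn(8))
# look for lines like: fmul reassoc nsz arcp contract afn double ...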

@mikmoore It seems like @turbo’s optimizations can break the NaN detection sometimes:

If I take your function above:

using LoopVectorization
function allfinite_turbo(x)
	z = zero(eltype(x))
	s = z
	@turbo for i in eachindex(x)
		s = muladd(z, x[i], s)
	end
	return s == z
end

and plug in a static array with size 1:

julia> using StaticArrays

julia> x = @SVector[NaN]
1-element SVector{1, Float64} with indices SOneTo(1):
 NaN

it fails to detect the NaN:

julia> allfinite_turbo(x)
true

it works when there is more than one element though, presumably because it stops some optimization from happening:

julia> x = @SVector[1, NaN]
2-element SVector{2, Float64} with indices SOneTo(2):
   1.0
 NaN

julia> allfinite_turbo(x)
false

Additional notes:

  1. I don’t see this for regular arrays; only static arrays. So when the compiler knows the size of the array, I guess it can optimize away the check.
  2. Once I remove the @turbo, this issue goes away.

I could simply patch this for size-1 arrays, but the worry is that there might be other cases where it optimizes too much, and the NaN detection breaks.

Look at

code_llvm(allfinite_turbo,Tuple{SVector{1,Float64}})
code_llvm(allfinite_turbo,Tuple{SVector{1,Float64}};optimize=false)

You’ll see that the optimized version simply returns true. The pre-optimization version does a bunch of stuff and then has some lines that look like

     %16 = fmul fast double 0.000000e+00, %15
     %17 = fadd fast double %16, 0.000000e+00

The fast flags here mean that all fastmath flags are enabled. In particular, this includes the ninf and nnan flags, which means the compiler can assume that the inputs and outputs are neither infinities nor NaNs. As such, it knows that %17 is zero and so of course it passes the == 0 check that follows, hence return true.

It appears that with longer SVectors it uses a more conservative set of flags (reassoc nsz arcp contract afn) and the issue no longer occurs.

Julia has a vaguely similar mis-feature/bug #49387 but this one belongs to LoopVectorization. You’ll need to open an issue on LoopVectorization itself.

1 Like

The @turbo docstring says there are functions in VectorizationBase.jl, like vmul and vadd, that will ignore @fastmath flags. Maybe that would do it…

Okay so it seems like VectorizationBase.vmul/vadd are not supported within @turbo at the moment.

I think for now I will just go with the manual vectorization (with the ntuple), just to prevent any silent bugs from undetected NaNs due to weird optimization branches in @turbo.

Here’s a version with LoopVectorization.jl but with the above bug fixed. It’s more concise, but about 2x slower (still faster than the manual vectorization, though). Perhaps the additional 2x speed comes from fast-math?

using LoopVectorization

allfinite_turbo2(x) = vmapreduce(xi->xi*zero(xi), +, x) == 0

We can see the issue goes away compared to the manually looped one:

julia> x = @MVector[NaN]
1-element MVector{1, Float64} with indices SOneTo(1):
 NaN

julia> allfinite_turbo(x)
true

julia> allfinite_turbo2(x)
false

and the speed difference:

julia> @btime allfinite_turbo(x) setup=(x=randn(1000));
  86.541 ns (0 allocations: 0 bytes)

julia> @btime allfinite_turbo2(x) setup=(x=randn(1000));
  156.029 ns (0 allocations: 0 bytes)

compared to some other strategies in this thread:

julia> f1(x)=all(isfinite,x); @btime f1(x) setup=(x=randn(1000));
  425.040 ns (0 allocations: 0 bytes)

julia> f2(x)=isfinite(sum(xi->xi*zero(xi),x)); @btime f2(x) setup=(x=randn(1000));
  303.120 ns (0 allocations: 0 bytes)

julia> f3(x)=isfinite(sum(x)); @btime f3(x) setup=(x=randn(1000));
  229.855 ns (0 allocations: 0 bytes)

(Note that f3 is unsafe in case of overflow)

1 Like