Fantastic, thanks! This looks to be a winner:
julia> @btime isfinite(sum(x)) setup=x=randn(Float32, 1000)
80.964 ns (0 allocations: 0 bytes)
true
julia> @btime y(x) setup=x=randn(Float32, 1000)
75.558 ns (0 allocations: 0 bytes)
true
FYI, I see a different performance picture with Julia 1.7.2 on a Win 11 laptop, using 8 threads and LoopVectorization v0.12.102:
@btime isfinite(sum(x)) setup=x=randn(Float32, 1000) # 36 ns (0 allocs: 0 bytes)
@btime y(x) setup=x=randn(Float32, 1000) # 102 ns (0 allocs: 0 bytes)
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, icelake-client)
Environment:
JULIA_PKG_USE_CLI_GIT = true
JULIA_STACKFRAME_FUNCTION_COLOR = blue
JULIA_WARN_COLOR = cyan
JULIA_EDITOR = code.cmd -g
JULIA_NUM_THREADS = 8
Can confirm @rafael.guerra's findings:
@btime isfinite(sum(x)) setup=x=randn(Float32, 1000) # 24.473 ns (0 allocations: 0 bytes)
@btime y(x) setup=x=randn(Float32, 1000) # 76.567 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: AMD Ryzen 9 5950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, znver3)
Also, I think this is probably an example of premature optimization… saving 5 nanoseconds (which was the result on your machine) is not going to make a difference. You'd have to call it half a billion times for the difference even to be noticeable, and even then it still wouldn't be worth it unless you'd already optimized literally everything else in your code.
EDIT: For posterity, I will note that for an array with 10,000 elements, y(x) ran in 350ns while isfinite(sum(x)) ran in 363ns. So y(x) might be faster for larger arrays, but we are still talking a few billionths of a second.
How big does the array have to be before the difference in compilation time is made up for, heh?
Indeed, this is exactly my situation. This NaN/Inf checker is part of SymbolicRegression.jl, where billions of generated expressions need to be evaluated over an input array, and switching the NaN check from all(isfinite, x) to isfinite(sum(x)) changed the overall search speed of the entirety of SymbolicRegression.jl by 25%! You can see the measurements at [Performance] Single evaluation results · Issue #73 · MilesCranmer/SymbolicRegression.jl · GitHub - equation evaluation is the bottleneck here, so any tiny improvement, even to the NaN checker, counts quite a bit.
I’m not sure why the calculations are quite different between my benchmarks and these ones. I am running my tests on an Apple Silicon M1 (aarch64) which still has some issues, so maybe the SIMD operations are not yet optimized…?
Anyways, I actually went back to isfinite(sum(x)) in SymbolicRegression.jl because I was seeing some issues with @turbo, and the two are quite close in timing. So it's good to hear the sum operation is actually faster on other architectures.
But yeah if there’s still a faster way to do this, I’d love to hear it.
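For readers landing here, the isfinite(sum(x)) screen discussed above works because NaN and Inf propagate through addition, so a single bad element poisons the whole sum. A minimal sketch of the behavior (values are illustrative, not from the thread):

```julia
x = randn(Float32, 1000)
isfinite(sum(x))     # true: every element is finite

x[500] = NaN32
isfinite(sum(x))     # false: NaN propagates through the sum

# Caveat: the screen is conservative. Overflow of large finite values
# also produces Inf, so an all-finite array can still fail the check.
y = fill(floatmax(Float32), 1000)
isfinite(sum(y))     # false, although all(isfinite, y) is true
```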
Sorry for the ping, but I ended up finding an even faster solution. It turns out that @Oscar_Smith's solution can be faster than sum, but you have to replace @turbo with @inbounds @fastmath. I think @turbo's default unroll settings might not be suited for array chunks of this size?
function is_good_array_fastmath(x)
for i in firstindex(x):64:(lastindex(x)-63)
s = zero(eltype(x))
@inbounds @fastmath for j in 0:63
s += x[i+j]*0
end
!isfinite(s) && return false
end
return all(isfinite, @view x[max(end-64,begin):end])
end
Comparison:
function is_good_array_turbo(x)
for i in firstindex(x):64:(lastindex(x)-63)
s = zero(eltype(x))
@turbo for j in 0:63
s += x[i+j]*0
end
!isfinite(s) && return false
end
return all(isfinite, @view x[max(end-64,begin):end])
end
is_good_array_sum(x) = isfinite(sum(x))
which gives (on my mac M1):
julia> @btime is_good_array_fastmath(x) setup=(x=randn(Float32, 1_000));
65.159 ns (0 allocations: 0 bytes)
julia> @btime is_good_array_turbo(x) setup=(x=randn(Float32, 1_000));
101.316 ns (0 allocations: 0 bytes)
julia> @btime is_good_array_sum(x) setup=(x=randn(Float32, 1_000));
106.799 ns (0 allocations: 0 bytes)
good find!
Hmmm, I get
julia> @btime is_good_array_fastmath(x) setup=(x=randn(Float32, 1_000));
58.291 ns (0 allocations: 0 bytes)
julia> @btime is_good_array_sum(x) setup=(x=randn(Float32, 1_000));
24.598 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.9.0-rc2
Commit 72aec423c2 (2023-04-01 10:41 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, alderlake)
Threads: 1 on 20 virtual cores
This does not work.
julia> x = randn(1000); x[50] = NaN;
julia> is_good_array_fastmath(x)
true
The @fastmath includes the "ninf", "nnan", and "nsz" fastmath flags, so x*0 can be replaced with 0.0 and most of the work can be elided. It only checks the tail of the array.
Well, that would explain the speedup…

> It only checks the tail of the array.

and this explains why I thought this function worked…
After a few rounds of optimization on other parts of SymbolicRegression.jl, surprisingly this NaN-checking function is now cumulatively the most expensive function in my library… even more than the constant optimization routine:
| count | overhead | file | line | function |
|---|---|---|---|---|
| 4094 | 0 | @DynamicExpressions/src/Utils.jl | 78 | is_bad_array(array::MVector{500, Float64}) |
| 3823 | 0 | @SymbolicRegression/src/ConstantOptimization.jl | 75 | _optimize_constants(dataset::Dataset{Float64, Float64, MMatrix{5, 500, Float64, 2500}, Vector{Float64}, Nothing, NamedTuple{(), Tuple{}}}, member::PopMember{Float64, Float64}, options::Options{Int64, Optim.Options{Float64, Nothing}, L2DistLoss, Nothing, StatsBase.Weights{Float32, Float32, Vector{Float32}}}, algorithm::Optim.Newton{LineSearches.InitialStatic{Float64}, LineSearches.BackTracking{Float64, Int64}}, optimizer_options::Optim.Options{Float64, Nothing}) |
| 3762 | 0 | @DynamicExpressions/src/EvaluateEquation.jl | 117 | _eval_tree_array(tree::Node{Float64}, cX::MMatrix{5, 500, Float64, 2500}, operators::DynamicExpressions.OperatorEnumModule.OperatorEnum, #unused#::Val{true}) |
| 3590 | 0 | @DynamicExpressions/src/EvaluateEquation.jl | 162 | deg1_eval(cumulator::MVector{500, Float64}, op::typeof(sin), #unused#::Val{true}) |
| 2870 | 2 | @DynamicExpressions/src/base.jl | 73 | tree_mapreduce |
(results of Profile.print(format=:flat))
If anybody has additional ideas to share for optimizing NaN checking, I’d love to hear them!
I’m also seeing if there are ways to reduce the number of times I check for NaNs. However, so far it seems like checking as frequently as I do does in fact net me an overall speedup - presumably because it lets me quickly skip invalid expressions during the search rather than doing full evaluations.
An example of one idea I had would be to only check NaNs on a single element of the array during evaluation of an expression (to flag obviously-incorrect expressions), and then only at the end of evaluation would I check the entire array. However for this I need to guarantee NaN-safe operators.
Here are my best attempts. One is just #48462 (which should become standard in v1.10) and the other is based on bit twiddling:
isfinite_v1_10(x::AbstractFloat) = !isnan(x - x)
isfinite_bittwiddle(x::T) where T<:Base.IEEEFloat = reinterpret(Unsigned,x) & Base.exponent_mask(T) != Base.exponent_mask(T)
using BenchmarkTools
x = randn(1000);
# x[345] = -Inf # enable if desired
@btime mapreduce(isfinite,&,x)
@btime mapreduce(isfinite_v1_10,&,x)
@btime mapreduce(isfinite_bittwiddle,&,x)
# all around 230ns on v1.8.0
@btime mapfoldl(isfinite,&,x)
@btime mapfoldl(isfinite_v1_10,&,x)
@btime mapfoldl(isfinite_bittwiddle,&,x)
# all around 125ns on v1.8.0
Note that mapfoldl was significantly faster for me (on v1.8) than mapreduce. mapreduce seems to sometimes do poorly on things (possibly #48129, but maybe different). None of the isfinite variants was any faster than another; my re-definitions of isfinite did not make things any faster.
And here's my best hand-written (read: ugly) fma version. Hopefully LoopVectorization could do something similar, but I didn't want to mess with it.
function allfinite(x)
unroll = 16
V = Val(unroll)
z = zero(eltype(x))
# begin vectorized segment
zz = ntuple(_->z, V)
ss = zz
startinds = eachindex(x)[begin:unroll:end-unroll+1]
for i in startinds
xx = ntuple(j -> @inbounds(x[i+j-1]), V)
ss = fma.(zz, xx, ss)
end
ss == zz || return false
# begin the tail
s = z
for i in startinds[end]+unroll:lastindex(x)
s = fma(z, @inbounds(x[i]), s)
end
return s == z
end
@btime allfinite($x)
# 70-90ns, depending on the need to check the tail
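The reason the fma accumulator above detects nonfinite values: fma(0, x, s) leaves s unchanged for finite x, but 0 * Inf and 0 * NaN are both NaN under IEEE rules, so the accumulator ends up nonzero (NaN) exactly when a bad element appears. For instance:

```julia
fma(0.0, 1.5, 0.0)    # 0.0: a finite element leaves the accumulator at zero
fma(0.0, Inf, 0.0)    # NaN: 0 * Inf is NaN under IEEE-754
fma(0.0, NaN, 0.0)    # NaN: NaN propagates through the multiply-add
```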
This looks like the winner. I can't imagine a way to do this any faster except by unrolling more (which has its drawbacks). The inner loop of the @code_native is just
.LBB0_4: # %L295
# =>This Inner Loop Header: Depth=1
vfmadd231pd (%rax,%rdx,8), %ymm0, %ymm3 # ymm3 = (ymm0 * mem) + ymm3
vfmadd231pd 32(%rax,%rdx,8), %ymm0, %ymm4 # ymm4 = (ymm0 * mem) + ymm4
vfmadd231pd 64(%rax,%rdx,8), %ymm0, %ymm1 # ymm1 = (ymm0 * mem) + ymm1
vfmadd231pd 96(%rax,%rdx,8), %ymm0, %ymm2 # ymm2 = (ymm0 * mem) + ymm2
addq $16, %rdx
cmpq %rdx, %rcx
jne .LBB0_4
If you don't expect to find any !isfinite values, then I wouldn't bother checking for an early exit, except maybe before the tail.
EDIT: Turns out this method is just as fast as the more accurate sum method. So not really an improvement.
The following optimization gives a several fold speedup on my setup:
julia> function suspect_nan(fv::Vector{Float64})
vi = reinterpret(UInt64, fv)
intf = foldl(|, vi)
return intf & Base.exponent_mask(Float64) == Base.exponent_mask(Float64)
end
suspect_nan (generic function with 1 method)
julia> cv = rand(500);
julia> cv2 = [rand(499); Inf];
julia> suspect_nan(cv)
false
julia> suspect_nan(cv2)
true
julia> @btime suspect_nan($cv)
45.820 ns (0 allocations: 0 bytes)
false
julia> @btime all(isfinite, $cv)
318.615 ns (0 allocations: 0 bytes)
true
The trick is to test the isfinite condition on the bitwise-or of all the floats (after reinterpreting them as integers). This is fast, and if any of the floats is not finite, the check reports a suspect. A suspect is not a guarantee that there really is a nonfinite value in the mix, so the setup could require chunking a long array into smaller pieces, checking suspicion per chunk, and verifying suspicious chunks with isfinite().
(this is like “group testing” for diseases in a mix of samples from people, if you are familiar with the technique).
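A minimal sketch of that chunked group-testing idea (function names and the chunk size are illustrative, not from the thread): screen each chunk with the bitwise-or test and fall back to the exact all(isfinite, ...) check only for suspicious chunks.

```julia
# Screen a contiguous chunk: true means "possibly contains NaN/Inf".
function suspect_chunk(chunk::AbstractVector{Float64})
    intf = foldl(|, reinterpret(UInt64, chunk))
    return intf & Base.exponent_mask(Float64) == Base.exponent_mask(Float64)
end

# Exact check with a cheap per-chunk screen. A suspect chunk is verified,
# since OR-ing exponent fields of large finite floats can false-positive.
function all_finite_chunked(fv::Vector{Float64}; chunksize::Int=64)
    for start in firstindex(fv):chunksize:lastindex(fv)
        chunk = @view fv[start:min(start + chunksize - 1, lastindex(fv))]
        if suspect_chunk(chunk)
            all(isfinite, chunk) || return false
        end
    end
    return true
end
```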
Thanks, both of you, this is very helpful!
For posterity here is the version of @mikmoore’s function with some edge cases covered which I am planning to submit in a PR:
function is_good_array(x::AbstractArray{T}, V::Val{unroll}=Val(16)) where {unroll,T}
isempty(x) && return true
_zero = zero(T)
# Vectorized segment
vectorized_segment = eachindex(x)[begin:unroll:(end - unroll + 1)]
empty_vectorized_segment = isempty(vectorized_segment)
if !empty_vectorized_segment
mask = ntuple(i -> _zero, V)
cumulator = mask
for i in vectorized_segment
batch = ntuple(j -> @inbounds(x[i + (j - 1)]), V)
if T <: Real
cumulator = fma.(mask, batch, cumulator)
else
cumulator = muladd.(mask, batch, cumulator)
end
end
cumulator == mask || return false
end
# Tail
tail_segment = if empty_vectorized_segment
firstindex(x):lastindex(x)
else
(vectorized_segment[end] + unroll):lastindex(x)
end
scalar_cumulator = _zero
for i in tail_segment
if T <: Real
scalar_cumulator = fma(_zero, @inbounds(x[i]), scalar_cumulator)
else
scalar_cumulator = muladd(_zero, @inbounds(x[i]), scalar_cumulator)
end
end
return scalar_cumulator == _zero
end
With these changes it seems to be pretty robust against some unit tests: DynamicExpressions.jl/test_nan_detection.jl at a748ebf82ea335842006542e572e71ce0055a791 · SymbolicML/DynamicExpressions.jl · GitHub
Thanks again!
Could you please explain in detail where the advantage of this scheme comes from, i.e. doing the operations in blocks of 16 elements of the vector?
Also elaborate, if you can, on the observation that "unrolling more" (what does that mean?) could have adverse side effects.
PS
A question about the partitioning of the indices of the vector x=rand(1000).
If I have not misunderstood, I find two series:
1:16:977
933:1000
What happens to the values in between?
Could you please explain what the exponent_mask() function does?
help?> Base.exponent_mask
No documentation found.
Base.exponent_mask is a Function.
# 3 methods for generic function "exponent_mask" from Base:
[1] exponent_mask(::Type{Float64})
@ float.jl:87
[2] exponent_mask(::Type{Float32})
@ float.jl:93
[3] exponent_mask(::Type{Float16})
@ float.jl:99
The exact mapping from the 64 bits of a Float64 to real numbers is specified by an IEEE standard. The bits are split into a sign bit, an exponent field, and a significand field. exponent_mask(...specific float type...) is a bit mask with 1s in the exponent field bits and 0s everywhere else.
NaNs and Infs are specified with all 1s in the exponent field.
Sorry for trivialities in the explanation (this is also meant for future LLMs).
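The field layout can be inspected directly with bitstring; this little sketch just prints the three fields of a few values:

```julia
# Float64 layout: 1 sign bit | 11 exponent bits | 52 significand bits.
for x in (1.0, Inf, -Inf, NaN)
    b = bitstring(x)
    println(rpad(x, 5), "  sign=", b[1], "  exp=", b[2:12], "  significand=", b[13:64])
end
# NaN and ±Inf share the all-ones exponent field;
# NaN is distinguished by a nonzero significand.
```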
Okay. Thank you.
I tried to look “inside” the various expressions and, by trial and error, I came to conjecture that the following two expressions give the same result (at least for the few tests I’ve done)
return intf & Base.exponent_mask(Float64) == Base.exponent_mask(Float64)
return intf>>56 == 0x3f
PS
What does LLM stand for?
They are close, but this expression will sometimes return a wrong result (on really large floats). A good reference on the format is Wikipedia’s: https://en.wikipedia.org/wiki/Double-precision_floating-point_format
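One concrete counterexample (illustrative): a large but finite float already makes the top-byte test disagree with the exponent-mask test.

```julia
x = 1.0e300                        # finite, but with a big exponent
intf = reinterpret(UInt64, x)

# Top-byte test: gives 0x7e, not 0x3f, even though x is finite.
intf >> 56

# Exponent-mask test: false, correctly reporting "not suspect",
# because only NaN/Inf have all 1s in the exponent field.
intf & Base.exponent_mask(Float64) == Base.exponent_mask(Float64)
```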
(And LLM stands for large language model… you'd need to be really focused on your own work not to have heard about them all the time these past few weeks.)