@fastmath for library functions? (2x slowdown of binary search because of NaN checks in Base)

Hello everyone,

I’ve benchmarked Julia’s implementation of binary search (searchsortedlast) to test whether Julia can match C performance (spoiler: it can). However, I found that the isnan checks within isless cause a 1.5–2x slowdown (depending on array size) when values from one sorted Vector{Float64} are searched in another sorted Vector{Float64}. When the array of search queries is not sorted, the slowdown due to isnan becomes small compared to the slowdown from the irregular memory access pattern, but it’s still there.

In the current stable release (1.8.5), these checks can be turned off with --math-mode=fast. However, in 1.9.0-beta3 this flag does nothing (#41638), and @fastmath doesn’t help either, because searchsortedlast is defined in Base.Sort, not in my script.

So, my question is: if I write my code carefully, avoiding division by zero and other hazards, and I don’t need accuracy to within 1 ulp, how can I apply @fastmath to functions from other modules? Could we have something like @fastmath import Base: searchsortedlast? Or a way for package developers to explicitly declare certain functions as “fastmath-safe” (@fastmathsafe export func1, func2), so that the global --math-mode=fast flag applies only to them?

My code (nonan.jl):

if "nonan" ∈ ARGS
    Base.isnan(x::T) where T <: AbstractFloat = false
end

using BenchmarkTools
import Random: seed!, shuffle

heading(str) = println("\n\n", str, "\n", "-"^length(str))

const T = Float64
const N = 2^20

Da = UInt32
seed!(collect(Da, "weed!"))

const arr = Vector{T}(sort(rand(1:N^2, N)))
const vals = Vector{T}(sort(rand(1:N^2, N)))
const slav = shuffle(vals)

function predecessor(x, arr)
    @fastmath searchsortedlast(arr, x)
end

const Nsamples = 10

function run_benchmarks()
    heading("Sorted queries")
    benchmark = @benchmark [predecessor(val, $arr) for val in $vals] samples=Nsamples
    println(IOContext(stdout, :compact => false), benchmark)

    heading("Shuffled queries")
    mbecnhkra = @benchmark [predecessor(lav, $arr) for lav in $slav] samples=Nsamples
    println(IOContext(stdout, :compact => false), mbecnhkra)
end

run_benchmarks()

With Julia 1.8.5, julia nonan.jl outputs

Sorted queries
--------------
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  125.497 ms … 154.638 ms  ┊ GC (min … max): 0.00% … 0.18%
 Time  (median):     138.220 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   138.085 ms ±  11.044 ms  ┊ GC (mean ± σ):  0.02% ± 0.06%

  █   █                   ▁   ▁            ▁    ▁     ▁       ▁  
  █▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁█ ▁
  125 ms           Histogram: frequency by time          155 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.


Shuffled queries
----------------
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  553.379 ms … 619.787 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     571.075 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   573.963 ms ±  19.642 ms  ┊ GC (mean ± σ):  0.01% ± 0.02%

  █ █    █       ██    ██ █                                   █  
  █▁█▁▁▁▁█▁▁▁▁▁▁▁██▁▁▁▁██▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  553 ms           Histogram: frequency by time          620 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia --math-mode=fast nonan.jl outputs

Sorted queries
--------------
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  65.392 ms … 71.410 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     70.893 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   68.890 ms ±  2.780 ms  ┊ GC (mean ± σ):  0.03% ± 0.10%

  ▁▁▁   ▁                                              ██ ▁ ▁  
  ███▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁█▁█ ▁
  65.4 ms         Histogram: frequency by time        71.4 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.


Shuffled queries
----------------
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  490.365 ms … 517.467 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     504.671 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   502.479 ms ±  10.312 ms  ┊ GC (mean ± σ):  0.01% ± 0.02%

  █ ▁    ▁                ▁              ▁    █   ▁           ▁  
  █▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█ ▁
  490 ms           Histogram: frequency by time          517 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia nonan.jl nonan (disabling isnan checks manually) gives similar results: 82 ms (median) for sorted queries and 494 ms for shuffled queries.

For 1.9.0-beta3, the results are similar except that --math-mode=fast doesn’t work.

Summary of median times (ms):

                                     Sorted   Shuffled
  1.8.5
    julia nonan.jl                      138        571
    julia --math-mode=fast nonan.jl      71        505
    julia nonan.jl nonan                 82        494
  1.9.0-beta3
    julia nonan.jl                      144        655
    julia --math-mode=fast nonan.jl     144        638
    julia nonan.jl nonan                 71        582

I think it would be possible to make the @fastmath macro somewhat user-extensible, but it’s not entirely trivial, and you could end up with potentially weird world-age issues if you did. For searchsortedlast in particular, I think you could recover the performance by using a @fastmath version of the comparison operator.


I think the type piracy at the start is part of the problem. I get these values without it, passing @fastmath(<) as the lt (less-than) function that searchsortedlast uses:

julia> @benchmark [searchsortedlast(arr, v; lt=@fastmath(<)) for v in vals] setup=(N=2^20; arr=sort!(rand(1:N^2, N)); vals=sort!(rand(1:N^2, N))) samples=10
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  21.465 ms …  22.156 ms  ┊ GC (min … max): 0.00% … 1.67%
 Time  (median):     21.507 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.579 ms ± 208.001 μs  ┊ GC (mean ± σ):  0.17% ± 0.53%

  ▃█   ▃                                                        
  ██▁▁▁█▁▁▁▇▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  21.5 ms         Histogram: frequency by time         22.2 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @fastmath(<)
lt_fast (generic function with 4 methods)

julia> @benchmark [searchsortedlast(arr, v; lt=<) for v in vals] setup=(N=2^20; arr=sort!(rand(1:N^2, N)); vals=sort!(rand(1:N^2, N))) samples=10
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  22.096 ms …  22.460 ms  ┊ GC (min … max): 0.00% … 1.50%
 Time  (median):     22.123 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.174 ms ± 109.334 μs  ┊ GC (mean ± σ):  0.15% ± 0.47%

  ▃   █                                                         
  █▁▁▇█▁▁▁▁▁▁▁▁▇▁▁▁▇▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  22.1 ms         Histogram: frequency by time         22.5 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark [searchsortedlast(arr, v; lt=@fastmath(<)) for v in vals] setup=(N=2^20; arr=sort!(rand(1:N^2, N)); vals=rand(1:N^2, N)) samples=10
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  109.495 ms … 120.874 ms  ┊ GC (min … max): 0.32% … 0.30%
 Time  (median):     116.532 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   115.484 ms ±   3.093 ms  ┊ GC (mean ± σ):  0.09% ± 0.15%

  ▁                ▁  ▁       ▁       ▁ ▁█▁                   ▁  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁███▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  109 ms           Histogram: frequency by time          121 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

julia> @benchmark [searchsortedlast(arr, v; lt=<) for v in vals] setup=(N=2^20; arr=sort!(rand(1:N^2, N)); vals=rand(1:N^2, N)) samples=10
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  111.870 ms … 123.802 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     119.508 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   118.725 ms ±   3.647 ms  ┊ GC (mean ± σ):  0.09% ± 0.15%

                    ▃                    █                       
  ▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▇ ▁
  112 ms           Histogram: frequency by time          124 ms <

 Memory estimate: 8.00 MiB, allocs estimate: 2.

So no performance difference at all.

julia> versioninfo()
Julia Version 1.10.0-DEV.402
Commit 4cab76ce52 (2023-01-19 20:38 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 24 on 24 virtual cores
Environment:
  JULIA_NUM_THREADS = 24

There isn’t such a thing as a @fastmath implementation of <. Comparisons are single native instructions, and there’s no way to make them faster than that. You are, however, correct that < is faster than the default isless comparison operator if your inputs don’t include NaN and you don’t care about the ordering of -0.0 vs +0.0. isless differs from < in that it implements -0.0 < +0.0 and +Inf < NaN. Using < is the way to get a “fast” sort/searchsorted/etc.; you don’t need a @fastmath flag. While it’s possible that @fastmath interactions with these functions might be added in the future, I wouldn’t expect it soon.
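For reference, the places where the two orderings disagree are easy to check directly:

```julia
# isless implements a total order on Float64: -0.0 sorts before +0.0,
# and NaN sorts after everything (including +Inf).
println(isless(-0.0, 0.0))  # true
println(isless(Inf, NaN))   # true

# The IEEE operator < treats -0.0 and +0.0 as equal, and any
# comparison involving NaN is false.
println(-0.0 < 0.0)         # false
println(Inf < NaN)          # false
```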

julia> using BenchmarkTools

julia> x = sort!(rand(1000));

julia> @benchmark searchsortedlast($x,v;lt=isless) setup=(v=rand())
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  16.032 ns … 72.846 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     17.836 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.237 ns ±  3.048 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▂▁  ▂█▇▅ ▁                                                 ▁
  ███▄▁████▇███▅▅▁▃▃▁▃▃▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▃▁▁▁▁▁▁▁▅▅▄▄ █
  16 ns        Histogram: log(frequency) by time      32.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark searchsortedlast($x,v;lt=<) setup=(v=rand())
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.407 ns … 49.149 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.709 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.931 ns ±  2.033 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁ ▂▁   ▁▂    ▁ ▄ ▆▇ █ ▇▆ ▄ ▅▅ ▃ ▂  ▂ ▁                    ▂
  ▇█▁██▁█▁██▁▇▁▆█▁█▁██▁█▁██▁█▁██▁█▁██▁█▁██▁▇▁▇▆▁▇▁▅▅▁▅▁▄▅▁▄▄ █
  7.41 ns      Histogram: log(frequency) by time     10.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Yes, I should have used isless in my comparison, you’re right.

Yes there is, surprisingly enough (though it doesn’t make a difference in the end for this):

julia> @fastmath(<)
lt_fast (generic function with 4 methods)

julia> @fastmath(<) |> methods
# 4 methods for generic function "lt_fast" from Base.FastMath:
 [1] lt_fast(x::T, y::T) where T<:Union{Float16, Float32, Float64}
     @ fastmath.jl:185
 [2] lt_fast(x::T, ys::T...) where T<:Number
     @ fastmath.jl:274
 [3] lt_fast(x::Number, y::Number, zs::Number...)
     @ fastmath.jl:271
 [4] lt_fast(xs...)
     @ fastmath.jl:269

which ends up mapping to the lt_float_fast intrinsic in src/intrinsics.cpp:1467:

    case lt_float_fast: *newtyp = jl_bool_type; return math_builder(ctx, true)().CreateFCmpOLT(x, y);

giving the semantic guarantee for floating point types that those instructions are emitted. Regular < does not have that guarantee - it calls Intrinsics.lt_float, which just happens to be mapped to the same intrinsic via an alias in src/intrinsics.h:45:

    ALIAS(lt_float_fast, lt_float) \

I don’t know why it was implemented this way :person_shrugging:
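The rewriting itself is easy to confirm at the REPL: @fastmath simply swaps the operator for Base.FastMath.lt_fast, and the two comparisons agree wherever no NaN is involved.

```julia
# @fastmath(<) evaluates to the function Base.FastMath.lt_fast.
@assert @fastmath(<) === Base.FastMath.lt_fast

# On ordinary values the fast and regular comparisons give the same answer.
@assert (1.0 < 2.0) == Base.FastMath.lt_fast(1.0, 2.0)
@assert (2.0 < 1.0) == Base.FastMath.lt_fast(2.0, 1.0)
```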


This function does make a big difference in some cases.

julia> x = rand(512);

julia> @btime minimum($x)
  848.195 ns (0 allocations: 0 bytes)
0.0004658392690449764

julia> @btime reduce((x,y) -> x<y ? x : y, $x, init=Inf)
  395.203 ns (0 allocations: 0 bytes)
0.0004658392690449764

julia> @btime reduce((x,y) -> @fastmath(x < y) ? x : y, $x, init=Inf)
  33.046 ns (0 allocations: 0 bytes)
0.0004658392690449764

@fastmath isn’t just about swapping out one implementation for another, but about communicating assumptions to the compiler about transformations it is allowed to apply.
In this case, it tells the compiler it is allowed to do a SIMD reduction.

Doing so risks missing NaN values, but if we promise no-nans, then reordering this reduction sequence is suddenly legal.
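For comparison, here is a sketch of getting the same reassociation without @fastmath: write the reduction branchlessly with ifelse and annotate the loop with @simd, which explicitly licenses reordering the reduction. (Whether LLVM actually vectorizes it depends on the Julia version and target; min_branchless is just an illustrative name.)

```julia
# Branchless minimum over a vector. ifelse evaluates both arguments and
# compiles to a select, so with @simd the loop can become SIMD min
# instructions. Like the @fastmath version, this skips NaN values.
function min_branchless(x::AbstractVector{Float64})
    m = Inf
    @inbounds @simd for v in x
        m = ifelse(v < m, v, m)
    end
    return m
end
```

For NaN-free input this returns the same value as minimum, since min is exactly order-independent.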


Very nice. You are, of course, correct that @fastmath enables the no-signed-zeros and no-NaNs assumptions, under which x<y ? x : y becomes associative. That said, I can’t reproduce that speedup on my x86-64 running v1.8.0 (it runs at the same slower speed), but maybe it works better in a later version.

On the other hand, reduce makes no guarantee of the reduction order so @fastmath shouldn’t be required for SIMD here. However, in most cases (including this one) it fails to exploit this freedom and simply punts to foldl. Related to issue #48129, which I’ve augmented with this example.


This requires at least 1.9; I also get the same (bad) performance on 1.8 whether or not I am using @fastmath.
I’m not sure what the problem was; it is clearly adding the fast flag.

julia> @code_llvm debuginfo = :none Base.FastMath.lt_fast(3.4, 5.4)
define i8 @julia_lt_fast_1200(double %0, double %1) #0 {
top:
  %2 = fcmp fast olt double %0, %1
  %3 = zext i1 %2 to i8
  ret i8 %3
}

I agree that it’d be nice if reduce didn’t need @fastmath/shouldn’t have to punt to foldl.

It may be related to Define `gt_fast` and `ge_fast` by chriselrod · Pull Request #47972 · JuliaLang/julia · GitHub, which you opened after we discovered the SIMD problems on Slack while trying to make my threaded code vectorize better :slight_smile: Incidentally, that PR is now also the reason my code is too fast, but that’s a good problem to have.

reduce(@fastmath(max), $x, init=Inf) is about as fast. On master, @fastmath maximum is too.


Sorry if this is a dumb question, but can someone explain what’s going on here?

julia> let x=rand(10_000), y=[x; -x], s=Vector{Float64}(undef, 6);
           s[1] = @btime sum($y)
           s[2] = @btime reduce(+, $y)
           s[3] = @btime reduce(+, $y, init=0.0)
           s[4] = @btime reduce(@fastmath(+), $y, init=0.0)
           s[5] = @btime foldl(+, $y)
           s[6] = @btime foldl(@fastmath(+), $y)
           s
       end
  1.710 μs (0 allocations: 0 bytes)
  1.710 μs (0 allocations: 0 bytes)
  10.200 μs (0 allocations: 0 bytes)
  1.710 μs (0 allocations: 0 bytes)
  10.200 μs (0 allocations: 0 bytes)
  1.710 μs (0 allocations: 0 bytes)
6-element Vector{Float64}:
  0.0
  0.0
 -2.850042424284993e-11
  3.049782648645305e-13
 -2.850042424284993e-11
  3.049782648645305e-13

Due to limited precision, floating-point math is not associative. Obviously, division and subtraction never are ((a / b) / c != a / (b / c)), but for floating-point numbers this extends to addition and multiplication (which are associative in infinite precision). This is why not every result is zero. The ones that are zero just got lucky; it’s not that they’re intrinsically more accurate.

What you’re seeing here can be reproduced by a simpler example

julia> ((0.1 + 0.2) - 0.1) - 0.2
2.7755575615628914e-17

This isn’t to say floating point math is randomly noisy. Every individual operation here is correctly rounded (the result is the closest representable number) but nevertheless the result is “incorrect” relative to if this operation had been done in infinite precision. So while floating point math isn’t random, it does accumulate round-off error.

You’ll notice that all those expressions are computed at one of two speeds (1.7 μs or 10.2 μs). The difference is that the fast ones use re-association while the slow ones use a strict left-to-right association, ((a + b) + c) + d.... By re-associating the sum to something like (a + b) + (c + d) + ... (although in practice it won’t look exactly like that), the computer can use SIMD instructions to parallelize the computation. It also makes it easier and more worthwhile for the compiler to perform loop unrolling. These two factors, taken together, are what give you the 6x speedup.

So why doesn’t the compiler re-associate these operations all the time? After all, it’s much faster, and the result is subject to round-off error anyway. Because the rules of floating-point arithmetic are very strict, and the compiler obeys them absolutely: some computations depend on the compiler following those rules exactly. However, the internals of reduce(+, ...) or sum, or the use of @fastmath, explicitly tell the compiler that it’s okay to re-associate the operations for speed (warning: @fastmath can also enable a bunch of other transformations, so use it very carefully).

foldl explicitly obeys left-to-right associativity, but with @fastmath the compiler sees through this and changes it anyway. reduce and sum do not have any specific guaranteed associativity, but we currently only exploit this in a narrow range of cases (see #48129) and the init keyword appears to be breaking it out of those special cases.


foldl docs state this (emph mine): “Like reduce, but with guaranteed left associativity.” I would’ve thought this “guarantee” should be overriding :sweat_smile: but I guess the compiler flag overrides the order of the Julia code.

One other thing to notice is that the residue takes on one of three values: one for left-associative [slow] addition, one for non-fastmath re-associative addition, and yet another for fastmath addition. Is fastmath addition simply choosing a different (but comparable) execution order?

This is what I figured, especially since I recall getting a nonzero value once. But run after run after run, I get exactly 0.0 for them. They seem improbably lucky. In fact, when I try it again, I get this (comparison: sum vs reduce(+, _) vs foldl(+, _), ten thousand runs):

julia> using UnicodePlots

julia> histogram(sum.((x=rand(10_000); [x; -x]) for _=1:10_000))
              ┌                                        ┐
   [0.0, 1.0) ┤████████████████████████████████  10 000
              └                                        ┘
                               Frequency

julia> histogram(reduce.(+, (x=rand(10_000); [x; -x]) for _=1:10_000))
              ┌                                        ┐
   [0.0, 1.0) ┤████████████████████████████████  10 000
              └                                        ┘
                               Frequency

julia> histogram(foldl.(+, (x=rand(10_000); [x; -x]) for _=1:10_000))
                        ┌                                        ┐
   [-8.0e-11, -7.0e-11) ┤▏ 1
   [-7.0e-11, -6.0e-11) ┤▎ 9
   [-6.0e-11, -5.0e-11) ┤▊ 47
   [-5.0e-11, -4.0e-11) ┤██▍ 142
   [-4.0e-11, -3.0e-11) ┤██████▍ 392
   [-3.0e-11, -2.0e-11) ┤█████████████▊ 857
   [-2.0e-11, -1.0e-11) ┤███████████████████████▊ 1 479
   [-1.0e-11,  0.0    ) ┤████████████████████████████████▎ 1 992
   [ 0.0    ,  1.0e-11) ┤█████████████████████████████████  2 044
   [ 1.0e-11,  2.0e-11) ┤████████████████████████▋ 1 533
   [ 2.0e-11,  3.0e-11) ┤█████████████▊ 861
   [ 3.0e-11,  4.0e-11) ┤███████▏ 434
   [ 4.0e-11,  5.0e-11) ┤██▋ 162
   [ 5.0e-11,  6.0e-11) ┤▋ 39
   [ 6.0e-11,  7.0e-11) ┤▎ 8
                        └                                        ┘
                                         Frequency

julia> sum(x->x ≡ 0.0, sum.((x=rand(10_000); [x; -x]) for _=1:10_000))
10000

julia> sum(x->x ≡ 0.0, reduce.(+, (x=rand(10_000); [x; -x]) for _=1:10_000))
10000

julia> sum(x->x ≡ 0.0, foldl.(+, (x=rand(10_000); [x; -x]) for _=1:10_000))
0

I am scratching my head very, very vigorously.

This is just one of the reasons that @fastmath should only be used in tightly controlled situations. Julia guaranteed that the operation would be evaluated left-to-right, but the compiler (with the license @fastmath grants it) gets the final say. For what it’s worth, there’s almost nothing @fastmath can do that you can’t match manually, although it is often tedious to do so.

Julia’s reduce (and sum(x) is essentially just reduce(Base.add_sum,x) where add_sum is + with some special handling for certain types) uses a version of pairwise reduction with a large base case (1024 in v1.8.0 for most operations).

This means that any input larger than the base case gets recursively split in half until it’s no bigger than the base case. So for sufficiently long x, the first step of sum([x;-x]) is sum(x) + sum(-x), and from here the zero result should be obvious. You’d get a nonzero result if you used a shorter x or scrambled the input (try Random.shuffle).

Of course, the pairwise summation is an implementation detail and should not be relied upon as it is subject to change in details or entirety.
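The splitting can be sketched like this (a simplification; Base’s actual mapreduce machinery differs in its traversal details, and pairwise_sum is just an illustrative name):

```julia
# Illustrative pairwise summation: recursively halve the range until it is
# smaller than the base case, then sum sequentially. The first split of
# [x; -x] separates x from -x, so the two halves cancel exactly.
function pairwise_sum(x, lo=firstindex(x), hi=lastindex(x); basesize=1024)
    if hi - lo < basesize
        s = zero(eltype(x))
        @inbounds for i in lo:hi
            s += x[i]
        end
        return s
    else
        mid = (lo + hi) >>> 1
        return pairwise_sum(x, lo, mid; basesize=basesize) +
               pairwise_sum(x, mid + 1, hi; basesize=basesize)
    end
end

x = rand(10_000)
pairwise_sum([x; -x])  # exactly 0.0, for the same reason sum([x; -x]) is
```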

Obviously, foldl is not using the recursive subdivision. I suspect that reduce with the init keyword is not doing so either. So your different results come from three different orders:

  1. Strict left-to-right
  2. Re-associated
  3. Pairwise-subdivided re-associated

Thanks, this makes sense now! :sunglasses: