using BenchmarkTools
using LoopVectorization
function a1(x)
    y = zero(eltype(x))
    for i in eachindex(x)
        y += x[i]
    end
    return y
end

function a2(x)
    y = zero(eltype(x))
    @simd for i in eachindex(x)
        y += x[i]
    end
    return y
end

function a3(x)
    y = zero(eltype(x))
    @inbounds @fastmath for i in eachindex(x)
        y += x[i]
    end
    return y
end

function a4(x)
    y = zero(eltype(x))
    @inbounds @fastmath @simd for i in eachindex(x)
        y += x[i]
    end
    return y
end

function a5(x)
    y = zero(eltype(x))
    @turbo for i in eachindex(x)
        y += x[i]
    end
    return y
end
x = rand(100_000_000);
@benchmark a1($x)
@benchmark a2($x)
@benchmark a3($x)
@benchmark a4($x)
@benchmark a5($x)
I got these results, which show that @simd and @inbounds @fastmath perform similarly and don't really add much on top of each other. Perhaps the compiler was smart enough to optimize a few things away, or the algorithm doesn't lend itself well to SIMD. So I'm not really sure what the best general approach to speeding up array-processing code is. My real code is a lot more complicated and involves calling some functions in the reduce step.
Summary
a1:
BenchmarkTools.Trial: 50 samples with 1 evaluation.
 Range (min … max):  98.881 ms … 108.322 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     100.404 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   101.257 ms ±   2.343 ms ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram: 98.9 ms … 108 ms, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

a2:
BenchmarkTools.Trial: 141 samples with 1 evaluation.
 Range (min … max):  34.740 ms … 40.694 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.185 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.555 ms ±  1.002 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram: 34.7 ms … 40.5 ms, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

a3:
BenchmarkTools.Trial: 138 samples with 1 evaluation.
 Range (min … max):  34.956 ms … 43.002 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.081 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.368 ms ±  1.128 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram: 35 ms … 40.9 ms, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

a4:
BenchmarkTools.Trial: 140 samples with 1 evaluation.
 Range (min … max):  34.972 ms … 45.533 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.587 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.909 ms ±  1.382 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram: 35 ms … 44.3 ms, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

a5:
BenchmarkTools.Trial: 140 samples with 1 evaluation.
 Range (min … max):  34.838 ms … 41.175 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.573 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.935 ms ±  1.116 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram: 34.8 ms … 40.9 ms, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
@fastmath has no effect on code inside function calls (it only rewrites the code it sees; it can't change code it doesn't see), and @simd is only a hint for the compiler to try harder to vectorize your loop: if the body of the loop is too complicated, it won't do anything anyway. Also, @simd should only be used when the conditions detailed in its docstring are satisfied; @fastmath is fairly useless with a bunch of functions, and its entire point is to trade accuracy for speed (if you care about accuracy, you may want to reconsider using it); and @inbounds should only ever be used if you're 100% sure you're accessing elements within the bounds of the array.
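A small sketch of the first point (function names here are made up for illustration): @fastmath is a purely syntactic rewrite, so it only transforms operators that appear literally inside the annotated expression, never the body of a function it calls.

```julia
# g is compiled normally; @fastmath in a caller cannot reach into it.
g(a, b) = a / b

# Only the `+` written literally under @fastmath is rewritten to its
# fast variant; the call g(a, b) is passed through unchanged.
h(a, b) = @fastmath g(a, b) + 0.0

h(1.0, 2.0)  # 0.5
```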
So no, no one should suggest always blindly using those macros, without understanding their implications, just for the sake of maybe going faster.
Also, eachindex(x) implies that accesses to x with those indices are inbounds, and it communicates this to the compiler. So @inbounds should be completely irrelevant here.
N = 10^8 is memory-bound. Try a bunch of random sizes from 1:512 or so.
E.g.
using Random
N = 512
p = Random.randperm(N);
x = rand(N);
y = similar(x);
function evalrandsizes(f, y, x, perm)
    for (i, n) in enumerate(perm)
        # @noinline at the call site keeps f from being specialized away
        y[i] = @noinline f(@view(x[1:n]))
    end
end
@btime evalrandsizes(a1, $y, $x, $p)
@btime evalrandsizes(a2, $y, $x, $p)
@btime evalrandsizes(a3, $y, $x, $p)
@btime evalrandsizes(a4, $y, $x, $p)
@btime evalrandsizes(a5, $y, $x, $p)
To use these tools effectively (and safely!) it's important to know how they achieve improved performance in a given setting.
@inbounds removes bounds checks. In "simple indexing" cases (e.g., iterating a range of indices) the compiler is usually smart enough to do this itself. But @inbounds can be useful if you're iterating through a set of saved indices that you know are inbounds but that the compiler probably doesn't. For example, x[findall(>(0), x)] would probably need @inbounds to avoid bounds checks (although there are better ways to write that expression).
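A hedged sketch of that saved-indices scenario (the function name is made up): the indices come from findall over the array itself, so we know they are inbounds, but the compiler likely cannot prove it.

```julia
# Sum x over a precomputed index set. The indices came from findall
# over x itself, so they are guaranteed inbounds, but the compiler
# doesn't know that; @inbounds removes the per-access checks.
function sum_at(x, idxs)
    s = zero(eltype(x))
    @inbounds for j in idxs   # safe only because idxs ⊆ eachindex(x)
        s += x[j]
    end
    return s
end

x = [-1.0, 2.0, -3.0, 4.0]
idxs = findall(>(0), x)   # [2, 4]
sum_at(x, idxs)           # 6.0
```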
@fastmath is too dangerous to use on more than a few lines at a time, and in many cases one can achieve similar performance by writing the code carefully (though sometimes this is tedious). But it can be very convenient to allow algebraic rewrites of intermediate-complexity equations that are not numerically sensitive and don't require Inf/NaN handling (@fastmath will usually discard such "frivolities," which can lead to serious mistakes).
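One way to see the Inf/NaN caveat (an illustrative sketch; the exact behavior of the fast version is compiler- and optimization-dependent): the standard NaN test x != x relies on IEEE semantics that @fastmath is allowed to assume away.

```julia
# x != x is true only for NaN under IEEE semantics.
plain_isnan(x) = x != x

# Under @fastmath the compiler may assume NaNs never occur, so this
# predicate can be folded to `false` even for NaN input; its result
# is effectively undefined.
fast_isnan(x) = @fastmath x != x

plain_isnan(NaN)   # true
# fast_isnan(NaN)  # may be true or false depending on optimization
```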
@simd permits just two of @fastmath's transformations (the reassociating ones: reassoc and contract) and also gives a little code nudge towards SIMD. It's much safer than @fastmath but somewhat less powerful.
In your case here, the speedup is due to @fastmath or @simd allowing the loop to be vectorized (accumulating multiple parallel sums and then combining them at the end). Normally the compiler would not permit this because the answer differs from the totally-serial sum (floating-point addition is non-associative). Note that if you were summing Ints, then the compiler would likely do this without any annotations at all (since Int is associative over + and *).
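A minimal demonstration of that non-associativity: the same three Float64 values summed under two different associations give different answers, which is exactly why the compiler won't reassociate a Float64 reduction without permission.

```julia
# 1e16 + 1.0 rounds back to 1e16 (1.0 is below the spacing of
# Float64 values near 1e16), so the association matters.
left  = foldl(+, [1.0, 1e16, -1e16])   # (1.0 + 1e16) - 1e16 == 0.0
right = foldl(+, [1e16, -1e16, 1.0])   # (1e16 - 1e16) + 1.0 == 1.0
```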
@inbounds has no impact here because your use of eachindex (good job) permits the compiler to prove that all accesses are inbounds. The compiler might even be smart enough to know, for example, that 1:length(x) are inbounds for x::Vector (but eachindex and related constructs are still to be preferred).
This is not always the case, unfortunately. Only in simple cases can the compiler elide bounds-checks with eachindex. E.g.:
julia> using LinearAlgebra
julia> function f(D)
           s = zero(eltype(D))
           for i in eachindex(D)
               s += D[i]
           end
           s
       end
f (generic function with 1 method)
julia> D = Diagonal(1:3);
julia> @code_llvm f(D)
The LLVM IR it prints shows that the bounds checks are still present.