Is the triple `@inbounds @fastmath @simd` necessary for absolute peak performance?

using BenchmarkTools
using LoopVectorization

function a1(x)
    y = zero(eltype(x))
    for i in eachindex(x)
        y += x[i]
    end
    y
end

function a2(x)
    y = zero(eltype(x))
    @simd for i in eachindex(x)
        y += x[i]
    end
    y
end

function a3(x)
    y = zero(eltype(x))
    @inbounds @fastmath for i in eachindex(x)
        y += x[i]
    end
    return y
end

function a4(x)
    y = zero(eltype(x))
    @inbounds @fastmath @simd for i in eachindex(x)
        y += x[i]
    end
    return y
end


function a5(x)
    y = zero(eltype(x))
    @turbo for i in eachindex(x)
        y += x[i]
    end
    return y
end


x = rand(100_000_000);

@benchmark a1($x)
@benchmark a2($x)
@benchmark a3($x)
@benchmark a4($x)
@benchmark a5($x)

and I got these results, which show that @simd and @inbounds @fastmath perform about the same, and that combining them doesn't add much. Perhaps the compiler was smart enough to optimize a few things away, or the algorithm doesn't lend itself well to SIMD. So I'm not really sure what the best general approach is for speeding up array-processing code. My real code is a lot more complicated and involves calling some functions in the reduce step.

Benchmark results (a1 through a5, in order):

a1:
BenchmarkTools.Trial: 50 samples with 1 evaluation.
 Range (min … max):   98.881 ms … 108.322 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     100.404 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   101.257 ms ±   2.343 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

       █
  ▄▄▅█▅█▅▄▄▇▇▄▄▁▄▄▇▄▄▁▁▄▁▁▄▁▁▁▄▄▄▁▄▁▄▁▁▁▁▄▁▁▁▁▁▁▁▄▁▁▁▄▁▁▁▁▁▁▁▄▄ ▁
  98.9 ms          Histogram: frequency by time          108 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

a2:
BenchmarkTools.Trial: 141 samples with 1 evaluation.
 Range (min … max):  34.740 ms … 40.694 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.185 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.555 ms ±  1.002 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂█▄▄
  ██████▇▆▃▅▅▄▁▅▃▃▃▅▁▁▃▃▃▃▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  34.7 ms         Histogram: frequency by time        40.5 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

a3:
BenchmarkTools.Trial: 138 samples with 1 evaluation.
 Range (min … max):  34.956 ms … 43.002 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.081 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.368 ms ±  1.128 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁   ▃ █▄▅▂
  ▅▅▃█▆▅▅█▅████▃▆▆▆▅▁▄▆▅▅▃▄▄▃▄▃▁▁▁▄▁▁▁▁▁▃▁▁▃▃▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  35 ms           Histogram: frequency by time        40.9 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

a4:
BenchmarkTools.Trial: 140 samples with 1 evaluation.
 Range (min … max):  34.972 ms … 45.533 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.587 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.909 ms ±  1.382 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂█▄▄ ▁
  ▅████▆█▄▃▃▂▃▂▃▂▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  35 ms           Histogram: frequency by time        44.3 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

a5:
BenchmarkTools.Trial: 140 samples with 1 evaluation.
 Range (min … max):  34.838 ms … 41.175 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.573 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.935 ms ±  1.116 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▃▁▁█▂▂▁
  ▇█████████▆▅▄▄▃▄▅▁▄▄▁▃▄▃▃▃▁▄▃▁▃▄▃▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▃
  34.8 ms         Histogram: frequency by time        40.9 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

@fastmath has no effect on code inside function calls (it only rewrites the code it sees; it can't change code it doesn't see), and @simd is only a hint for the compiler to try harder to vectorise your loop: if the body of the loop is too complicated, it won't do anything anyway. Also, @simd should only be used when the conditions detailed in its docstring are satisfied; @fastmath is fairly useless with a bunch of functions, and its entire point is to trade accuracy for speed (if you care about accuracy, you may want to reconsider using it); and @inbounds should only ever be used if you're 100% sure that you're accessing elements within the bounds of the array.
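To illustrate the visibility issue, here is a minimal sketch (g and h are made-up names): @fastmath literally rewrites the operators in the expression it is applied to, so a call to another function passes through untouched.

g(a, b) = a + b  # compiled without any fast-math flags

function h(x, y)
    @fastmath begin
        s = x + y    # rewritten to the fast-math version of +
        t = g(x, y)  # unaffected: @fastmath cannot rewrite g's body
        s * t
    end
end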

So no, no one should suggest always using those macros blindly, without understanding their implications, just for the sake of maybe going faster.


Also, eachindex(x) implies that the accesses to x with that index are in bounds, and it communicates this to the compiler. So @inbounds should be completely irrelevant here.
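One way to check that claim (a sketch, reusing a1 from above): inspect the generated IR and look for a bounds-error branch; if the checks were elided, no call to throw_boundserror should appear.

using InteractiveUtils
# for a Vector and eachindex, the IR should contain no throw_boundserror
@code_llvm debuginfo=:none a1(rand(10))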


With N = 10^8 the computation is memory bound. Try a bunch of random sizes from 1:512 or so, e.g.:

using Random
N = 512
p = Random.randperm(N);
x = rand(N);
y = similar(x);
function evalrandsizes(f, y, x, perm)
    # apply f to prefixes of x whose lengths come in random order,
    # so each call sees a different small size
    for (i, n) in enumerate(perm)
        # @noinline keeps f from being inlined into this loop
        y[i] = @noinline f(@view(x[Base.oneto(n)]))
    end
end
@btime evalrandsizes(a1, $y, $x, $p)
@btime evalrandsizes(a2, $y, $x, $p)
@btime evalrandsizes(a3, $y, $x, $p)
@btime evalrandsizes(a4, $y, $x, $p)
@btime evalrandsizes(a5, $y, $x, $p)

I get

julia> @btime evalrandsizes(a1, $y, $x, $p)
  112.720 μs (0 allocations: 0 bytes)

julia> @btime evalrandsizes(a2, $y, $x, $p)
  12.232 μs (0 allocations: 0 bytes)

julia> @btime evalrandsizes(a3, $y, $x, $p)
  11.825 μs (0 allocations: 0 bytes)

julia> @btime evalrandsizes(a4, $y, $x, $p)
  11.809 μs (0 allocations: 0 bytes)

julia> @btime evalrandsizes(a5, $y, $x, $p)
  7.776 μs (0 allocations: 0 bytes)

> This is memory bound. Try a bunch of random sizes from 1:512 or so.

Could you clarify what you mean by memory bound? I think for N = 512 the arrays usually fit in first-level cache (512 Float64s are only 4 KiB).

EDIT: Ah, I see, you were probably referring to N = 10^8 in the OP.


To use these tools effectively (and safely!) it’s important to know how they achieve improved performance in a given setting.

@inbounds removes bounds checks. In "simple indexing" cases (e.g., iterating a range of indices) the compiler is usually smart enough to do this itself. But @inbounds can be useful if you're iterating through a set of saved indices that you know are inbounds but that the compiler probably can't prove are. For example, x[findall(>(0), x)] would probably need @inbounds to avoid bounds checks (although there are better ways to write that expression); see the sketch below.
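A minimal sketch of that saved-indices pattern (sum_positive is a made-up name): the indices come out of findall, so they are valid by construction, but the compiler can't prove that in the second loop.

function sum_positive(x)
    idx = findall(>(0), x)   # every i in idx is a valid index into x
    s = zero(eltype(x))
    @inbounds for i in idx   # safe only because idx was built from x itself
        s += x[i]
    end
    s
end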

@fastmath is too dangerous to use on more than a few lines at a time, and in many cases one can achieve similar performance by writing the code carefully (though sometimes this is tedious). But it can be very convenient to allow algebraic rewrites of intermediate-complexity equations that are not numerically sensitive and don't require Inf/NaN handling (@fastmath will usually discard such "frivolities", which can lead to serious mistakes).
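For example (a sketch; the polynomial and its coefficients are arbitrary), one can scope @fastmath to a single expression where reassociation and fma contraction are harmless:

# allow algebraic rewrites only in this one polynomial evaluation,
# not across the whole function
poly(x) = @fastmath 1.0 + x * (0.5 + x * (0.25 + x * 0.125))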

@simd permits just two of @fastmath's transformations (the reassociating ones: reassoc and contract) and also gives the compiler a little nudge towards SIMD. It's much safer than @fastmath but somewhat less powerful.

In your case here, the speedup is due to @fastmath or @simd allowing the loop to be vectorized (to accumulate multiple parallel sums and then combine them at the end). Normally, the compiler would not permit this because the answer differs from the totally-serial sum (floating-point addition is not associative). Note that if you were summing Ints then the compiler would likely do this without any annotations at all (since Int is associative over + and *).
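To make the parallel-sums idea concrete, here is a scalar sketch of what the reassociated reduction conceptually does (sum4 is a made-up name; the actual vectorized code keeps the partial sums in SIMD lanes rather than four scalar variables):

function sum4(x)
    # four independent accumulators; combining them at the end changes
    # the rounding of the result, which is why reassociation must be
    # explicitly permitted for floating-point element types
    s1 = s2 = s3 = s4 = zero(eltype(x))
    n = length(x)
    m = n - rem(n, 4)
    i = 1
    while i <= m
        s1 += x[i]
        s2 += x[i+1]
        s3 += x[i+2]
        s4 += x[i+3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)
    while i <= n   # handle the leftover elements
        s += x[i]
        i += 1
    end
    s
end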

@inbounds has no impact here because your use of eachindex (good job) permits the compiler to prove that all accesses are inbounds. The compiler might even be smart enough to know, for example, that 1:length(x) is inbounds for x::Vector (but eachindex and related constructs are still to be preferred).
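A quick illustration of why eachindex is preferred over 1:length(x) (a sketch using the OffsetArrays package, where arrays need not be 1-based):

using OffsetArrays
xo = OffsetArray(rand(5), -2:2)
eachindex(xo)   # -2:2, the valid indices of xo
# iterating 1:length(xo) would visit 1:5 and throw a BoundsError at xo[3]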


I edited for clarification.


This is not always the case, unfortunately. Only in simple cases can the compiler elide bounds checks with eachindex. E.g.:

julia> using LinearAlgebra

julia> function f(D)
       s = zero(eltype(D))
       for i in eachindex(D)
           s += D[i]
       end
       s
       end
f (generic function with 1 method)

julia> D = Diagonal(1:3);

julia> @code_llvm f(D)

The (excerpted) output below indicates that the bounds checks are still present:

L28:                                              ; preds = %L65, %top
  %value_phi13 = phi i64 [ %value_phi51, %L65 ], [ 1, %top ]
  %value_phi14 = phi i64 [ %value_phi52, %L65 ], [ 1, %top ]
  %value_phi19 = phi i64 [ %9, %L65 ], [ 0, %top ]
;  @ REPL[2]:4 within `f`
; β”Œ @ abstractarray.jl:1312 within `getindex`
; β”‚β”Œ @ abstractarray.jl:1358 within `_getindex`
; β”‚β”‚β”Œ @ /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/LinearAlgebra/src/diagonal.jl:177 within `getindex`
; β”‚β”‚β”‚β”Œ @ abstractarray.jl:699 within `checkbounds` @ abstractarray.jl:681
; β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:725 within `checkbounds_indices`
; β”‚β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:754 within `checkindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:86 within `-`
         %3 = add i64 %value_phi13, -1
; β”‚β”‚β”‚β”‚β”‚β”‚β””
; β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:513 within `<`
         %4 = icmp uge i64 %3, %1
; β”‚β”‚β”‚β”‚β”‚β””β””
; β”‚β”‚β”‚β”‚β”‚ @ abstractarray.jl:725 within `checkbounds_indices` @ abstractarray.jl:725
; β”‚β”‚β”‚β”‚β”‚β”Œ @ abstractarray.jl:754 within `checkindex`
; β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:86 within `-`
         %5 = add i64 %value_phi14, -1
; β”‚β”‚β”‚β”‚β”‚β”‚β””
; β”‚β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:513 within `<`
         %6 = icmp uge i64 %5, %1
; β”‚β”‚β”‚β”‚β””β””β””
; β”‚β”‚β”‚β”‚ @ abstractarray.jl:699 within `checkbounds`
      %.not70 = or i1 %4, %6
      br i1 %.not70, label %L62, label %L65

L62:                                              ; preds = %L28
      %7 = getelementptr inbounds [2 x i64], ptr %"new::Tuple", i64 0, i64 1
; β”‚β”‚β”‚β”‚ @ abstractarray.jl:697 within `checkbounds`
      store i64 %value_phi13, ptr %"new::Tuple", align 8
      store i64 %value_phi14, ptr %7, align 8
; β”‚β”‚β”‚β”‚ @ abstractarray.jl:699 within `checkbounds`
      call void @j_throw_boundserror_1320(ptr nocapture nonnull readonly %"D::Diagonal", ptr nocapture nonnull readonly %"new::Tuple") #8
      unreachable
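In a case like this, one can assert the bounds manually (a sketch; it relies on eachindex(D) producing only valid indices, which @inbounds then promises to the compiler):

function f_inbounds(D)
    s = zero(eltype(D))
    @inbounds for i in eachindex(D)   # indices from eachindex are valid
        s += D[i]
    end
    s
end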