using BenchmarkTools
using LoopVectorization
function a1(x)
    y = zero(eltype(x))
    for i in eachindex(x)
        y += x[i]
    end
    return y
end

function a2(x)
    y = zero(eltype(x))
    @simd for i in eachindex(x)
        y += x[i]
    end
    return y
end

function a3(x)
    y = zero(eltype(x))
    @inbounds @fastmath for i in eachindex(x)
        y += x[i]
    end
    return y
end

function a4(x)
    y = zero(eltype(x))
    @inbounds @fastmath @simd for i in eachindex(x)
        y += x[i]
    end
    return y
end

function a5(x)
    y = zero(eltype(x))
    @turbo for i in eachindex(x)
        y += x[i]
    end
    return y
end
x = rand(100_000_000);
@benchmark a1($x)
@benchmark a2($x)
@benchmark a3($x)
@benchmark a4($x)
@benchmark a5($x)
I got these results, which show that @simd and @inbounds @fastmath perform similarly and don't really add much on top of each other. Perhaps the compiler was smart enough to optimize a few things away, or the algorithm doesn't lend itself well to SIMD. So I'm not really sure what the best general approach to speeding up array-processing code is. My real code is a lot more complicated and involves calling some functions in the reduce step.
Summary
a1:
BenchmarkTools.Trial: 50 samples with 1 evaluation.
 Range (min … max):  98.881 ms … 108.322 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     100.404 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   101.257 ms ±   2.343 ms ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram: 98.9 ms … 108 ms, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

a2:
BenchmarkTools.Trial: 141 samples with 1 evaluation.
 Range (min … max):  34.740 ms … 40.694 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.185 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.555 ms ±  1.002 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram: 34.7 ms … 40.5 ms, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

a3:
BenchmarkTools.Trial: 138 samples with 1 evaluation.
 Range (min … max):  34.956 ms … 43.002 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.081 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.368 ms ±  1.128 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram: 35 ms … 40.9 ms, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

a4:
BenchmarkTools.Trial: 140 samples with 1 evaluation.
 Range (min … max):  34.972 ms … 45.533 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.587 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.909 ms ±  1.382 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram: 35 ms … 44.3 ms, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.

a5:
BenchmarkTools.Trial: 140 samples with 1 evaluation.
 Range (min … max):  34.838 ms … 41.175 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.573 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.935 ms ±  1.116 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram: 34.8 ms … 40.9 ms, frequency by time]
 Memory estimate: 0 bytes, allocs estimate: 0.
@fastmath has no effect on code inside function calls (it only rewrites the code it sees; it can't change code it doesn't see), and @simd is only a hint for the compiler to try harder to vectorize your loop: if the body of the loop is too complicated, it won't do anything anyway. Also, @simd should only be used when the conditions detailed in its docstring are satisfied; @fastmath is fairly useless with a bunch of functions, and its entire point is to trade accuracy for speed (if you care about accuracy, you may want to reconsider using it); and @inbounds should only ever be used if you're 100% sure you're accessing elements within the bounds of the array.
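A small sketch of the first point (function names here are made up for illustration): @fastmath is a purely syntactic rewrite, so it only transforms operators that appear literally inside the annotated expression, never the body of a function it calls.

```julia
# g is compiled normally; @fastmath in a caller cannot reach into it.
g(a, b) = a / b

# Only the `+` written literally under @fastmath is rewritten to its
# fast variant; the call g(a, b) is passed through unchanged.
h(a, b) = @fastmath g(a, b) + 0.0

h(1.0, 2.0)  # 0.5
```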
So no, no one should suggest always blindly using those macros, without understanding their implications, just for the sake of maybe going faster.
Also, eachindex(x) implies that accesses to x with those indices are inbounds, and it communicates this to the compiler. So @inbounds should be completely irrelevant here.
N = 10^8 is memory-bound. Try a bunch of random sizes from 1:512 or so.
E.g.
using Random
N = 512
p = Random.randperm(N);
x = rand(N);
y = similar(x);
function evalrandsizes(f, y, x, perm)
    for (i, n) in enumerate(perm)
        # @noinline at the call site keeps f from being specialized away
        y[i] = @noinline f(@view(x[1:n]))
    end
end
@btime evalrandsizes(a1, $y, $x, $p)
@btime evalrandsizes(a2, $y, $x, $p)
@btime evalrandsizes(a3, $y, $x, $p)
@btime evalrandsizes(a4, $y, $x, $p)
@btime evalrandsizes(a5, $y, $x, $p)
To use these tools effectively (and safely!) it's important to know how they achieve improved performance in a given setting.
@inbounds removes bounds checks. In "simple indexing" cases (e.g., iterating a range of indices) the compiler is usually smart enough to do this itself. But @inbounds can be useful if you're iterating through a set of saved indices that you know are inbounds but that the compiler probably doesn't. For example, x[findall(>(0), x)] would probably need @inbounds to avoid bounds checks (although there are better ways to write that expression).
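A hedged sketch of that saved-indices scenario (the function name is made up): the indices come from findall over the array itself, so we know they are inbounds, but the compiler likely cannot prove it.

```julia
# Sum x over a precomputed index set. The indices came from findall
# over x itself, so they are guaranteed inbounds, but the compiler
# doesn't know that; @inbounds removes the per-access checks.
function sum_at(x, idxs)
    s = zero(eltype(x))
    @inbounds for j in idxs   # safe only because idxs ⊆ eachindex(x)
        s += x[j]
    end
    return s
end

x = [-1.0, 2.0, -3.0, 4.0]
idxs = findall(>(0), x)   # [2, 4]
sum_at(x, idxs)           # 6.0
```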
@fastmath is too dangerous to use on more than a few lines at a time, and in many cases one can achieve similar performance by writing the code carefully (though sometimes this is tedious). But it can be very convenient to allow algebraic rewrites of intermediate-complexity equations that are not numerically sensitive and don't require Inf/NaN handling (@fastmath will usually discard such "frivolities," which can lead to serious mistakes).
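One way to see the Inf/NaN caveat (an illustrative sketch; the exact behavior of the fast version is compiler- and optimization-dependent): the standard NaN test x != x relies on IEEE semantics that @fastmath is allowed to assume away.

```julia
# x != x is true only for NaN under IEEE semantics.
plain_isnan(x) = x != x

# Under @fastmath the compiler may assume NaNs never occur, so this
# predicate can be folded to `false` even for NaN input; its result
# is effectively undefined.
fast_isnan(x) = @fastmath x != x

plain_isnan(NaN)   # true
# fast_isnan(NaN)  # may be true or false depending on optimization
```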
@simd permits just two of @fastmath's transformations (the reassociating ones: reassoc and contract) and also gives a little code nudge towards SIMD. It's much safer than @fastmath but somewhat less powerful.
In your case here, the speedup is due to @fastmath or @simd allowing the loop to be vectorized (accumulating multiple parallel sums and then combining them at the end). Normally the compiler would not permit this because the answer differs from the totally-serial sum (floating-point addition is non-associative). Note that if you were summing Ints, then the compiler would likely do this without any annotations at all (since Int is associative over + and *).
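A minimal demonstration of that non-associativity: the same three Float64 values summed under two different associations give different answers, which is exactly why the compiler won't reassociate a Float64 reduction without permission.

```julia
# 1e16 + 1.0 rounds back to 1e16 (1.0 is below the spacing of
# Float64 values near 1e16), so the association matters.
left  = foldl(+, [1.0, 1e16, -1e16])   # (1.0 + 1e16) - 1e16 == 0.0
right = foldl(+, [1e16, -1e16, 1.0])   # (1e16 - 1e16) + 1.0 == 1.0
```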
@inbounds has no impact here because your use of eachindex (good job) permits the compiler to prove that all accesses are inbounds. The compiler might even be smart enough to know, for example, that 1:length(x) are inbounds for x::Vector (but eachindex and related constructs are still to be preferred).
This is not always the case, unfortunately. Only in simple cases can the compiler elide bounds-checks with eachindex. E.g.:
julia> using LinearAlgebra
julia> function f(D)
           s = zero(eltype(D))
           for i in eachindex(D)
               s += D[i]
           end
           s
       end
f (generic function with 1 method)
julia> D = Diagonal(1:3);
julia> @code_llvm f(D)
The LLVM IR it prints shows that the bounds checks are still present.