Any idea why sumi seems faster? The precision appears similar. Also, sumi does check bounds. I also tried with integer arrays, with similar speed gains. (My Julia is 1.10.4).
I checked whether the number of elements in the vector matters.
using BenchmarkTools, Statistics

# arrays of 10^1 ... 10^7 elements
xs = [rand(10^i) for i in 1:7]
reltimes = zeros(7)
for i in 1:7
    x = xs[i]
    b1 = @benchmark sumi(j -> $x[j], 0.0, eachindex($x))
    b2 = @benchmark sum($x)
    reltimes[i] = median(b1.times) / median(b2.times)
end

using Plots
plot(1:7, [reltimes ones(7)])
The blue line is the median time taken by sumi relative to the median time taken by sum for array sizes from 10¹ to 10⁷.
sum is better for arrays of 10 elements. Most of the speed gains for sumi are for arrays of 10,000 to 100,000 elements. Does anyone see any downsides to using it instead of the built-in sum?
The built-in sum uses a pairwise reduction: it sums the input in two halves, recursively, until reaching a base size (usually <= 1024), where it adds serially like your code. The splitting is why it typically achieves less error on large inputs.
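For reference, here is a minimal sketch of the pairwise idea (the 1024 cutoff and the function name are illustrative; the real implementation lives in Base's mapreduce_impl):

function pairwise_sum(x, lo = firstindex(x), hi = lastindex(x))
    if hi - lo < 1024                      # small block: plain serial loop
        s = zero(eltype(x))
        @simd for i in lo:hi
            @inbounds s += x[i]
        end
        return s
    else                                   # large block: split in half and recurse
        mid = (lo + hi) >> 1
        return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid + 1, hi)
    end
end

A quick way to see the accuracy effect is to compare against a strict left-to-right fold on Float32 data:

x32 = rand(Float32, 10^7)
abs(sum(x32) - sum(Float64, x32))        # error of the pairwise sum vs. a Float64 reference
abs(foldl(+, x32) - sum(Float64, x32))   # error of a strict serial sum, typically much larger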
But that doesn't justify a 40% throughput reduction. The bottom-level serial loop is
@inbounds a1 = A[ifirst]
@inbounds a2 = A[ifirst + 1]
v = op(f(a1), f(a2))
@simd for i = ifirst + 2 : ilast
    @inbounds ai = A[i]
    v = op(v, f(ai))
end
(where f = identity and op = Base.add_sum, which is equivalent to + in the Float64 case and to +(Int(x), Int(y)) in the Int32 case). The whole point of having the serial base case is that it should yield a reasonable accuracy benefit at negligible cost, but that doesn't seem to be the case here.
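The widening behaviour is easy to confirm in the REPL (Base.add_sum is internal, so this is just an observation, not supported API):

julia> Base.add_sum(Int32(1), Int32(2))
3

julia> typeof(Base.add_sum(Int32(1), Int32(2)))
Int64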
This isn't very different from what your sumi does, except that maybe the pairwise subdivision ruins memory alignment or something. But I tried with 10240 elements and it wasn't faster, so that theory looks unlikely. Meanwhile,
function sumserial(x)
    s = zero(eltype(x))
    @simd for xi in x
        s += xi
    end
    return s
end
benchmarks competitively with your sumi. It seems there's something unnecessarily slow happening in mapreduce_impl(f, op, A::AbstractArrayOrBroadcasted, ifirst::Integer, ilast::Integer, blksize::Int).
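One way to poke at this is to call that internal method directly and compare the pairwise path against a forced serial base case (internal API, so the signature may change between versions; 10^8 is just "larger than the array", which makes the first call hit the serial branch immediately):

using BenchmarkTools
x = rand(10^4)
@btime Base.mapreduce_impl(identity, +, $x, 1, $(length(x)), 1024)   # normal pairwise path
@btime Base.mapreduce_impl(identity, +, $x, 1, $(length(x)), 10^8)   # serial base case only
@btime sumserial($x)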
I did some benchmarking and there appears to be a big slowdown for sum (aka mapreduce) specifically, somewhere in the ballpark of 6300 elements. That isn't anywhere near a power of 2 times the base-case size (the nearest candidates would be 4096 or 8192), so it doesn't seem related to the pairwise algorithm specifically. The memory access pattern of these algorithms should be identical, so cache size shouldn't matter (although alignment still could, but then it should matter at smaller sizes too).
function sumserial(x)
    s = zero(eltype(x))
    @simd for xi in x
        s += xi
    end
    return s
end
julia> for n in 6000:100:6500; x = fill(1.0, n); @show n; @btime sum($x); @btime sumserial($x); end
n = 6000
259.650 ns (0 allocations: 0 bytes)
233.776 ns (0 allocations: 0 bytes)
n = 6100
271.358 ns (0 allocations: 0 bytes)
234.475 ns (0 allocations: 0 bytes)
n = 6200
289.639 ns (0 allocations: 0 bytes)
247.117 ns (0 allocations: 0 bytes)
n = 6300
316.882 ns (0 allocations: 0 bytes)
254.792 ns (0 allocations: 0 bytes)
n = 6400
410.770 ns (0 allocations: 0 bytes)
245.812 ns (0 allocations: 0 bytes)
n = 6500
409.310 ns (0 allocations: 0 bytes)
255.896 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 22 × Intel(R) Core(TM) Ultra 7 165H
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 22 virtual cores)
Does anyone have an explanation? It seems like an issue should maybe be opened for this.
That's my recollection, too. Interesting that this doesn't reproduce on M3. I'm on an Intel chip. I edited my benchmark above to include the versioninfo.
I have no idea how it's even possible to outperform by that margin without multithreading.
So the block size is indeed 1024. This means that we will run 16 blocks of size 625 each, together with 15 "merge" operations. Each is a non-inlined function call, so we pay for the function call, stack spill/restore, possibly shadow-stack marking, etc.
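As a quick sanity check on that count (count_blocks is a made-up helper; the leaf condition approximates what Base's pairwise recursion does for a 10,000-element array):

# Count leaves and merges of the pairwise recursion, assuming it splits in half
# until a block has at most blksize elements (blksize = 1024 as in Base).
function count_blocks(n, blksize = 1024)
    n <= blksize && return (leaves = 1, merges = 0)
    left  = count_blocks(n >> 1, blksize)
    right = count_blocks(n - (n >> 1), blksize)
    return (leaves = left.leaves + right.leaves,
            merges = left.merges + right.merges + 1)
end

count_blocks(10_000)   # (leaves = 16, merges = 15)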
We also pay for the intro/outro of each SIMD block, i.e. for the elements that don't fit into our vector size at the end, plus possible alignment issues at the front.
We pay 15 ns per merge op plus block overhead; I don't think this can be meaningfully reduced. 250 ns is already eaten by 50 mispredicted branches (each about 20 cycles, i.e. 5 ns on my machine).
So I'd say that this overhead is not very surprising.
I think we should just use serial. Or use something appropriate like a leaf size of 1 << 12, sum serially over the leaves, and then handle the final leaf separately (such that most leaves have a fixed, compile-time-known size).
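A minimal sketch of that idea (the names, the 1 << 12 leaf size, and the final sum over the partials are all illustrative, not a proposal for Base's exact code):

const LEAF = 1 << 12

function sum_fixed_leaves(x::Vector{T}) where {T}
    nfull = length(x) ÷ LEAF
    partials = Vector{T}(undef, nfull + 1)
    for k in 1:nfull
        # full leaves have a fixed, compile-time-known trip count of LEAF
        off = (k - 1) * LEAF
        s = zero(T)
        @simd for i in 1:LEAF
            @inbounds s += x[off + i]
        end
        partials[k] = s
    end
    # final, possibly partial leaf
    s = zero(T)
    @simd for i in (nfull * LEAF + 1):length(x)
        @inbounds s += x[i]
    end
    partials[nfull + 1] = s
    return sum(partials)   # few terms, so the combine step is cheap
end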
I just tried it on an (older) Intel machine, and there doesn't seem to be much difference with older Julia versions, but my performance difference is much smaller than yours. So it may be an issue specific to newer Intel chips?
The counterpoint is that the performance is very close up to 6000 elements, which already requires several levels of recursion. I show benchmarks below for 5000-8000 elements, which all require exactly the same recursion depth. On v1.11, the performance gap grows from 1.1x to 1.8x across this range. But read on…
I had v1.9.4 installed, so I tried repeating the benchmark there. The plot thickens. On v1.9, sumserial was significantly faster at small sizes than it is on v1.11, while sum performs similarly in both versions. At larger sizes, the two functions are roughly the same speed in both versions. Benchmarks below:
julia> for n in 1000:1000:10000; x = fill(1.0, n); @show n; @btime sum($x); @btime sumserial($x); end
I wonder if there's some fishy business with the heterogeneous cores on my chip, for example that it's consistently running some calls on the performance cores and others on the efficiency cores. Not sure. It looks like my hardware has at least some effect, although the version-dependent speed is probably not down to the hardware.
using BenchmarkTools

# identical serial sums that start at offsets 1, 2, and 3, to probe alignment effects
function sum1(r)
    v = zero(eltype(r))
    @simd for i = 1:length(r)
        @inbounds ai = r[i]
        v = v + ai
    end
    v
end

function sum2(r)
    v = zero(eltype(r))
    @simd for i = 2:length(r)
        @inbounds ai = r[i]
        v = v + ai
    end
    v
end

function sum3(r)
    v = zero(eltype(r))
    @simd for i = 3:length(r)
        @inbounds ai = r[i]
        v = v + ai
    end
    v
end

function allsum(n)
    r = rand(n)
    @btime sum1($r)
    @btime sum2($r)
    @btime sum3($r)
end
julia> allsum(625)
45.625 ns (0 allocations: 0 bytes)
62.187 ns (0 allocations: 0 bytes)
49.271 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 16 × AMD Ryzen 7 2700X Eight-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, znver1)
Nothing is special about 625; it's the same pattern for other sizes, except that for even sizes sum3 becomes a bit faster than sum1.
That's weird. I thought the compiler used to be able to peel off a few iterations as needed to ensure that the remaining accesses are aligned, at least for a simple loop like this.
In any case, it should be fixable on our end by manually peeling off 8–32 bytes' worth of elements (ugh).
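For what it's worth, a rough sketch of what manual peeling could look like (the 64-byte alignment target and the function name are assumptions for illustration, not what Base does):

# Peel leading elements until the remaining data starts at a 64-byte boundary,
# then let @simd handle the rest.
function sum_peeled(x::Vector{Float64})
    isempty(x) && return 0.0
    addr = UInt(pointer(x))
    npeel = min(length(x), ((64 - addr % 64) % 64) ÷ sizeof(Float64))
    s = 0.0
    for i in 1:npeel
        @inbounds s += x[i]
    end
    @simd for i in npeel+1:length(x)
        @inbounds s += x[i]
    end
    return s
end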
I always wondered why it didn't choose "nicer" block sizes to begin with. Ideally something nicely recursively subdivisible, like pure powers of 2 times the block size (is that exactly what you suggest, modulo the extra +2 for the initialization? I don't know what your k was. Ideally we could avoid the annoying +2 for initialization; mightn't SIMD prefer something like +8 anyway? #53945 may ease all of this if init is available.). Or even just any multiple of some modest power of 2 (something in the 2^5 to 2^8 range) might be useful.