A fast sum. Any downsides?

I wrote a simple function to perform sums using indices.

function sumi(f, s, range)
	@simd for i in range
		s += f(i)
	end
	return s
end

I was curious to measure the performance hit of using sumi for arrays instead of sum.

using BenchmarkTools;
x = rand(10000);
julia> @btime sumi(i -> $x[i], 0., eachindex($x))
  360.225 ns (0 allocations: 0 bytes)
5082.599473964153

julia> @btime sum($x)
  633.024 ns (0 allocations: 0 bytes)
5082.599473964152

Any idea why sumi seems faster? The precision appears similar. Also, note that sumi still does bounds checking (there is no @inbounds). I also tried with integer arrays, with similar speed gains. (My Julia version is 1.10.4.)

I checked whether the number of elements in the vector matters.

xs = [rand(10^(i)) for i in 1:7]
reltimes = zeros(7)
for i in 1:7
    x = xs[i]
    b1 = @benchmark sumi(j -> $x[j], 0., eachindex($x))
    b2 = @benchmark sum($x)
    reltimes[i] = median(b1.times) / median(b2.times)
end
using Plots
plot(1:7, [reltimes ones(7)])

The blue line is the median time taken by sumi relative to the median time taken by sum, for array sizes from 10¹ to 10⁷.

sum is better for arrays of 10 elements. Most of the speed gains for sumi are for arrays of 10,000 to 100,000 elements. Does anyone see any downsides to using it instead of the built-in sum?

Depends on how much you stress it.

julia> x = fill(1.2f0, 10000000);

julia> sum(x)
1.2000002f7

julia> sumi(i -> x[i], 0.0f0, eachindex(x))
1.1977598f7
2 Likes

Thanks, I'll stick with sum as much as possible. Any idea what may be going on with Integers?

julia> x = Int32.(12345ones(100000));

julia> @btime sumi(i -> $x[i], zero(Int32), eachindex($x))
  1.931 μs (0 allocations: 0 bytes)
1234500000

julia> @btime sum($x)
  4.788 μs (0 allocations: 0 bytes)
1234500000

julia> x = Int64.(12345ones(100000));

julia> @btime sumi(i -> $x[i], zero(Int64), eachindex($x))
  3.924 μs (0 allocations: 0 bytes)
1234500000

julia> @btime sum($x)
  5.774 μs (0 allocations: 0 bytes)
1234500000

The built-in sum uses a pairwise reduction: it sums the input in two halves, recursively, until reaching a base size (usually <= 1024), where it adds serially like your code does. The splitting is why it typically achieves less error on large inputs.
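For illustration, here is a minimal sketch of the pairwise idea (not Base's actual implementation; the function name and the hard-coded 1024 threshold are just illustrative):

function pairwise_sum(x, lo = firstindex(x), hi = lastindex(x))
    if hi - lo < 1024
        # small block: plain serial SIMD loop, like sumi/sumserial
        s = zero(eltype(x))
        @simd for i in lo:hi
            @inbounds s += x[i]
        end
        return s
    else
        # large block: split in half, recurse, and combine
        mid = (lo + hi) >>> 1
        return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid + 1, hi)
    end
end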

But that doesn't justify a 40% throughput reduction. The bottom-level serial loop is

        @inbounds a1 = A[ifirst]
        @inbounds a2 = A[ifirst+1]
        v = op(f(a1), f(a2))
        @simd for i = ifirst + 2 : ilast
            @inbounds ai = A[i]
            v = op(v, f(ai))
        end

(where f = identity and op = Base.add_sum, which is equivalent to + in the float case and Int(x) + Int(y) in the Int32 case). The whole point of having the serial base case is that the pairwise splitting should yield its accuracy benefits at negligible cost, but that doesn't seem to be the case here.
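As a quick illustration of that widening (my own check, not from the original post; Base.add_sum is internal, and the results assume a 64-bit machine):

typeof(Base.add_sum(Int32(1), Int32(2)))  # Int64: small integers are widened before adding
typeof(Base.add_sum(1.0, 2.0))            # Float64: the generic fallback is just +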

This isn't very different from what your sumi does, except that maybe the pairwise subdivision ruins memory alignment or something. But I tried with 10240 elements and it wasn't faster, so that theory looks unlikely. Meanwhile,

function sumserial(x)
	s = zero(eltype(x))
	@simd for xi in x
		s += xi
	end
	return s
end

benchmarks competitively with your sumi. It seems there's something unnecessarily slow happening in mapreduce_impl(f, op, A::AbstractArrayOrBroadcasted, ifirst::Integer, ilast::Integer, blksize::Int).

1 Like

I did some benchmarking, and there appears to be a big slowdown for sum (aka mapreduce) specifically, somewhere in the ballpark of 6300 elements. That isn't anywhere close to a power of 2 over the base-case size, so it doesn't seem related to the pairwise algorithm specifically. The memory access pattern of these algorithms should be identical, so cache size shouldn't matter (although alignment still could, but that should matter at smaller sizes too).

function sumserial(x)
	s = zero(eltype(x))
	@simd for xi in x
		s += xi
	end
	return s
end
julia> for n in 6000:100:6500; x = fill(1.0, n); @show n; @btime sum($x); @btime sumserial($x); end
n = 6000
  259.650 ns (0 allocations: 0 bytes)
  233.776 ns (0 allocations: 0 bytes)
n = 6100
  271.358 ns (0 allocations: 0 bytes)
  234.475 ns (0 allocations: 0 bytes)
n = 6200
  289.639 ns (0 allocations: 0 bytes)
  247.117 ns (0 allocations: 0 bytes)
n = 6300
  316.882 ns (0 allocations: 0 bytes)
  254.792 ns (0 allocations: 0 bytes)
n = 6400
  410.770 ns (0 allocations: 0 bytes)
  245.812 ns (0 allocations: 0 bytes)
n = 6500
  409.310 ns (0 allocations: 0 bytes)
  255.896 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 22 × Intel(R) Core(TM) Ultra 7 165H
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 22 virtual cores)

Does anyone have an explanation? It seems that there should maybe be an issue opened for this.

4 Likes

Is this a recent regression? If I remember correctly, sum used to have very similar performance to a simple @simd loop.

5 Likes

I don't see this on my Apple M3 laptop. In fact, on my laptop, the built-in sum function is generally faster than sumserial:

julia> x = fill(1.0, 1000); @btime sum($x); @btime sumserial($x);
  71.551 ns (0 allocations: 0 bytes)
  68.763 ns (0 allocations: 0 bytes)

julia> x = fill(1.0, 2000); @btime sum($x); @btime sumserial($x);
  140.578 ns (0 allocations: 0 bytes)
  158.282 ns (0 allocations: 0 bytes)

julia> x = fill(1.0, 4000); @btime sum($x); @btime sumserial($x);
  278.578 ns (0 allocations: 0 bytes)
  339.901 ns (0 allocations: 0 bytes)

julia> x = fill(1.0, 8000); @btime sum($x); @btime sumserial($x);
  557.043 ns (0 allocations: 0 bytes)
  701.947 ns (0 allocations: 0 bytes)

julia> x = fill(1.0, 16000); @btime sum($x); @btime sumserial($x);
  1.113 μs (0 allocations: 0 bytes)
  1.354 μs (0 allocations: 0 bytes)
4 Likes

That's my recollection, too. Interesting that this doesn't reproduce on the M3. I'm on an Intel chip. I edited my benchmark above to include the versioninfo.

I have no idea how it's even possible to outperform by that margin without multithreading.

I can reproduce:

julia> r=rand(10_000); @btime sumserial($r); @btime sum($r);
  546.872 ns (0 allocations: 0 bytes)
  791.612 ns (0 allocations: 0 bytes)

We have to account for ~250 wasted nanoseconds.
Now, let's look at @code_llvm sum(r). We see

%44 = call double @j_mapreduce_impl_11171(ptr nonnull %"a::Array", i64 signext 1, i64 signext %.size.sroa.0.0.copyload, i64 signext 1024)

So the block size is indeed 1024. This means we will run 16 blocks of 625 elements each (10000 is halved four times: 10000 → 5000 → 2500 → 1250 → 625, giving 2^4 = 16 leaves), together with 15 "merge" operations. Each is a non-inlined function call, so we pay for the function call, stack spill/restore, possibly shadow-stack marking, etc.

We also pay for the intro/outro of each SIMD block, i.e. for the elements that don't fit into our vector width at the end, and possibly for alignment issues at the front.

We pay ~15 ns per merge op plus block overhead – I don't think this can be meaningfully reduced. 250 ns is already eaten by 50 mispredicted branches (20 cycles, i.e. 5 ns, each on my machine).

So I'd say that this overhead is not very surprising :frowning:

I think we should just use the serial loop. Or use something appropriate like a leaf size of 1<<12, sum serially over the leaves, and then handle the final leaf (such that most leaves have a fixed, compile-time-known size).
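Something along these lines, perhaps (a rough sketch of the idea only, not a proposal for Base: the name, the 1<<12 leaf size, and the restriction to a 1-based Vector are mine, it does not actually make the leaf size compile-time known, and it ignores the pairwise accuracy question):

function sum_leaves(x::Vector; leaf = 1 << 12)
    n = length(x)
    nfull = n ÷ leaf
    s = zero(eltype(x))
    for b in 1:nfull
        # full leaves, all of the same fixed size
        t = zero(eltype(x))
        off = (b - 1) * leaf
        @simd for i in 1:leaf
            @inbounds t += x[off + i]
        end
        s += t
    end
    # the final, partial leaf
    t = zero(eltype(x))
    @simd for i in nfull * leaf + 1:n
        @inbounds t += x[i]
    end
    return s + t
end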

PS. Intel machine, AVX2 no AVX512

2 Likes

I just tried it on an (older) Intel machine, and there doesn't seem to be much difference with older Julia versions, but my performance difference is much smaller than yours. So it may be an issue specific to newer Intel chips?

In particular, I get:

julia> x = fill(1.0, 6500); @btime sum($x); @btime sumserial($x);
  541.265 ns (0 allocations: 0 bytes)
  433.241 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin24.0.0)
  CPU: 6 × Intel(R) Core(TM) i5-8500B CPU @ 3.00GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)

i.e. about 25% slower for sum than for sumserial at n = 6500, vs. a 60% slowdown on your (newer) CPU. With Julia 1.0 on the same machine, I get:

julia> x = fill(1.0, 6500); @btime sum($x); @btime sumserial($x);
  527.424 ns (0 allocations: 0 bytes)
  422.462 ns (0 allocations: 0 bytes)

or about 2–3% faster (this seems repeatable), so there was a very slight performance regression.

1 Like

The counterpoint is that the performance is very close up to 6000 elements, which already requires several levels of recursion. I show benchmarks below for 5000–8000 elements, which all require exactly the same recursion depth. On v1.11, the performance gap goes from 1.1x to 1.8x across this range. But read on…
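A quick way to check the recursion-depth claim (this helper is mine, approximating Base's halving with the 1024 base size mentioned above):

depth(n, blk = 1024) = n <= blk ? 0 : 1 + depth(cld(n, 2), blk)
depth.(5000:1000:8000)  # [3, 3, 3, 3]: all four sizes need three halvings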

I had v1.9.4 installed, so I tried repeating the benchmark on that. The plot thickens. On v1.9, sumserial was significantly faster at the small sizes than on v1.11, while sum performs similarly on both versions; at the larger sizes, both functions are roughly equally fast on both versions. Benchmarks below:

julia> for n in 1000:1000:10000; x = fill(1.0, n); @show n; @btime sum($x); @btime sumserial($x); end

v1.9.4:
n=5000: sum: 203ns, sumserial: 141ns
n=6000: sum: 265ns, sumserial: 169ns
n=7000: sum: 489ns, sumserial: 301ns
n=8000: sum: 568ns, sumserial: 340ns
v1.11.2:
n=5000: sum: 222ns, sumserial: 199ns
n=6000: sum: 256ns, sumserial: 229ns
n=7000: sum: 489ns, sumserial: 277ns
n=8000: sum: 610ns, sumserial: 340ns

I wonder if there's some fishy business with the heterogeneous cores on my chip, for example that it consistently runs some calls on the performance cores and others on the efficiency cores. Not sure. It looks like my hardware has at least some effect, although the version-dependent speed is probably not down to the hardware.

1 Like

I think the issue can be seen with the following:

julia> function badSum(r) # similar to reduce.jl
           @inbounds a1 = r[1]
           @inbounds a2 = r[2]
           v = a1 + a2
           @simd for i = 3:length(r)
               @inbounds ai = r[i]
               v = v + ai
           end
           v
       end

julia> r0=rand(625); @btime sum($r0); @btime sumserial($r0); @btime badSum($r0);
  45.074 ns (0 allocations: 0 bytes)
  30.134 ns (0 allocations: 0 bytes)
  43.148 ns (0 allocations: 0 bytes)

Since the 10,000-element example above ran as 16 blocks of 625 on my machine, this almost perfectly reproduces the discrepancy (16 × ~13 ns ≈ 210 ns of the ~245 ns gap).

It appears that peeling off the first couple of iterations before the loop somehow makes LLVM very unhappy.

5 Likes

It's a surprising pattern on AMD too.

function sum1(r) 
    v = zero(eltype(r))
    @simd for i=1:length(r)
        @inbounds ai = r[i]
        v = v + ai
    end
    v 
end

function sum2(r) 
    v = zero(eltype(r))
    @simd for i=2:length(r)
        @inbounds ai = r[i]
        v = v + ai
    end
    v 
end

function sum3(r) 
    v = zero(eltype(r))
    @simd for i=3:length(r)
        @inbounds ai = r[i]
        v = v + ai
    end
    v 
end

function allsum(n)
    r = rand(n)
    @btime sum1($r)
    @btime sum2($r)
    @btime sum3($r)
end

julia> allsum(625)
  45.625 ns (0 allocations: 0 bytes)
  62.187 ns (0 allocations: 0 bytes)
  49.271 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 2700X Eight-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver1)

Nothing special about 625; it's the same pattern for other sizes, except that for even sizes sum3 becomes a bit faster than sum1.

Ok, we're stupid, it appears to be alignment of the SIMD loop after all: (disregard, see the discussion below; it's probably not alignment)

julia> function xsum(n)
           r = rand(n)
           for i = 1:32
               @show Int(convert(UInt, pointer(r)) & 0xff)
               @btime sum1($r)
               popfirst!(r)
           end
       end
julia> xsum(625)
Int(convert(UInt, pointer(r)) & 0xff) = 64
  30.943 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 72
  33.419 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 80
  41.223 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 88
  40.421 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 96
  37.823 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 104
  38.421 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 112
  37.622 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 120
  36.823 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 128
  34.482 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 136
  35.611 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 144
  34.821 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 152
  33.723 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 160
  31.423 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 168
  32.223 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 176
  33.019 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 184
  32.223 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 192
  28.335 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 200
  32.618 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 208
  40.421 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 216
  39.621 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 224
  37.022 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 232
  37.622 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 240
  36.822 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 248
  36.022 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 0
  33.823 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 8
  34.625 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 16
  34.021 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 24
  32.922 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 32
  30.623 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 40
  31.424 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 48
  32.622 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 56
  31.020 ns (0 allocations: 0 bytes)
2 Likes

That's weird, I thought the compiler used to be able to peel off a few iterations as needed to ensure that the remaining accesses are aligned, at least for a simple loop like this.

In any case, it should be fixable on our end by manual peeling of 8–32 bytes' worth of elements (ugh).
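For concreteness, manual peeling might look roughly like this (my sketch, not a tested fix; it assumes a Vector{Float64}, whose data is always at least 8-byte aligned, and peels scalar iterations until the SIMD loop starts on a 64-byte boundary):

function sum_peeled(x::Vector{Float64})
    p = UInt(pointer(x))
    # leading elements handled with scalar code so the SIMD loop starts 64-byte aligned
    npeel = min(length(x), Int((64 - p % 64) % 64) ÷ sizeof(Float64))
    s = 0.0
    for i in 1:npeel
        @inbounds s += x[i]
    end
    @simd for i in npeel+1:length(x)
        @inbounds s += x[i]
    end
    return s
end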

3 Likes

This definitely feels like something LLVM should be doing for us.

4 Likes

Sigh. I guessed too fast and wrong, apparently. After benchmarking

julia> function xsum(n)
           r = rand(n)
           for i = 1:32
               @show Int(convert(UInt, pointer(r)) & 0xff)
               @btime sum1($r)
               pop!(r)
           end
       end

and reviewing both the @code_llvm and @code_native output:

  1. The overhead is O(1)
  2. There is no aligning loop header.
  3. The critical part appears to be the loop footer, i.e. the scalar part
  4. On my machine, the unroll/loop thing boils down to 16 elements per loop iteration (4-wide AVX2 vectors times 4x unrolling).
  5. behold:
julia> divrem(625, 16)
(39, 1)
julia> allsum(625)
  30.846 ns (0 allocations: 0 bytes)
  33.416 ns (0 allocations: 0 bytes)
  41.019 ns (0 allocations: 0 bytes)
315.9750770641492

julia> allsum(625+8)
  35.420 ns (0 allocations: 0 bytes)
  36.321 ns (0 allocations: 0 bytes)
  35.622 ns (0 allocations: 0 bytes)
315.7261902254435

julia> allsum(625+15)
  29.128 ns (0 allocations: 0 bytes)
  41.919 ns (0 allocations: 0 bytes)
  41.119 ns (0 allocations: 0 bytes)

So what does this boil down to?

I'd say our sum / mapreduce_impl should split into blocks of size 2^k*n+2 instead of recursively halving blocks, for appropriate k.

1 Like

LLVM's vectorizer is really not good at this sort of thing.

1 Like

I always wondered why it didn't choose "nicer" block sizes to begin with. Ideally something nicely recursively subdivisible, like pure powers of 2 times the block size (is that exactly what you suggest, modulo the extra +2 for the initialization? I don't know what your k was. Ideally we can avoid the annoying +2 for initialization; mightn't SIMD prefer something like +8 anyway? #53945 may ease all of this if init is available). Or even just any multiple of some modest power of 2 (something in the 2^5 to 2^8 range) might be useful.

1 Like