A fast sum. Any downsides?

I wrote a simple function to perform sums using indices.

function sumi(f, s, range)
	@simd for i in range
		s += f(i)
	end
	return s
end

I was curious to measure the performance hit of using sumi for arrays instead of sum.

using BenchmarkTools;
x = rand(10000);
julia> @btime sumi(i -> $x[i], 0., eachindex($x))
  360.225 ns (0 allocations: 0 bytes)
5082.599473964153

julia> @btime sum($x)
  633.024 ns (0 allocations: 0 bytes)
5082.599473964152

Any idea why sumi seems faster? The precision appears similar. Also, note that sumi still does bounds checking (there is no @inbounds). I also tried with integer arrays, with similar speed gains. (My Julia version is 1.10.4.)

I checked whether the number of elements in the vector matters.

xs = [rand(10^(i)) for i in 1:7]
reltimes = zeros(7)
for i in 1:7
    x = xs[i]
    b1 = @benchmark sumi(j -> $x[j], 0., eachindex($x))
    b2 = @benchmark sum($x)
    reltimes[i] = median(b1.times) / median(b2.times)
end
using Plots
plot(1:7, [reltimes ones(7)])

The blue line is the median time taken by sumi relative to the median time taken by sum, for array sizes from 10¹ to 10⁷.

sum is better for arrays of 10 elements. Most of the speed gains for sumi are for arrays of 10,000 to 100,000 elements. Does anyone see any downsides to using it instead of the built-in sum?

Depends on how much you stress it.

julia> x = fill(1.2f0, 10000000);

julia> sum(x)
1.2000002f7

julia> sumi(i -> x[i], 0.0f0, eachindex(x))
1.1977598f7
2 Likes

Thanks, I'll stick with sum as much as possible. Any idea what may be going on with Integers?

julia> x = Int32.(12345ones(100000));

julia> @btime sumi(i -> $x[i], zero(Int32), eachindex($x))
  1.931 μs (0 allocations: 0 bytes)
1234500000

julia> @btime sum($x)
  4.788 μs (0 allocations: 0 bytes)
1234500000

julia> x = Int64.(12345ones(100000));

julia> @btime sumi(i -> $x[i], zero(Int64), eachindex($x))
  3.924 μs (0 allocations: 0 bytes)
1234500000

julia> @btime sum($x)
  5.774 μs (0 allocations: 0 bytes)
1234500000

The built-in sum uses a pairwise reduction: it sums the input in two halves, recursively, until reaching a base size (usually <= 1024), where it adds serially like your code does. The splitting is why it typically achieves less error on large inputs.
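For illustration, here is a minimal sketch of the pairwise idea (not Base's actual implementation; the function name and the hard-coded 1024 threshold are just illustrative):

function pairwise_sum(x, lo = firstindex(x), hi = lastindex(x))
    if hi - lo < 1024
        # small block: plain serial SIMD loop, like sumi/sumserial
        s = zero(eltype(x))
        @simd for i in lo:hi
            @inbounds s += x[i]
        end
        return s
    else
        # large block: split in half, recurse, and combine
        mid = (lo + hi) >>> 1
        return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid + 1, hi)
    end
end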

But that doesn't justify a 40% throughput reduction. The bottom-level serial loop is

        @inbounds a1 = A[ifirst]
        @inbounds a2 = A[ifirst+1]
        v = op(f(a1), f(a2))
        @simd for i = ifirst + 2 : ilast
            @inbounds ai = A[i]
            v = op(v, f(ai))
        end

(where f = identity and op = Base.add_sum, which is equivalent to + in the float case and Int(x) + Int(y) in the Int32 case). The whole point of having the serial base case is that the pairwise splitting should yield its accuracy benefits at negligible cost, but that doesn't seem to be the case here.
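As a quick illustration of that widening (my own check, not from the original post; Base.add_sum is internal, and the results assume a 64-bit machine):

typeof(Base.add_sum(Int32(1), Int32(2)))  # Int64: small integers are widened before adding
typeof(Base.add_sum(1.0, 2.0))            # Float64: the generic fallback is just +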

This isn't very different from what your sumi does, except that maybe the pairwise subdivision ruins memory alignment or something. But I tried with 10240 elements and it wasn't faster, so that theory looks unlikely. Meanwhile,

function sumserial(x)
	s = zero(eltype(x))
	@simd for xi in x
		s += xi
	end
	return s
end

benchmarks competitively with your sumi. It seems there's something unnecessarily slow happening in mapreduce_impl(f, op, A::AbstractArrayOrBroadcasted, ifirst::Integer, ilast::Integer, blksize::Int).

1 Like

I did some benchmarking, and there appears to be a big slowdown for sum (aka mapreduce) specifically, somewhere in the ballpark of 6300 elements. That isn't anywhere close to a power of 2 over the base-case size, so it doesn't seem related to the pairwise algorithm specifically. The memory access pattern of these algorithms should be identical, so cache size shouldn't matter (although alignment still could, but that should matter at smaller sizes too).

function sumserial(x)
	s = zero(eltype(x))
	@simd for xi in x
		s += xi
	end
	return s
end
julia> for n in 6000:100:6500; x = fill(1.0, n); @show n; @btime sum($x); @btime sumserial($x); end
n = 6000
  259.650 ns (0 allocations: 0 bytes)
  233.776 ns (0 allocations: 0 bytes)
n = 6100
  271.358 ns (0 allocations: 0 bytes)
  234.475 ns (0 allocations: 0 bytes)
n = 6200
  289.639 ns (0 allocations: 0 bytes)
  247.117 ns (0 allocations: 0 bytes)
n = 6300
  316.882 ns (0 allocations: 0 bytes)
  254.792 ns (0 allocations: 0 bytes)
n = 6400
  410.770 ns (0 allocations: 0 bytes)
  245.812 ns (0 allocations: 0 bytes)
n = 6500
  409.310 ns (0 allocations: 0 bytes)
  255.896 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 22 × Intel(R) Core(TM) Ultra 7 165H
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 22 virtual cores)

Does anyone have an explanation? It seems that there should maybe be an issue opened for this.

4 Likes

Is this a recent regression? If I remember correctly, sum used to have very similar performance to a simple @simd loop.

5 Likes

I don't see this on my Apple M3 laptop. In fact, on my laptop, the built-in sum function is generally faster than sumserial:

julia> x = fill(1.0, 1000); @btime sum($x); @btime sumserial($x);
  71.551 ns (0 allocations: 0 bytes)
  68.763 ns (0 allocations: 0 bytes)

julia> x = fill(1.0, 2000); @btime sum($x); @btime sumserial($x);
  140.578 ns (0 allocations: 0 bytes)
  158.282 ns (0 allocations: 0 bytes)

julia> x = fill(1.0, 4000); @btime sum($x); @btime sumserial($x);
  278.578 ns (0 allocations: 0 bytes)
  339.901 ns (0 allocations: 0 bytes)

julia> x = fill(1.0, 8000); @btime sum($x); @btime sumserial($x);
  557.043 ns (0 allocations: 0 bytes)
  701.947 ns (0 allocations: 0 bytes)

julia> x = fill(1.0, 16000); @btime sum($x); @btime sumserial($x);
  1.113 μs (0 allocations: 0 bytes)
  1.354 μs (0 allocations: 0 bytes)
4 Likes

That's my recollection, too. Interesting that this doesn't reproduce on the M3. I'm on an Intel chip. I edited my benchmark above to include the versioninfo.

I have no idea how it's even possible to outperform by that margin without multithreading.

I can reproduce:

julia> r=rand(10_000); @btime sumserial($r); @btime sum($r);
  546.872 ns (0 allocations: 0 bytes)
  791.612 ns (0 allocations: 0 bytes)

We have to account for ~250 wasted nanoseconds.
Now, let's look at @code_llvm sum(r). We see

%44 = call double @j_mapreduce_impl_11171(ptr nonnull %"a::Array", i64 signext 1, i64 signext %.size.sroa.0.0.copyload, i64 signext 1024)

So the block size is indeed 1024. This means we will run 16 blocks of 625 elements each (10000 is halved four times: 10000 → 5000 → 2500 → 1250 → 625, giving 2^4 = 16 leaves), together with 15 "merge" operations. Each is a non-inlined function call, so we pay for the function call, stack spill/restore, possibly shadow-stack marking, etc.

We also pay for the intro/outro of each SIMD block, i.e. for the elements that don't fit into our vector width at the end, and possibly for alignment issues at the front.

We pay ~15 ns per merge op plus block overhead – I don't think this can be meaningfully reduced. 250 ns is already eaten by 50 mispredicted branches (20 cycles, i.e. 5 ns, each on my machine).

So I'd say that this overhead is not very surprising :frowning:

I think we should just use the serial loop. Or use something appropriate like a leaf size of 1<<12, sum serially over the leaves, and then handle the final leaf (such that most leaves have a fixed, compile-time-known size).
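Something along these lines, perhaps (a rough sketch of the idea only, not a proposal for Base: the name, the 1<<12 leaf size, and the restriction to a 1-based Vector are mine, it does not actually make the leaf size compile-time known, and it ignores the pairwise accuracy question):

function sum_leaves(x::Vector; leaf = 1 << 12)
    n = length(x)
    nfull = n ÷ leaf
    s = zero(eltype(x))
    for b in 1:nfull
        # full leaves, all of the same fixed size
        t = zero(eltype(x))
        off = (b - 1) * leaf
        @simd for i in 1:leaf
            @inbounds t += x[off + i]
        end
        s += t
    end
    # the final, partial leaf
    t = zero(eltype(x))
    @simd for i in nfull * leaf + 1:n
        @inbounds t += x[i]
    end
    return s + t
end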

PS. Intel machine, AVX2 no AVX512

2 Likes

I just tried it on an (older) Intel machine, and there doesn't seem to be much difference with older Julia versions, but my performance difference is much smaller than yours. So it may be an issue specific to newer Intel chips?

In particular, I get:

julia> x = fill(1.0, 6500); @btime sum($x); @btime sumserial($x);
  541.265 ns (0 allocations: 0 bytes)
  433.241 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin24.0.0)
  CPU: 6 × Intel(R) Core(TM) i5-8500B CPU @ 3.00GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)

i.e. about 25% slower for sum than for sumserial at n = 6500, vs. a 60% slowdown on your (newer) CPU. With Julia 1.0 on the same machine, I get:

julia> x = fill(1.0, 6500); @btime sum($x); @btime sumserial($x);
  527.424 ns (0 allocations: 0 bytes)
  422.462 ns (0 allocations: 0 bytes)

or about 2–3% faster (this seems repeatable), so there was a very slight performance regression.

1 Like

The counterpoint is that the performance is very close up to 6000 elements, which already requires several levels of recursion. I show benchmarks below for 5000–8000 elements, which all require exactly the same recursion depth. On v1.11, the performance gap goes from 1.1x to 1.8x across this range. But read on…
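A quick way to check the recursion-depth claim (this helper is mine, approximating Base's halving with the 1024 base size mentioned above):

depth(n, blk = 1024) = n <= blk ? 0 : 1 + depth(cld(n, 2), blk)
depth.(5000:1000:8000)  # [3, 3, 3, 3]: all four sizes need three halvings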

I had v1.9.4 installed, so I tried repeating the benchmark on that. The plot thickens. On v1.9, sumserial was significantly faster at the small sizes than on v1.11, while sum performs similarly on both versions; at the larger sizes, both functions are roughly equally fast on both versions. Benchmarks below:

julia> for n in 1000:1000:10000; x = fill(1.0, n); @show n; @btime sum($x); @btime sumserial($x); end

v1.9.4:
n=5000: sum: 203ns, sumserial: 141ns
n=6000: sum: 265ns, sumserial: 169ns
n=7000: sum: 489ns, sumserial: 301ns
n=8000: sum: 568ns, sumserial: 340ns
v1.11.2:
n=5000: sum: 222ns, sumserial: 199ns
n=6000: sum: 256ns, sumserial: 229ns
n=7000: sum: 489ns, sumserial: 277ns
n=8000: sum: 610ns, sumserial: 340ns

I wonder if there's some fishy business with the heterogeneous cores on my chip, for example that it consistently runs some calls on the performance cores and others on the efficiency cores. Not sure. It looks like my hardware has at least some effect, although the version-dependent speed is probably not down to the hardware.

1 Like

I think the issue can be seen with the following:

julia> function badSum(r) # similar to reduce.jl
           @inbounds a1 = r[1]
           @inbounds a2 = r[2]
           v = a1 + a2
           @simd for i = 3:length(r)
               @inbounds ai = r[i]
               v = v + ai
           end
           v
       end

julia> r0=rand(625); @btime sum($r0); @btime sumserial($r0); @btime badSum($r0);
  45.074 ns (0 allocations: 0 bytes)
  30.134 ns (0 allocations: 0 bytes)
  43.148 ns (0 allocations: 0 bytes)

Since the 10,000-element example above ran as 16 blocks of 625 on my machine, this almost perfectly reproduces the discrepancy (16 × ~13 ns ≈ 210 ns of the ~245 ns gap).

It appears that peeling off the first couple of iterations before the loop somehow makes LLVM very unhappy.

5 Likes

It's a surprising pattern on AMD too.

function sum1(r) 
    v = zero(eltype(r))
    @simd for i=1:length(r)
        @inbounds ai = r[i]
        v = v + ai
    end
    v 
end

function sum2(r) 
    v = zero(eltype(r))
    @simd for i=2:length(r)
        @inbounds ai = r[i]
        v = v + ai
    end
    v 
end

function sum3(r) 
    v = zero(eltype(r))
    @simd for i=3:length(r)
        @inbounds ai = r[i]
        v = v + ai
    end
    v 
end

function allsum(n)
    r = rand(n)
    @btime sum1($r)
    @btime sum2($r)
    @btime sum3($r)
end

julia> allsum(625)
  45.625 ns (0 allocations: 0 bytes)
  62.187 ns (0 allocations: 0 bytes)
  49.271 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.11.2
Commit 5e9a32e7af2 (2024-12-01 20:02 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 2700X Eight-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver1)

Nothing special about 625; it's the same pattern for other sizes, except that for even sizes sum3 becomes a bit faster than sum1.

Ok, we're stupid, it appears to be alignment of the SIMD loop after all: (disregard, see the discussion below; it's probably not alignment)

julia> function xsum(n)
           r = rand(n)
           for i = 1:32
               @show Int(convert(UInt, pointer(r)) & 0xff)
               @btime sum1($r)
               popfirst!(r)
           end
       end
julia> xsum(625)
Int(convert(UInt, pointer(r)) & 0xff) = 64
  30.943 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 72
  33.419 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 80
  41.223 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 88
  40.421 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 96
  37.823 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 104
  38.421 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 112
  37.622 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 120
  36.823 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 128
  34.482 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 136
  35.611 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 144
  34.821 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 152
  33.723 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 160
  31.423 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 168
  32.223 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 176
  33.019 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 184
  32.223 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 192
  28.335 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 200
  32.618 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 208
  40.421 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 216
  39.621 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 224
  37.022 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 232
  37.622 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 240
  36.822 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 248
  36.022 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 0
  33.823 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 8
  34.625 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 16
  34.021 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 24
  32.922 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 32
  30.623 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 40
  31.424 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 48
  32.622 ns (0 allocations: 0 bytes)
Int(convert(UInt, pointer(r)) & 0xff) = 56
  31.020 ns (0 allocations: 0 bytes)
2 Likes

That's weird, I thought the compiler used to be able to peel off a few iterations as needed to ensure that the remaining accesses are aligned, at least for a simple loop like this.

In any case, it should be fixable on our end by manual peeling of 8–32 bytes' worth of elements (ugh).
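For concreteness, manual peeling might look roughly like this (my sketch, not a tested fix; it assumes a Vector{Float64}, whose data is always at least 8-byte aligned, and peels scalar iterations until the SIMD loop starts on a 64-byte boundary):

function sum_peeled(x::Vector{Float64})
    p = UInt(pointer(x))
    # leading elements handled with scalar code so the SIMD loop starts 64-byte aligned
    npeel = min(length(x), Int((64 - p % 64) % 64) ÷ sizeof(Float64))
    s = 0.0
    for i in 1:npeel
        @inbounds s += x[i]
    end
    @simd for i in npeel+1:length(x)
        @inbounds s += x[i]
    end
    return s
end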

3 Likes

This definitely feels like something LLVM should be doing for us.

4 Likes

Sigh. I guessed too fast and wrong, apparently. After benchmarking

julia> function xsum(n)
           r = rand(n)
           for i = 1:32
               @show Int(convert(UInt, pointer(r)) & 0xff)
               @btime sum1($r)
               pop!(r)
           end
       end

and reviewing both the @code_llvm and @code_native output:

  1. The overhead is O(1)
  2. There is no aligning loop header.
  3. The critical part appears to be the loop footer, i.e. the scalar part
  4. On my machine, the unroll/loop thing boils down to 16 elements per loop iteration (4-wide AVX2 vectors times 4x unrolling).
  5. behold:
julia> divrem(625, 16)
(39, 1)
julia> allsum(625)
  30.846 ns (0 allocations: 0 bytes)
  33.416 ns (0 allocations: 0 bytes)
  41.019 ns (0 allocations: 0 bytes)
315.9750770641492

julia> allsum(625+8)
  35.420 ns (0 allocations: 0 bytes)
  36.321 ns (0 allocations: 0 bytes)
  35.622 ns (0 allocations: 0 bytes)
315.7261902254435

julia> allsum(625+15)
  29.128 ns (0 allocations: 0 bytes)
  41.919 ns (0 allocations: 0 bytes)
  41.119 ns (0 allocations: 0 bytes)

So what does this boil down to?

I'd say our sum / mapreduce_impl should split into blocks of size 2^k*n+2 instead of recursively halving blocks, for appropriate k.

1 Like

LLVM's vectorizer is really not good at this sort of thing.

1 Like

I always wondered why it didn't choose "nicer" block sizes to begin with. Ideally something nicely recursively subdivisible, like pure powers of 2 times the block size (is that exactly what you suggest, modulo the extra +2 for the initialization? I don't know what your k was. Ideally we can avoid the annoying +2 for initialization; mightn't SIMD prefer something like +8 anyway? #53945 may ease all of this if init is available). Or even just any multiple of some modest power of 2 (something in the 2^5 to 2^8 range) might be useful.

1 Like