Why is the function sum1 faster than the builtin sum?

``````
julia> using BenchmarkTools

julia> function sum1(num)
           total = 0
           for i in num
               total += i
           end
           return total
       end
sum1 (generic function with 1 method)

julia> a = collect(1:1000);

julia> @btime b = sum(a)
99.192 ns (1 allocation: 16 bytes)
500500

julia> @btime b = sum1($a)
71.325 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
71.288 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum(a)
99.042 ns (1 allocation: 16 bytes)
500500

julia> @btime b = sum(a)
99.042 ns (1 allocation: 16 bytes)
500500

julia> @btime b = sum1($a)
71.180 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
71.434 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
71.288 ns (0 allocations: 0 bytes)
500500

julia> a = 1:1000
1:1000

julia> @btime b = sum1($a)
2.245 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum(a)
29.193 ns (1 allocation: 16 bytes)
500500
``````

Why is the function `sum1` faster than the builtin `sum` function? I expected the opposite.
And why are both `sum1` and `sum` faster when the input is just `a = 1:1000`?

You seem to be interpolating `a` at some points and not at others; that alone can affect the timing.
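For reference, this is the pattern the BenchmarkTools documentation recommends: interpolating a global with `$` makes the benchmark treat it as a local variable, while referencing the untyped global directly adds overhead (and the extra allocation seen above):

``````julia
using BenchmarkTools

a = collect(1:1000)

@btime sum(a)    # `a` is an untyped global: extra overhead and an allocation
@btime sum($a)   # interpolated: `a` behaves like a local, measuring only `sum`
``````

Both calls return the same value, of course; only the measured time and allocation count differ.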

When the input is a range type, I'd guess the builtin `sum` dispatches to a method that uses a closed-form summation formula, e.g. S(n) = n(n+1)/2 or similar. I'm not sure why `sum1` also improves that much; 2 ns is so fast that the compiler might have computed the result ahead of time.
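For what it's worth, Base does have a specialized `sum` method for ranges; the idea can be sketched like this (`range_sum` is a hypothetical name, not Base's actual implementation, which also handles step ranges and overflow more carefully):

``````julia
# Closed-form sum of a unit range: (first + last) * length ÷ 2, in O(1).
# Exact for integers: first+last and length cannot both be odd.
range_sum(r::AbstractUnitRange) =
    isempty(r) ? zero(eltype(r)) : (first(r) + last(r)) * length(r) ÷ 2
``````

So `range_sum(1:1000)` agrees with `sum(1:1000)` while doing only a handful of operations regardless of the range length.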


I confirm the difference, but it's not entirely consistent (see below). Notice that the execution time varies for different input sizes, so I don't think it (only) takes the famous shortcut via the Gaussian formula, which should run at the same speed for all inputs and, moreover, should be a bit faster (it's just a handful of operations, so it should run in less than 30 ns or so). For bigger inputs it gets slower and slower.

The allocation in the OP's call comes from the missing interpolation, as @albheim pointed out.
The input is not a range type, by the way, since the range is passed through `collect()`.

The code is different for both methods:

``````
julia> a = collect(1:1000);

julia> @btime b = sum($a)
122.924 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
144.073 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
144.073 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum($a)
122.924 ns (0 allocations: 0 bytes)
500500

julia> a = collect(1:10000);

julia> @btime b = sum($a)
1.200 μs (0 allocations: 0 bytes)
50005000

julia> @btime b = sum1($a)
1.521 μs (0 allocations: 0 bytes)
50005000

julia> a = collect(1:100000);

julia> @btime b = sum($a)
12.333 μs (0 allocations: 0 bytes)
5000050000

julia> @btime b = sum1($a)
15.500 μs (0 allocations: 0 bytes)
5000050000

julia> @code_warntype sum1(a)
MethodInstance for sum1(::Vector{Int64})
from sum1(num) @ Main REPL[1]:1
Arguments
#self#::Core.Const(sum1)
num::Vector{Int64}
Locals
@_3::Union{Nothing, Tuple{Int64, Int64}}
total::Int64
i::Int64
Body::Int64
1 β       (total = 0)
β   %2  = num::Vector{Int64}
β         (@_3 = Base.iterate(%2))
β   %4  = (@_3 === nothing)::Bool
β   %5  = Base.not_int(%4)::Bool
βββ       goto #4 if not %5
2 β %7  = @_3::Tuple{Int64, Int64}
β         (i = Core.getfield(%7, 1))
β   %9  = Core.getfield(%7, 2)::Int64
β         (total = total + i)
β         (@_3 = Base.iterate(%2, %9))
β   %12 = (@_3 === nothing)::Bool
β   %13 = Base.not_int(%12)::Bool
βββ       goto #4 if not %13
3 β       goto #2

julia> @code_warntype sum(a)
MethodInstance for sum(::Vector{Int64})
from sum(a::AbstractArray; dims, kw...) @ Base reducedim.jl:994
Arguments
#self#::Core.Const(sum)
a::Vector{Int64}
Body::Int64
1 β      nothing
β   %2 = Base.:(:)::Core.Const(Colon())
β   %3 = Core.NamedTuple()::Core.Const(NamedTuple())
β   %4 = Base.pairs(%3)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())
β   %5 = Base.:(var"#sum#807")(%2, %4, #self#, a)::Int64
βββ      return %5

julia> a = collect(1:1000);

julia> @btime b = sum($a)
122.924 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
121.961 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum($a)
122.924 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
121.961 ns (0 allocations: 0 bytes)
500500
``````

You are spot on. My mistake. Interpolating (`b = sum($a)`) is speeding up the code.


The naive summation loop is about as fast as you can get with a single thread, given automatic SIMD optimizations. Apart from the interpolation difference, it looks like `Base.sum` is still somewhat slower, so there are probably some minor optimization opportunities in the maze of `mapreduce` machinery that `sum` goes through.
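For illustration, here is a variant that spells out the SIMD-friendly pattern explicitly (just a sketch; the plain loop in `sum1` typically auto-vectorizes on its own):

``````julia
function sum2(num)
    total = zero(eltype(num))
    # @simd allows reassociation of the additions so LLVM can vectorize;
    # @inbounds drops the bounds checks inside the loop
    @inbounds @simd for i in eachindex(num)
        total += num[i]
    end
    return total
end
``````

For integer elements the reassociation is lossless, so `sum2(collect(1:1000))` gives exactly the same result as the naive loop.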

LLVM is very good at recognizing this specific loop structure and optimizing it to the closed-form solution. You can see here that it's multiplying things rather than running a summation loop:

``````
julia> @code_llvm sum1(1:1000)
;  @ REPL[22]:1 within `sum1`
define i64 @julia_sum1_462([2 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(16) %0) #0 {
top:
;  @ REPL[22]:3 within `sum1`
; β @ range.jl:897 within `iterate`
; ββ @ range.jl:672 within `isempty`
; βββ @ range.jl:834 within `first`
; ββββ @ Base.jl:37 within `getproperty`
%1 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 0
; ββββ
; βββ @ range.jl:839 within `last`
; ββββ @ Base.jl:37 within `getproperty`
%2 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 1
; ββββ
; βββ @ operators.jl:376 within `>`
; ββββ @ int.jl:83 within `<`
%unbox = load i64, i64* %2, align 8
%unbox1 = load i64, i64* %1, align 8
%.not = icmp slt i64 %unbox, %unbox1
; ββββ
br i1 %.not, label %L30, label %L9

L9:                                               ; preds = %top
%3 = sub i64 %unbox, %unbox1
%4 = add i64 %unbox1, 1
%5 = mul i64 %3, %4
%6 = xor i64 %unbox1, -1
%7 = add i64 %unbox, %6
%8 = zext i64 %7 to i65
%9 = zext i64 %3 to i65
%10 = mul i65 %8, %9
%11 = lshr i65 %10, 1
%12 = trunc i65 %11 to i64
%13 = add i64 %unbox1, %5
%14 = add i64 %13, %12
; └
;  @ REPL[22]:6 within `sum1`
br label %L30

L30:                                              ; preds = %L9, %top
%value_phi11 = phi i64 [ 0, %top ], [ %14, %L9 ]
ret i64 %value_phi11
}
``````
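A quick way to convince yourself of what is going on: with `sum1` defined as in the thread, a range argument compiles to the closed form above, while a `Vector` argument runs an actual (vectorized) loop, so only the latter's runtime grows with length. The results agree either way:

``````julia
# sum1 as defined earlier in the thread (returning `total`)
function sum1(num)
    total = 0
    for i in num
        total += i
    end
    return total
end

# Same answer both ways: 1:10^6 is O(1) via the closed form,
# collect(1:10^6) is O(n) via the loop over the array.
sum1(1:10^6) == sum1(collect(1:10^6))
``````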

Itβs still weird that for larger input values the time grows. I used to show these kind of optimisations years ago and I am sure that older versions of Julia had a constant runtime for all input values of such a naive `sum` implementation. Iβd need to check my old presentationsβ¦

Edit: ah, stupid me… The example I was talking about was something like

``````
function sumup(n)
    s = 0
    for i in 1:n
        s += i
    end
    s
end
``````

which uses the Gaussian formula and has constant time for all `n` (with some limitations on correctness for large values):

``````
julia> @code_native sumup(10)
.text
; β @ REPL[1]:3 within `sumup'
; ββ @ range.jl:5 within `Colon'
; βββ @ range.jl:287 within `UnitRange'
; ββββ @ range.jl:292 within `unitrange_last'
testq	%rdi, %rdi
; ββββ
jle	L43
; β @ REPL[1] within `sumup'
movq	%rdi, %rax
sarq	\$63, %rax
andnq	%rdi, %rax, %rax
; β @ REPL[1]:4 within `sumup'
leaq	-1(%rax), %rdx
leaq	-2(%rax), %rcx
mulxq	%rcx, %rcx, %rdx
shldq	\$63, %rcx, %rdx
leaq	(%rdx,%rax,2), %rax
decq	%rax
; β @ REPL[1]:6 within `sumup'
retq
; β @ REPL[1] within `sumup'
L43:
xorl	%eax, %eax
; β @ REPL[1]:6 within `sumup'
retq
nop
; β
``````
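The closed form the compiler derives here can also be written by hand. A sketch (ignoring the wraparound semantics of `Int64` for very large `n`, which is where the "limitations on correctness" come in):

``````julia
# Gauss' formula: 1 + 2 + … + n = n(n+1)/2, computed in O(1)
gauss(n) = n <= 0 ? 0 : n * (n + 1) ÷ 2
``````

This is why `sumup` runs in constant time regardless of `n`: the loop never actually executes at runtime.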

In the OP example, an array is passed.

I am a bit confused, or amazed, that the compiler is able to figure out that the input array comes from `collect(1:1000)`, so it knows that it contains a contiguous array of consecutive numbers. That's kind of fascinating. Whatever happens afterwards is more or less a mystery to me without diving much deeper into the generated code.

I believe you are misreading the examples. The OP changes `a` to an actual range towards the end.

Maybe we are talking past each other. Let me sum up what I mean; maybe I am on the wrong path:

``````
julia> function sum1(num)
           total = 0
           for i in num
               total += i
           end
           return total
       end
sum1 (generic function with 1 method)

julia> a = collect(1:1000);

julia> @btime b = sum1($a)
144.073 ns (0 allocations: 0 bytes)
500500
``````

Summing up 1000 numbers within 144 ns does not seem physically plausible to me, given that individual CPU operations usually take on the order of a nanosecond. So I am wondering how/if the compiler knows that `a` comes from a consecutive list, which might explain some number magic happening here.

EDIT: so maybe I have a wrong feeling for CPU instruction times or don't understand the details. Changing a single element in the input array obviously still gives the right answer, so there is no Gauss trick. Whatever happens there, I don't understand it (probably some SIMD stuff)…

``````
julia> a = collect(1:1000);

julia> @btime b = sum1($a)
144.081 ns (0 allocations: 0 bytes)
500500

julia> a[123] = -23554
-23554

julia> @btime b = sum1($a)
144.101 ns (0 allocations: 0 bytes)
476823
``````

Yes, I think your intuition for CPU speed is slightly off. I'm not really an expert either, but I suspect these operations are simple enough that they can be performed at nearly one per clock cycle (which is a bit better than a nanosecond), and with 256-bit SIMD four `Int`s can be summed at the same time.
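A rough back-of-envelope supports this (the clock speed and SIMD width below are assumptions, not measurements):

``````julia
# Assume 256-bit SIMD (4 × Int64 per vector add) and a ~3 GHz clock:
n        = 1000
lanes    = 4
ghz      = 3.0                # cycles per nanosecond
simd_ops = n / lanes          # 250 vector additions
t_ns     = simd_ops / ghz     # ≈ 83 ns lower bound
``````

That puts a naive lower bound of roughly 83 ns on summing 1000 `Int64`s, the same ballpark as the measured ~120-144 ns once loads, loop overhead, and reduction of the vector lanes are included.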


Yes, definitely. With all this SIMD/AVX stuff, branch prediction, etc., things get complicated very quickly.

Donβt you mean from a range to an array?

No, I donβt.

I have written a blog post on this matter: Boosting Julia Performance: Benchmarking Code in Function, Global Scope, and Builtin


The blog entry is a good idea. Maybe a link to this discussion would be useful, as it provides details that will help further if readers want to analyse their own benchmarking results.

Updated
