Why is the function sum1 faster than the builtin sum?

``````
julia> using BenchmarkTools

julia> function sum1(num)
           total = 0
           for i in num
               total += i
           end
           return total
       end
sum1 (generic function with 1 method)

julia> a = collect(1:1000);

julia> @btime b = sum(a)
99.192 ns (1 allocation: 16 bytes)
500500

julia> @btime b = sum1($a)
71.325 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
71.288 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum(a)
99.042 ns (1 allocation: 16 bytes)
500500

julia> @btime b = sum(a)
99.042 ns (1 allocation: 16 bytes)
500500

julia> @btime b = sum1($a)
71.180 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
71.434 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
71.288 ns (0 allocations: 0 bytes)
500500

julia> a = 1:1000
1:1000

julia> @btime b = sum1($a)
2.245 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum(a)
29.193 ns (1 allocation: 16 bytes)
500500
``````

Why is the function `sum1` faster than the builtin `sum` function? I expected the opposite.
And why are both `sum1` and `sum` faster when the input is just `a = 1:1000`?

You seem to be interpolating `a` at some points and not at others; that alone can affect the timing.
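For reference, this is the pattern the BenchmarkTools documentation recommends: interpolating a global with `$` makes the benchmark treat it as a local variable, while referencing the untyped global directly adds overhead (and the extra allocation seen above):

``````julia
using BenchmarkTools

a = collect(1:1000)

@btime sum(a)    # `a` is an untyped global: extra overhead and an allocation
@btime sum($a)   # interpolated: `a` behaves like a local, measuring only `sum`
``````

Both calls return the same value, of course; only the measured time and allocation count differ.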

When the input is a range type, I'd guess the builtin `sum` dispatches to a method that uses a closed-form summation formula, e.g. S(n) = n(n+1)/2 or similar. I'm not sure why `sum1` also improves that much; 2 ns is so fast that the compiler might have computed the result ahead of time.
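For what it's worth, Base does have a specialized `sum` method for ranges; the idea can be sketched like this (`range_sum` is a hypothetical name, not Base's actual implementation, which also handles step ranges and overflow more carefully):

``````julia
# Closed-form sum of a unit range: (first + last) * length ÷ 2, in O(1).
# Exact for integers: first+last and length cannot both be odd.
range_sum(r::AbstractUnitRange) =
    isempty(r) ? zero(eltype(r)) : (first(r) + last(r)) * length(r) ÷ 2
``````

So `range_sum(1:1000)` agrees with `sum(1:1000)` while doing only a handful of operations regardless of the range length.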


I confirm the difference, but it's not entirely consistent (see below). Notice that the execution time varies for different input sizes, so I don't think it (only) takes the famous shortcut via the Gaussian formula, which should run at the same speed for all inputs and, moreover, should be a bit faster (it's just a handful of operations, so it should run in less than 30 ns or so). For bigger inputs it gets slower and slower.

The allocation in the OP's call comes from the missing interpolation, as @albheim pointed out.
The input is not a range type, by the way, since the range is passed through `collect()`.

The code is different for both methods:

``````
julia> a = collect(1:1000);

julia> @btime b = sum($a)
122.924 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
144.073 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
144.073 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum($a)
122.924 ns (0 allocations: 0 bytes)
500500

julia> a = collect(1:10000);

julia> @btime b = sum($a)
1.200 μs (0 allocations: 0 bytes)
50005000

julia> @btime b = sum1($a)
1.521 μs (0 allocations: 0 bytes)
50005000

julia> a = collect(1:100000);

julia> @btime b = sum($a)
12.333 μs (0 allocations: 0 bytes)
5000050000

julia> @btime b = sum1($a)
15.500 μs (0 allocations: 0 bytes)
5000050000

julia> @code_warntype sum1(a)
MethodInstance for sum1(::Vector{Int64})
from sum1(num) @ Main REPL[1]:1
Arguments
#self#::Core.Const(sum1)
num::Vector{Int64}
Locals
@_3::Union{Nothing, Tuple{Int64, Int64}}
total::Int64
i::Int64
Body::Int64
1 β       (total = 0)
β   %2  = num::Vector{Int64}
β         (@_3 = Base.iterate(%2))
β   %4  = (@_3 === nothing)::Bool
β   %5  = Base.not_int(%4)::Bool
βββ       goto #4 if not %5
2 β %7  = @_3::Tuple{Int64, Int64}
β         (i = Core.getfield(%7, 1))
β   %9  = Core.getfield(%7, 2)::Int64
β         (total = total + i)
β         (@_3 = Base.iterate(%2, %9))
β   %12 = (@_3 === nothing)::Bool
β   %13 = Base.not_int(%12)::Bool
βββ       goto #4 if not %13
3 β       goto #2

julia> @code_warntype sum(a)
MethodInstance for sum(::Vector{Int64})
from sum(a::AbstractArray; dims, kw...) @ Base reducedim.jl:994
Arguments
#self#::Core.Const(sum)
a::Vector{Int64}
Body::Int64
1 β      nothing
β   %2 = Base.:(:)::Core.Const(Colon())
β   %3 = Core.NamedTuple()::Core.Const(NamedTuple())
β   %4 = Base.pairs(%3)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())
β   %5 = Base.:(var"#sum#807")(%2, %4, #self#, a)::Int64
βββ      return %5

julia> a = collect(1:1000);

julia> @btime b = sum($a)
122.924 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
121.961 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum($a)
122.924 ns (0 allocations: 0 bytes)
500500

julia> @btime b = sum1($a)
121.961 ns (0 allocations: 0 bytes)
500500
``````

You are spot on. My mistake. Interpolating (`b = sum($a)`) is speeding up the code.


The naive summation loop is about as fast as you can get with a single thread, given automatic SIMD optimizations. Apart from the interpolation difference, it looks like `Base.sum` is still somewhat slower, so there are probably some minor optimization opportunities in the maze of `mapreduce` machinery that `sum` goes through.
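For illustration, here is a variant that spells out the SIMD-friendly pattern explicitly (just a sketch; the plain loop in `sum1` typically auto-vectorizes on its own):

``````julia
function sum2(num)
    total = zero(eltype(num))
    # @simd allows reassociation of the additions so LLVM can vectorize;
    # @inbounds drops the bounds checks inside the loop
    @inbounds @simd for i in eachindex(num)
        total += num[i]
    end
    return total
end
``````

For integer elements the reassociation is lossless, so `sum2(collect(1:1000))` gives exactly the same result as the naive loop.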

LLVM is very good at recognizing this specific loop structure and optimizing it to the closed-form solution. You can see here that it's multiplying things rather than running a summation loop:

``````
julia> @code_llvm sum1(1:1000)
;  @ REPL[22]:1 within `sum1`
define i64 @julia_sum1_462([2 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(16) %0) #0 {
top:
;  @ REPL[22]:3 within `sum1`
; β @ range.jl:897 within `iterate`
; ββ @ range.jl:672 within `isempty`
; βββ @ range.jl:834 within `first`
; ββββ @ Base.jl:37 within `getproperty`
%1 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 0
; ββββ
; βββ @ range.jl:839 within `last`
; ββββ @ Base.jl:37 within `getproperty`
%2 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 1
; ββββ
; βββ @ operators.jl:376 within `>`
; ββββ @ int.jl:83 within `<`
%unbox = load i64, i64* %2, align 8
%unbox1 = load i64, i64* %1, align 8
%.not = icmp slt i64 %unbox, %unbox1
; ββββ
br i1 %.not, label %L30, label %L9

L9:                                               ; preds = %top
%3 = sub i64 %unbox, %unbox1
%4 = add i64 %unbox1, 1
%5 = mul i64 %3, %4
%6 = xor i64 %unbox1, -1
%7 = add i64 %unbox, %6
%8 = zext i64 %7 to i65
%9 = zext i64 %3 to i65
%10 = mul i65 %8, %9
%11 = lshr i65 %10, 1
%12 = trunc i65 %11 to i64
%13 = add i64 %unbox1, %5
%14 = add i64 %13, %12
; └
;  @ REPL[22]:6 within `sum1`
br label %L30

L30:                                              ; preds = %L9, %top
%value_phi11 = phi i64 [ 0, %top ], [ %14, %L9 ]
ret i64 %value_phi11
}
``````
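A quick way to convince yourself of what is going on: with `sum1` defined as in the thread, a range argument compiles to the closed form above, while a `Vector` argument runs an actual (vectorized) loop, so only the latter's runtime grows with length. The results agree either way:

``````julia
# sum1 as defined earlier in the thread (returning `total`)
function sum1(num)
    total = 0
    for i in num
        total += i
    end
    return total
end

# Same answer both ways: 1:10^6 is O(1) via the closed form,
# collect(1:10^6) is O(n) via the loop over the array.
sum1(1:10^6) == sum1(collect(1:10^6))
``````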

Itβs still weird that for larger input values the time grows. I used to show these kind of optimisations years ago and I am sure that older versions of Julia had a constant runtime for all input values of such a naive `sum` implementation. Iβd need to check my old presentationsβ¦

Edit: ah, stupid me… The example I was talking about was something like

``````
function sumup(n)
    s = 0
    for i in 1:n
        s += i
    end
    s
end
``````

which uses the Gaussian formula and has constant time for all `n` (with some limitations on correctness for large values):

``````
julia> @code_native sumup(10)
.text
; β @ REPL[1]:3 within `sumup'
; ββ @ range.jl:5 within `Colon'
; βββ @ range.jl:287 within `UnitRange'
; ββββ @ range.jl:292 within `unitrange_last'
testq	%rdi, %rdi
; ββββ
jle	L43
; β @ REPL[1] within `sumup'
movq	%rdi, %rax
sarq	\$63, %rax
andnq	%rdi, %rax, %rax
; β @ REPL[1]:4 within `sumup'
leaq	-1(%rax), %rdx
leaq	-2(%rax), %rcx
mulxq	%rcx, %rcx, %rdx
shldq	\$63, %rcx, %rdx
leaq	(%rdx,%rax,2), %rax
decq	%rax
; β @ REPL[1]:6 within `sumup'
retq
; β @ REPL[1] within `sumup'
L43:
xorl	%eax, %eax
; β @ REPL[1]:6 within `sumup'
retq
nop
; β
``````
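The closed form the compiler derives here can also be written by hand. A sketch (ignoring the wraparound semantics of `Int64` for very large `n`, which is where the "limitations on correctness" come in):

``````julia
# Gauss' formula: 1 + 2 + … + n = n(n+1)/2, computed in O(1)
gauss(n) = n <= 0 ? 0 : n * (n + 1) ÷ 2
``````

This is why `sumup` runs in constant time regardless of `n`: the loop never actually executes at runtime.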

In the OP example, an array is passed.

I am a bit confused, or amazed, that the compiler is able to figure out that the input array comes from `collect(1:1000)`, so it knows that it contains a contiguous array of consecutive numbers. That's kind of fascinating. Whatever happens afterwards is more or less a mystery to me without diving much deeper into the generated code.

I believe you are misreading the examples. The OP changes `a` to an actual range towards the end.

Maybe we are talking past each other. Let me sum up what I mean; maybe I am on the wrong path:

``````
julia> function sum1(num)
           total = 0
           for i in num
               total += i
           end
           return total
       end
sum1 (generic function with 1 method)

julia> a = collect(1:1000);

julia> @btime b = sum1($a)
144.073 ns (0 allocations: 0 bytes)
500500
``````

Summing up 1000 numbers within 144 ns does not seem physically plausible to me, given that individual CPU operations usually take on the order of a nanosecond. So I am wondering how/if the compiler knows that `a` comes from a consecutive list, which might explain some number magic happening here.

EDIT: so maybe I have a wrong feeling for CPU instruction times or don't understand the details. Changing a single element in the input array obviously still gives the right answer, so there is no Gauss trick. Whatever happens there, I don't understand it (probably some SIMD stuff)…

``````
julia> a = collect(1:1000);

julia> @btime b = sum1($a)
144.081 ns (0 allocations: 0 bytes)
500500

julia> a[123] = -23554
-23554

julia> @btime b = sum1($a)
144.101 ns (0 allocations: 0 bytes)
476823
``````

Yes, I think your intuition for CPU speed is slightly off. I'm not really an expert either, but I suspect these operations are simple enough that they can be performed at nearly one per clock cycle (which is a bit better than a nanosecond), and with 256-bit SIMD four `Int`s can be summed at the same time.
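A rough back-of-envelope supports this (the clock speed and SIMD width below are assumptions, not measurements):

``````julia
# Assume 256-bit SIMD (4 × Int64 per vector add) and a ~3 GHz clock:
n        = 1000
lanes    = 4
ghz      = 3.0                # cycles per nanosecond
simd_ops = n / lanes          # 250 vector additions
t_ns     = simd_ops / ghz     # ≈ 83 ns lower bound
``````

That puts a naive lower bound of roughly 83 ns on summing 1000 `Int64`s, the same ballpark as the measured ~120-144 ns once loads, loop overhead, and reduction of the vector lanes are included.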


Yes, definitely. With all this SIMD/AVX stuff, branch prediction, etc., things get complicated very quickly.

Donβt you mean from a range to an array?

No, I donβt.

I have written a blog post on this matter: Boosting Julia Performance: Benchmarking Code in Function, Global Scope, and Builtin


The blog entry is a good idea. Maybe a link to this discussion would be useful, as it provides details that will help further if readers want to analyse their own benchmarking results.

Updated
