julia> using BenchmarkTools
julia> function sum1(num)
total = 0
for i in num
total += i
end
return total
end
sum1 (generic function with 1 method)
julia> a = collect(1:1000);
julia> @btime b = sum(a)
99.192 ns (1 allocation: 16 bytes)
500500
julia> @btime b = sum1($a)
71.325 ns (0 allocations: 0 bytes)
500500
julia> @btime b = sum1($a)
71.288 ns (0 allocations: 0 bytes)
500500
julia> @btime b = sum(a)
99.042 ns (1 allocation: 16 bytes)
500500
julia> @btime b = sum(a)
99.042 ns (1 allocation: 16 bytes)
500500
julia> @btime b = sum1($a)
71.180 ns (0 allocations: 0 bytes)
500500
julia> @btime b = sum1($a)
71.434 ns (0 allocations: 0 bytes)
500500
julia> @btime b = sum1($a)
71.288 ns (0 allocations: 0 bytes)
500500
julia> a = 1:1000
1:1000
julia> @btime b = sum1($a)
2.245 ns (0 allocations: 0 bytes)
500500
julia> @btime b = sum(a)
29.193 ns (1 allocation: 16 bytes)
500500

Why the function sum1 is faster than builtin sum function? I just expected the opposite.
Why the functions sum1 and sum are faster when input is just a = 1:1000 ?

You seem to be interpolating a at some points and not at others, this could probably affect timing.

When the input is a range type I guess the standard sum dispatches to a method using some summation formula, e.g. S(n)=n*(n+1)/2, or similar. Not sure why sum1 also improves that much, 2ns seems like the compiler might have calculated it beforehand?

I confirm the difference, but itβs not really consistent (see below), but notice that the execution time varies for different values, so I donβt think it (only) does the famous shortcut via the Gaussian formula, which should run with the same speed for all integers and moreover, it should be a bit faster (itβs just a handful of operations, so it should run within less than 30ns or so). For bigger numbers itβs getting slower and slower.

The allocation in the OP call comes from the missing interpolation as @albheim pointed out.
The input is not a range type btw. since the range is passed to collect().

The naive summation loop is about as fast as you can do it with a single thread, given automatic SIMD optimizations. Apart from the interpolation difference it looks like Base.sum is still somewhat slower and there should be some minor optimization opportunities in the maze of mapreduce machinery that sum enters.

LLVM is very good at recognizing this specific loop structure and optimizing it to the closed form solution. You can see here that itβs multiplying things rather than doing a summation loop:

Itβs still weird that for larger input values the time grows. I used to show these kind of optimisations years ago and I am sure that older versions of Julia had a constant runtime for all input values of such a naive sum implementation. Iβd need to check my old presentationsβ¦

Edit: ah, stupid meβ¦ The example I am talking about was something like

function sumup(n)
s = 0
for i in 1:n
s += i
end
s
end

which is using the Gaussian formula and has constant time for all n (with some limitations on the correctness for large values

I am a bit confused or amazed that the compiler is able to figure out that the input array is coming from collect(1:1000), so it knows that it contains a continuous array of numbers. Thatβs kind of fascinating. Whatever happens afterwards is more or less a mystery to me without diving into the generated code much deeper

Maybe we talk past each other Let me sum up what I mean, maybe I am on the wrong path:

julia> function sum1(num)
total = 0
for i in num
total += i
end
return total
end
sum1 (generic function with 1 method)
julia> a = collect(1:1000);
julia> @btime b = sum1($a)
144.073 ns (0 allocations: 0 bytes)
500500

Summing up 1000 numbers within 144 is not really physical in my opinion, given that CPU operations are usually O(ns). So I am wondering how/if the compiler knows that a is coming from a consecutive list, might might explain that some number magic is happening.

EDIT: so maybe I have a wrong feeling for the CPU instruction times or donβt understand the details. Changing a single element in the input array obviously still gives the right answer, so there is no Gauss Whatever happens there, I donβt understand it (probably some SIMD stuff)β¦

Yes, I think your intuition for CPU speed is slightly off. Iβm not really an expert either but I suspect that these operations are simple enough that they can be performed at nearly one clock cycle each (which is a bit better than a nanosecond) and with 256 bit SIMD four Ints can be summed at the same time.

The blog entry is a good idea. Maybe a link to this discussion would be useful, as it provides details that will help further if readers want to analyse their own benchmarking results.