Why summing a generator is slightly slower?

Why is this:

@btime (s = 0; for i = 1:10^6 s += i end; s)
#   4.449 ns (0 allocations: 0 bytes)
# 500000500000


significantly faster than this:

@btime sum(i for i = 1:10^6)
#   7.038 ns (0 allocations: 0 bytes)
# 500000500000


Though the difference is small, it’s noticeable.
What is the extra operation that the generator approach is doing?

First, notice that a few ns is much faster than what would be expected for 10^6 ops. In fact, in both cases, the compiler translated the code into a constant-complexity algorithm similar to
\displaystyle\sum_{i=1}^n i = \frac{n(n+1)}{2}

This can be shown by making n vary, and observing that the computing time stays the same.

julia> @btime sum(i for i = 1:1_000_000)
4.316 ns (0 allocations: 0 bytes)
500000500000

julia> @btime sum(i for i = 1:10_000)
3.962 ns (0 allocations: 0 bytes)
50005000

julia> @btime sum(i for i = 1:100)
4.323 ns (0 allocations: 0 bytes)
5050


Now, as to your real question, looking at the output of @code_native for both variants, it looks like one of the function calls in the generator version has not been inlined, whereas the for loop of course involves no function call at all.

This difference would not have been noticeable in any other “regular” case where the compiler would not have known a handy trick to transform a loop into a mere arithmetic formula.

3 Likes

Wow… I had no idea the compiler could be so clever.

How exactly are you using @code_native here? I am wondering because my original examples are not function calls and you cannot apply @code_native directly on them.

Well, I wrap your expressions into functions

function f(n)
s = 0
for i in 1:n
s += i
end
s
end

g(n) = sum(i for i in 1:n)

julia> @btime f(1_000_000)
1.881 ns (0 allocations: 0 bytes)
500000500000

julia> @btime g(1_000_000)
4.108 ns (0 allocations: 0 bytes)
500000500000

julia> @code_native f(10)
.text
; special case: if n==0 return 0
testq   %rdi, %rdi
jle     L32
; general case: n != 0
; this is where the magic happens and the formula is evaluated
leaq    -1(%rdi), %rdx
leaq    -2(%rdi), %rax
mulxq   %rax, %rax, %rcx
shldq   $63, %rax, %rcx leaq (%rcx,%rdi,2), %rax addq$-1, %rax
retq
L32:
xorl    %eax, %eax
retq
nopw    %cs:(%rax,%rax)

julia> @code_native g(10)
... # much longer output, without any mul, but with call

2 Likes

Is that okay, or a bug in the Julia compiler?

I would say it is OK: the function still has to perform a few tests in order to detect whether we are in the n==0 specific case (and these tests are probably a bit more complicated than in the case of the for loop, since they involve the iterator interface implemented by the generator). In light of this, the compiler might very well have decided that inlining the next function call did not present too much of a benefit.

In the general case (and this test case is far from being representative of a real reduction over a collection), I would expect this to have a significant impact only for very small collections (let’s say less than 10 elements).

1 Like