Why is summing a generator slightly slower?

Why is this:

using BenchmarkTools  # provides @btime

@btime (s = 0; for i = 1:10^6 s += i end; s)
#   4.449 ns (0 allocations: 0 bytes)
# 500000500000

significantly faster than this:

@btime sum(i for i = 1:10^6)
#   7.038 ns (0 allocations: 0 bytes)
# 500000500000

Though the difference is small, it’s noticeable.
What is the extra operation that the generator approach is doing?

First, notice that a few nanoseconds is far faster than one would expect for 10^6 additions. In fact, in both cases the compiler has transformed the code into a constant-time computation, essentially Gauss's closed-form formula
$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$$

This can be confirmed by varying n and observing that the run time stays essentially constant.

julia> @btime sum(i for i = 1:1_000_000)
  4.316 ns (0 allocations: 0 bytes)

julia> @btime sum(i for i = 1:10_000)
  3.962 ns (0 allocations: 0 bytes)

julia> @btime sum(i for i = 1:100)
  4.323 ns (0 allocations: 0 bytes)
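These timings are consistent with the closed form. As a quick sanity check (my own addition, using only Base Julia), the sums do match Gauss's formula:

```julia
# Check that summing 1:n matches the closed form n(n+1)/2
# for several sizes (÷ is integer division).
for n in (100, 10_000, 1_000_000)
    @assert sum(i for i in 1:n) == n * (n + 1) ÷ 2
end
```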

Now, as to your real question: looking at the output of @code_native for both variants, it appears that one of the function calls in the generator version has not been inlined, whereas the for loop of course involves no function call at all.

This difference would not be noticeable in a more "regular" case, where the compiler does not know a handy trick to transform the loop into a mere arithmetic formula.
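For instance (a hedged sketch of such a "regular" case; the names loop_sum and gen_sum are mine): when reducing over runtime data from a vector, where no closed form exists, the loop and the generator compute the same value and do comparable work per element:

```julia
# A "regular" reduction over runtime data: no closed-form shortcut here,
# so both versions actually iterate over the elements.
function loop_sum(v)
    s = 0.0
    for x in v
        s += x
    end
    return s
end

gen_sum(v) = sum(x for x in v)

v = collect(1.0:1000.0)
# Both compute 500500.0 (exact in Float64: all partial sums are
# integer-valued and well below 2^53).
loop_sum(v) == gen_sum(v)
```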


Wow… I had no idea the compiler could be so clever.

How exactly are you using @code_native here? I am wondering because my original examples are not function calls and you cannot apply @code_native directly on them.

Well, I wrap your expressions into functions 🙂

function f(n)
    s = 0
    for i in 1:n
        s += i
    end
    return s
end

g(n) = sum(i for i in 1:n)
julia> @btime f(1_000_000)
  1.881 ns (0 allocations: 0 bytes)

julia> @btime g(1_000_000)
  4.108 ns (0 allocations: 0 bytes)
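(For completeness, a quick self-contained correctness check that the two wrappers agree; f2 and g2 are my own names, to avoid clobbering the definitions above:)

```julia
# The same two wrappers, renamed, plus a check against the formula.
function f2(n)
    s = 0
    for i in 1:n
        s += i
    end
    return s
end

g2(n) = sum(i for i in 1:n)

@assert f2(1_000_000) == g2(1_000_000) == 1_000_000 * 1_000_001 ÷ 2
```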
julia> @code_native f(10)
; special case: if n==0 return 0
        testq   %rdi, %rdi
        jle     L32
; general case: n != 0
; this is where the magic happens and the formula is evaluated
        leaq    -1(%rdi), %rdx
        leaq    -2(%rdi), %rax
        mulxq   %rax, %rax, %rcx
        shldq   $63, %rax, %rcx
        leaq    (%rcx,%rdi,2), %rax
        addq    $-1, %rax
        retq
; n <= 0 case: return 0
L32:
        xorl    %eax, %eax
        retq
        nopw    %cs:(%rax,%rax)
julia> @code_native g(10)
... # much longer output, without any `mul`, but with `call`

Is that okay, or is it a bug in the Julia compiler?

I would say it is OK: the function still has to perform a few tests to detect whether we are in the n == 0 special case, and these tests are probably a bit more complicated than in the case of the for loop, since they go through the iterator interface implemented by the generator. In light of this, the compiler may well have decided that inlining the next function call would not bring much benefit.

In the general case (and this test case is far from representative of a real reduction over a collection), I would expect this to have a significant impact only for very small collections (say, fewer than 10 elements).


@ffevotte Thanks so much for your answers.

A colleague of mine just sent me the following link, which is relevant here: