Why summing a generator is slightly slower?

e3c6 · April 28, 2019, 8:51pm

Why is this:

@btime (s = 0; for i = 1:10^6 s += i end; s)
#   4.449 ns (0 allocations: 0 bytes)
# 500000500000

significantly faster than this:

@btime sum(i for i = 1:10^6)
#   7.038 ns (0 allocations: 0 bytes)
# 500000500000

Though the difference is small, it’s noticeable.
What is the extra operation that the generator approach is doing?

ffevotte · April 28, 2019, 9:35pm

First, notice that a few ns is much faster than what would be expected for 10^6 ops. In fact, in both cases, the compiler translated the code into a constant-complexity algorithm similar to
\displaystyle\sum_{i=1}^n i = \frac{n(n+1)}{2}

This can be shown by making n vary, and observing that the computing time stays the same.

julia> @btime sum(i for i = 1:1_000_000)
  4.316 ns (0 allocations: 0 bytes)
500000500000

julia> @btime sum(i for i = 1:10_000)
  3.962 ns (0 allocations: 0 bytes)
50005000

julia> @btime sum(i for i = 1:100)
  4.323 ns (0 allocations: 0 bytes)
5050

Now, as to your real question, looking at the output of @code_native for both variants, it looks like one of the function calls in the generator version has not been inlined, whereas the for loop of course involves no function call at all.

This difference would not have been noticeable in any other “regular” case where the compiler would not have known a handy trick to transform a loop into a mere arithmetic formula.

e3c6 · April 28, 2019, 9:47pm

Wow… I had no idea the compiler could be so clever.

How exactly are you using @code_native here? I am wondering because my original examples are not function calls and you cannot apply @code_native directly on them.

ffevotte · April 28, 2019, 9:54pm

Well, I wrap your expressions into functions

function f(n)
    s = 0
    for i in 1:n
        s += i
    end
    s
end

g(n) = sum(i for i in 1:n)

julia> @btime f(1_000_000)
  1.881 ns (0 allocations: 0 bytes)
500000500000

julia> @btime g(1_000_000)
  4.108 ns (0 allocations: 0 bytes)
500000500000

julia> @code_native f(10)
        .text
; special case: if n==0 return 0
        testq   %rdi, %rdi
        jle     L32
; general case: n != 0
; this is where the magic happens and the formula is evaluated
        leaq    -1(%rdi), %rdx
        leaq    -2(%rdi), %rax
        mulxq   %rax, %rax, %rcx
        shldq   $63, %rax, %rcx
        leaq    (%rcx,%rdi,2), %rax
        addq    $-1, %rax
        retq
L32:
        xorl    %eax, %eax
        retq
        nopw    %cs:(%rax,%rax)

julia> @code_native g(10)
... # much longer output, without any `mul`, but with `call`

e3c6 · April 28, 2019, 10:16pm

Is that okay, or a bug in the Julia compiler?

ffevotte · April 28, 2019, 10:24pm

I would say it is OK: the function still has to perform a few tests in order to detect whether we are in the n==0 specific case (and these tests are probably a bit more complicated than in the case of the for loop, since they involve the iterator interface implemented by the generator). In light of this, the compiler might very well have decided that inlining the next function call did not present too much of a benefit.

In the general case (and this test case is far from being representative of a real reduction over a collection), I would expect this to have a significant impact only for very small collections (let’s say less than 10 elements).

e3c6 · April 28, 2019, 10:26pm

@ffevotte Thanks so much for your answers.

ffevotte · April 29, 2019, 9:55am

A colleague of mine just sent me the following link, which is relevant here:

Topic		Replies	Views
Generators speed New to Julia	5	1316	May 15, 2017
Sum over BigInt better performance without generator Performance	3	536	May 31, 2020
Different speed when estimating pi General Usage question	4	858	December 29, 2017
Simple conditional loop Performance question	4	422	December 7, 2020
Strange summation timings General Usage	3	418	August 17, 2018

Why summing a generator is slightly slower?

Related topics