First, notice that a few ns is much faster than what would be expected for 10^6 ops. In fact, in both cases, the compiler translated the code into a constant-complexity algorithm similar to \displaystyle\sum_{i=1}^n i = \frac{n(n+1)}{2}
This can be shown by making n vary, and observing that the computing time stays the same.
julia> @btime sum(i for i = 1:1_000_000)
4.316 ns (0 allocations: 0 bytes)
500000500000
julia> @btime sum(i for i = 1:10_000)
3.962 ns (0 allocations: 0 bytes)
50005000
julia> @btime sum(i for i = 1:100)
4.323 ns (0 allocations: 0 bytes)
5050
Now, as to your real question, looking at the output of @code_native for both variants, it looks like one of the function calls in the generator version has not been inlined, whereas the for loop of course involves no function call at all.
This difference would not have been noticeable in any other “regular” case where the compiler would not have known a handy trick to transform a loop into a mere arithmetic formula.
Wow… I had no idea the compiler could be so clever.
How exactly are you using @code_native here? I am wondering because my original examples are not function calls and you cannot apply @code_native directly on them.
I would say it is OK: the function still has to perform a few tests in order to detect whether we are in the n==0 specific case (and these tests are probably a bit more complicated than in the case of the for loop, since they involve the iterator interface implemented by the generator). In light of this, the compiler might very well have decided that inlining the next function call did not present too much of a benefit.
In the general case (and this test case is far from being representative of a real reduction over a collection), I would expect this to have a significant impact only for very small collections (let’s say less than 10 elements).