Hi,

```
julia> N = 10000000; A = randn(N); @time sum(A); @time sum(A[i] for i in 1:N); @time sum(a for a in A)
0.007686 seconds (5 allocations: 176 bytes)
0.640131 seconds (30.02 M allocations: 458.566 MB, 13.64% gc time)
0.039558 seconds (16.76 k allocations: 713.857 KB)
```

on both 0.5 and 0.6-rc. Am I wrong in thinking the second and third versions should also be fast, based on the “Generators” section of https://julialang.org/blog/2016/10/julia-0.5-highlights?

You should not benchmark in global scope.
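For context, the reason global-scope benchmarks mislead is that a non-`const` global's type can change at any time, so the compiler cannot specialize code that reads it. A minimal sketch of the two situations (the names `sumglobal`/`sumarg` are made up for illustration):

```julia
# B is a non-const global: any code reading it must handle any type,
# so sumglobal() falls back to dynamic dispatch and extra allocations.
B = randn(1000)
sumglobal() = sum(B[i] for i in 1:length(B))

# Passing the array as an argument lets Julia specialize on its concrete type.
sumarg(A) = sum(A[i] for i in 1:length(A))
```

Both compute the same value; only the argument-passing version can be compiled to tight code.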

```
using BenchmarkTools
N = 10000000
A = randn(N)
sum1(A) = sum(A)
sum2(A) = sum(A[i] for i in 1:N)
sum3(A) = sum(a for a in A)
```

then

```
julia> @benchmark sum1($A)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.576 ms (0.00% GC)
median time: 5.744 ms (0.00% GC)
mean time: 5.672 ms (0.00% GC)
maximum time: 6.901 ms (0.00% GC)
--------------
samples: 881
evals/sample: 1
julia> @benchmark sum2($A)
BenchmarkTools.Trial:
memory estimate: 112 bytes
allocs estimate: 5
--------------
minimum time: 12.764 ms (0.00% GC)
median time: 12.885 ms (0.00% GC)
mean time: 12.972 ms (0.00% GC)
maximum time: 16.242 ms (0.00% GC)
--------------
samples: 386
evals/sample: 1
julia> @benchmark sum3($A)
BenchmarkTools.Trial:
memory estimate: 48 bytes
allocs estimate: 3
--------------
minimum time: 12.761 ms (0.00% GC)
median time: 12.874 ms (0.00% GC)
mean time: 12.941 ms (0.00% GC)
maximum time: 16.360 ms (0.00% GC)
--------------
samples: 387
evals/sample: 1
```

so the difference is 2-3x. The above uses `v0.6-rc1`.

FWIW `sum2` is still using global variables (the non-`const` `N`)…
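A version of `sum2` that takes everything it needs from its argument would look like this (a sketch; `sum2b` is a made-up name, and `eachindex` is the idiomatic way to iterate an array's indices):

```julia
# Iterate over the array's own indices instead of the global N.
sum2b(A) = sum(A[i] for i in eachindex(A))
```

With this change the generator version no longer touches global state at all.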

That’s right, sorry about that! I was extracting the code from a function that had weird performance and forgot about the (very annoying) globals problem. So here’s a more interesting benchmark, closer to what I’m actually doing:

```
using BenchmarkTools
const N = 10000000
A = randn(N)
sum1(A) = sum(A.*A.+A)
sum2(A) = sum(A.*A+A)
sum3(A) = sum(A[i]*A[i]+A[i] for i in 1:N)
sum4(A) = sum(a*a+a for a in A)
function sum5(A)
    res = 0.0
    for i = 1:N
        res += A[i]*A[i]+A[i]
    end
end
BenchmarkTools.DEFAULT_PARAMETERS.samples = 10
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2
@btime sum1($A)
@btime sum2($A)
@btime sum3($A)
@btime sum4($A)
@btime sum5($A)
```

I get

```
36.951 ms (2 allocations: 76.29 MiB)
80.213 ms (4 allocations: 152.59 MiB)
12.607 ms (4 allocations: 80 bytes)
12.628 ms (3 allocations: 48 bytes)
4.011 ms (0 allocations: 0 bytes)
```

I understand the first two timings (they imply a vector allocation), but not the third and fourth. Are any of these timings considered a bug? Does it mean I’m condemned to boring loops?
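As a sanity check on the first two timings, `@allocated` confirms that the unfused `A.*A+A` materializes two temporary arrays while the dot-fused `A.*A.+A` materializes only one (the function names here are made up for illustration):

```julia
fused(A)   = A .* A .+ A   # dots fuse into one broadcast: one temporary array
unfused(A) = A .* A + A    # the bare + is not fused: two temporary arrays

A = randn(10^6)
fused(A); unfused(A)       # warm up so @allocated excludes compilation
a1 = @allocated fused(A)
a2 = @allocated unfused(A)
# a2 should be roughly twice a1 (two length-N arrays versus one)
```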

How fast is your `sum5` if you actually return the result? (As written, nothing observes `res`, so the compiler is free to optimize most of the loop away.)

This is maybe too specialized to generalize to your real problem but you can also compare

```
sum6(A) = sum(x -> x * x + x, A)
```

That’s a good point… then the speed difference vanishes. I also included an `@inbounds @simd` version of the loop. The final benchmark is

```
using BenchmarkTools
const N = 10000000
A = randn(N)
sum1(A) = sum(A.*A.+A)
sum2(A) = sum(A.*A+A)
sum3(A) = sum(A[i]*A[i]+A[i] for i in 1:N)
sum4(A) = sum(a*a+a for a in A)
function sum5(A)
    res = 0.0
    for i = 1:N
        res += A[i]*A[i]+A[i]
    end
    res
end
function sum6(A)
    res = 0.0
    @inbounds @simd for i = 1:N
        res += A[i]*A[i]+A[i]
    end
    res
end
sum7(A) = sum(x -> x * x + x, A)
BenchmarkTools.DEFAULT_PARAMETERS.samples = 10
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2
@btime sum1($A)
@btime sum2($A)
@btime sum3($A)
@btime sum4($A)
@btime sum5($A)
@btime sum6($A)
@btime sum7($A)
```

with result (0.6 rc)

```
38.080 ms (2 allocations: 76.29 MiB)
84.583 ms (4 allocations: 152.59 MiB)
12.619 ms (4 allocations: 80 bytes)
12.606 ms (3 allocations: 48 bytes)
12.629 ms (0 allocations: 0 bytes)
7.230 ms (0 allocations: 0 bytes)
7.297 ms (0 allocations: 0 bytes)
```

(neither `@inbounds` nor `@simd` by itself is able to get the 7 ms; both are needed)

Amazingly, `sum(x -> x * x + x, A)` is able to SIMD automatically! I don’t understand how that can happen: looking at the implementation, it falls back to `mapfoldl_impl`, which is a simple `while` loop…
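One way to investigate is to inspect the emitted code directly; in current Julia, `sum(f, A)` over an array goes through `mapreduce`, whose inner loop is written with `@simd`, which is presumably where the vectorization comes from. A quick check (sketch; vectorized reductions show `<4 x double>` or wider operations in the IR):

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL

sum7(A) = sum(x -> x * x + x, A)
A = randn(1000)
# Print the LLVM IR; look for vector (<N x double>) instructions.
@code_llvm sum7(A)
```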