# Strange summation timings

I wanted to test different ways of writing a simple summation. Here’s the code:

``````
# series with slow convergence to pi
function pisumloop(n)
    s = 0.0
    for k = 1:n
        s += 1 / k^2
    end
    return sqrt(6*s)
end

pisumvec(n) = sqrt(6*sum(1 ./ (1:n).^2))

# array comprehension
pisumcomp(n) = sqrt(6*sum([1/k^2 for k=1:n]))

# generator expression
pisumgen(n) = sqrt(6*sum(1/k^2 for k=1:n))
``````

As you can see from the following benchmark results on Julia 0.6.4, `pisumvec` and `pisumcomp` are a bit slower, presumably because they allocate temporary arrays. This is pretty much what I expected, although I thought the difference would be a bit larger.

``````
julia> using BenchmarkTools

julia> @btime pisumloop(10^4);
37.752 μs (0 allocations: 0 bytes)

julia> @btime pisumvec(10^4);
40.094 μs (2 allocations: 78.20 KiB)

julia> @btime pisumcomp(10^4);
40.094 μs (2 allocations: 78.20 KiB)

julia> @btime pisumgen(10^4);
37.752 μs (4 allocations: 96 bytes)
``````

But here’s what I get on Julia 1.0:

``````
julia> @btime pisumloop(10^4);
37.752 μs (0 allocations: 0 bytes)

julia> @btime pisumvec(10^4);
20.778 μs (3 allocations: 78.23 KiB)

julia> @btime pisumcomp(10^4);
20.778 μs (2 allocations: 78.20 KiB)

julia> @btime pisumgen(10^4);
37.752 μs (0 allocations: 0 bytes)
``````

What’s going on here? How did `pisumvec` and `pisumcomp` become twice as fast in 1.0? And how can they stomp all over `pisumloop` and `pisumgen`, which (I thought) should be better?

Another weird thing: all the timings above were done on an old Core i7-4770K, a 4-core machine running at 3.5 GHz. When I tried the same code on an i7-8700K (6 cores at 3.7 GHz, or a single core at 4.7 GHz), all benchmarks finished in only 7 to 8 μs. Why is there such a huge difference between these machines? From the difference in single-core clock speeds I expected the newer machine to be only 25% faster.

Change `for` to `@simd for` to turn on SIMD optimization for that loop.
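For reference, here is a minimal sketch of what that change could look like. The name `pisumsimd` is made up for this example so it can sit next to the original `pisumloop`. `@simd` lets the compiler reorder the floating-point additions, so the result may differ from the plain loop in the last few bits.

```julia
# Hypothetical variant of pisumloop with the loop annotated for SIMD.
# @simd permits reassociating the floating-point reduction into `s`,
# which is what allows the compiler to vectorize it.
function pisumsimd(n)
    s = 0.0
    @simd for k = 1:n
        s += 1 / k^2
    end
    return sqrt(6 * s)
end

# The series converges slowly, so this is only a rough approximation of pi.
println(pisumsimd(10^4))
```

It can be timed the same way as the others, with `@btime pisumsimd(10^4);`.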

Clock speeds have been the least important thing about the CPU for decades now.

Start Julia with the options `-O3 --math-mode=fast` so that automatic SIMD vectorization kicks in. The results should then look like this:

``````
julia> @btime pisumloop(10^4);
17.448 μs (0 allocations: 0 bytes)

julia> @btime pisumvec(10^4);
19.501 μs (3 allocations: 78.23 KiB)

julia> @btime pisumcomp(10^4);
19.244 μs (2 allocations: 78.20 KiB)

julia> @btime pisumgen(10^4);
17.448 μs (0 allocations: 0 bytes)
``````

Not for pure single-core floating-point benchmarks, which is what my code examples should be. For example, my old machine completes a 2M Superpi 1.5 benchmark in 22.2 seconds and the new one in 16.9 seconds, which is very close to the 25% speedup I expected from raw clock speeds, despite the 5 years and several processor generations between these chips.

But thanks for the SIMD suggestions, I’ll try them out. I still find the (untweaked) difference between array comprehensions and generator expressions unintuitive though. They seem so similar to me.
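On the comprehension versus generator point, a small sketch of how the two differ despite the similar syntax: the comprehension builds the whole array up front, while the generator is a lazy object that produces values one at a time. As far as I understand it, `sum` over an array goes through Base's pairwise reduction, whose inner loop can be vectorized, while `sum` over a generator is a plain left-to-right fold. That would be consistent with the 1.0 timings above, but it is an assumption about Base internals and not something measured in this thread.

```julia
n = 5

comp = [1 / k^2 for k = 1:n]   # eager: allocates an array of Float64
gen  = (1 / k^2 for k = 1:n)   # lazy: a Base.Generator, no array allocated

println(typeof(comp))
println(typeof(gen))

# Both give the same sum up to floating-point rounding;
# only the order of the additions may differ.
println(sum(comp) ≈ sum(gen))
```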