Strange summation timings

I wanted to test different ways of writing a simple summation; here's the code:

# Basel series: sum of 1/k^2 converges (slowly) to pi^2/6
function pisumloop(n)
    s = 0.0
    for k = 1:n
        s += 1 / k^2
    end
    return sqrt(6*s)
end

# vectorization + broadcasting
pisumvec(n) = sqrt(6*sum(1 ./ (1:n).^2))

# array comprehension
pisumcomp(n) = sqrt(6*sum([1/k^2 for k=1:n]))

# generator expression
pisumgen(n) = sqrt(6*sum(1/k^2 for k=1:n))
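
A quick sanity check that all four variants agree and converge toward pi (the tolerances below are just safe bounds I picked, nothing rigorous):

# all four compute the same partial sum, so they should match up to
# floating-point rounding and approach pi as n grows
@assert pisumloop(10^4) ≈ pisumvec(10^4) ≈ pisumcomp(10^4) ≈ pisumgen(10^4)
@assert isapprox(pisumgen(10^6), π; atol=1e-5)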

As you can see from the following benchmark results on Julia 0.6.4, pisumvec and pisumcomp are a bit slower, presumably because they allocate temporary arrays. This is pretty much what I expected, although I thought the difference would be a bit larger.

julia> using BenchmarkTools

julia> @btime pisumloop(10^4);
  37.752 μs (0 allocations: 0 bytes)

julia> @btime pisumvec(10^4);
  40.094 μs (2 allocations: 78.20 KiB)

julia> @btime pisumcomp(10^4);
  40.094 μs (2 allocations: 78.20 KiB)

julia> @btime pisumgen(10^4);
  37.752 μs (4 allocations: 96 bytes)
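
(As an aside, the temporary array can be avoided entirely by passing the summand function to sum; pisummap is just my own name for this variant:)

# no temporary array: sum applies the function to each element of the
# range as it iterates, so nothing is materialized
pisummap(n) = sqrt(6*sum(k -> 1/k^2, 1:n))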

But here’s what I get on Julia 1.0:

julia> @btime pisumloop(10^4);
  37.752 μs (0 allocations: 0 bytes)

julia> @btime pisumvec(10^4);
  20.778 μs (3 allocations: 78.23 KiB)

julia> @btime pisumcomp(10^4);
  20.778 μs (2 allocations: 78.20 KiB)

julia> @btime pisumgen(10^4);
  37.752 μs (0 allocations: 0 bytes)

What’s going on here? How did pisumvec and pisumcomp become twice as fast in 1.0? And how can they stomp all over pisumloop and pisumgen, which (I thought) should be better?

Another weird thing: all the timings above were done on an old Core i7-4770K, a 4-core machine running at 3.5 GHz. When I tried the same code on an i7-8700K (6 cores at 3.7 GHz, boosting to 4.7 GHz on a single core), all benchmarks finished in only 7-8 μs. Why is there such a huge difference between these machines? From the difference in single-core clock speeds I expected the newer machine to be only about 25% faster.


Change for to @simd for to turn on SIMD optimization for that loop. Without it, the compiler won't vectorize the accumulation, because doing so would reorder the floating-point additions and change the rounding.
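
A minimal sketch of the change (pisumloop_simd is my name for it; the body is otherwise identical to the original):

# @simd permits the compiler to reassociate the floating-point additions
# into s, which is exactly the freedom it needs to vectorize this loop
function pisumloop_simd(n)
    s = 0.0
    @simd for k = 1:n
        s += 1 / k^2
    end
    return sqrt(6*s)
end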

Clock speed has been the least important thing about a CPU for decades now.

Turn on the optimization options -O3 --math-mode=fast when launching Julia (julia -O3 --math-mode=fast) so that automatic SIMD vectorization kicks in; then the results should look like this:

julia> @btime pisumloop(10^4);
  17.448 μs (0 allocations: 0 bytes)

julia> @btime pisumvec(10^4);
  19.501 μs (3 allocations: 78.23 KiB)

julia> @btime pisumcomp(10^4);
  19.244 μs (2 allocations: 78.20 KiB)

julia> @btime pisumgen(10^4);
  17.448 μs (0 allocations: 0 bytes)
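
If you'd rather not change the math mode for the whole session, @fastmath is a scoped alternative (a sketch applying the same idea to just this one loop; pisumloop_fast is my name for it):

# @fastmath relaxes strict IEEE-754 semantics only inside this block,
# letting LLVM reassociate and vectorize the reduction much like
# --math-mode=fast does globally
function pisumloop_fast(n)
    s = 0.0
    @fastmath for k = 1:n
        s += 1 / k^2
    end
    return sqrt(6*s)
end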

Not for pure single-core floating-point benchmarks, which my code examples should be. For example, my old machine completes a 2M Superpi 1.5 benchmark in 22.2 seconds and the new one in 16.9 seconds, which is fairly close to the 25% speedup I expected from raw clock speeds - despite the 5 years and several processor generations between these chips.

But thanks for the SIMD suggestions; I'll try them out. I still find the (untweaked) difference between array comprehensions and generator expressions unintuitive, though. They seem so similar to me.
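
As far as I can tell, the key difference is eager vs. lazy; a minimal sketch of what sum actually receives in each case:

# comprehension: eagerly builds a Vector{Float64}, so sum reduces over
# contiguous memory, which is easy to SIMD-vectorize
v = [1/k^2 for k = 1:10]
sum(v)

# generator: a lazy Base.Generator; sum pulls one element at a time
# through the iteration protocol, with no array behind it
g = (1/k^2 for k = 1:10)
sum(g)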