I wanted to compare different ways of writing a simple summation; here's the code:
# series with slow convergence to pi
function pisumloop(n)
    s = 0.0
    for k = 1:n
        s += 1 / k^2
    end
    return sqrt(6*s)
end
# vectorization + broadcasting
pisumvec(n) = sqrt(6*sum(1 ./ (1:n).^2))
# array comprehension
pisumcomp(n) = sqrt(6*sum([1/k^2 for k=1:n]))
# generator expression
pisumgen(n) = sqrt(6*sum(1/k^2 for k=1:n))
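For what it's worth, all four variants should agree, since each one sums the Basel series (the sum of 1/k^2 converges to pi^2/6, so sqrt(6*s) approaches pi). A minimal self-contained sanity check (the tolerance is my own rough choice, based on the series' ~3/(pi*n) truncation error):

```julia
# The four variants from above, repeated here so this snippet runs on its own.
function pisumloop(n)
    s = 0.0
    for k = 1:n
        s += 1 / k^2
    end
    return sqrt(6*s)
end
pisumvec(n)  = sqrt(6*sum(1 ./ (1:n).^2))
pisumcomp(n) = sqrt(6*sum([1/k^2 for k=1:n]))
pisumgen(n)  = sqrt(6*sum(1/k^2 for k=1:n))

n = 10^4
vals = [pisumloop(n), pisumvec(n), pisumcomp(n), pisumgen(n)]
# All four should match each other and approximate pi to within ~1e-4.
@assert all(v -> isapprox(v, pi; atol=1e-3), vals)
```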
As you can see from the following benchmark results on Julia 0.6.4, pisumvec and pisumcomp are a bit slower, presumably because they allocate temporary arrays. This is pretty much what I expected, although I thought the difference would be a bit larger.
julia> using BenchmarkTools
julia> @btime pisumloop(10^4);
37.752 μs (0 allocations: 0 bytes)
julia> @btime pisumvec(10^4);
40.094 μs (2 allocations: 78.20 KiB)
julia> @btime pisumcomp(10^4);
40.094 μs (2 allocations: 78.20 KiB)
julia> @btime pisumgen(10^4);
37.752 μs (4 allocations: 96 bytes)
But here’s what I get on Julia 1.0:
julia> @btime pisumloop(10^4);
37.752 μs (0 allocations: 0 bytes)
julia> @btime pisumvec(10^4);
20.778 μs (3 allocations: 78.23 KiB)
julia> @btime pisumcomp(10^4);
20.778 μs (2 allocations: 78.20 KiB)
julia> @btime pisumgen(10^4);
37.752 μs (0 allocations: 0 bytes)
What’s going on here? How did pisumvec and pisumcomp become twice as fast in 1.0? And how can they stomp all over pisumloop and pisumgen, which (I thought) should be better?
Another weird thing. All the timings above were done on an old Core i7-4770K, a 4-core machine running at 3.5 GHz. When I tried the same code on an i7-8700K (6 cores at 3.7 GHz, or a single core at 4.7 GHz), all benchmarks finished in only 7-8 μs. Why is there such a huge difference between these machines? From the difference in single-core clock speeds I expected the newer machine to be only about 25% faster.