I wanted to compare different ways of writing a simple summation; here's the code:
# series with slow convergence to pi
function pisumloop(n)
    s = 0.0
    for k = 1:n
        s += 1 / k^2
    end
    return sqrt(6*s)
end
# vectorization + broadcasting
pisumvec(n) = sqrt(6*sum(1 ./ (1:n).^2))
# array comprehension
pisumcomp(n) = sqrt(6*sum([1/k^2 for k=1:n]))
# generator expression
pisumgen(n) = sqrt(6*sum(1/k^2 for k=1:n))
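For what it's worth, all four variants should agree, since each one sums the Basel series (the sum of 1/k^2 converges to pi^2/6, so sqrt(6*s) approaches pi). A minimal self-contained sanity check (the tolerance is my own rough choice, based on the series' ~3/(pi*n) truncation error):

```julia
# The four variants from above, repeated here so this snippet runs on its own.
function pisumloop(n)
    s = 0.0
    for k = 1:n
        s += 1 / k^2
    end
    return sqrt(6*s)
end
pisumvec(n)  = sqrt(6*sum(1 ./ (1:n).^2))
pisumcomp(n) = sqrt(6*sum([1/k^2 for k=1:n]))
pisumgen(n)  = sqrt(6*sum(1/k^2 for k=1:n))

n = 10^4
vals = [pisumloop(n), pisumvec(n), pisumcomp(n), pisumgen(n)]
# All four should match each other and approximate pi to within ~1e-4.
@assert all(v -> isapprox(v, pi; atol=1e-3), vals)
```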
As you can see from the following benchmark results on Julia 0.6.4, pisumvec and pisumcomp are a bit slower, presumably because they allocate temporary arrays. This is pretty much what I expected, although I thought the difference would be a bit larger.
julia> using BenchmarkTools
julia> @btime pisumloop(10^4);
37.752 μs (0 allocations: 0 bytes)
julia> @btime pisumvec(10^4);
40.094 μs (2 allocations: 78.20 KiB)
julia> @btime pisumcomp(10^4);
40.094 μs (2 allocations: 78.20 KiB)
julia> @btime pisumgen(10^4);
37.752 μs (4 allocations: 96 bytes)
But here’s what I get on Julia 1.0:
julia> @btime pisumloop(10^4);
37.752 μs (0 allocations: 0 bytes)
julia> @btime pisumvec(10^4);
20.778 μs (3 allocations: 78.23 KiB)
julia> @btime pisumcomp(10^4);
20.778 μs (2 allocations: 78.20 KiB)
julia> @btime pisumgen(10^4);
37.752 μs (0 allocations: 0 bytes)
What’s going on here? How did pisumvec and pisumcomp become twice as fast in 1.0? And how can they stomp all over pisumloop and pisumgen, which (I thought) should be better?
Another weird thing. All the timings above were done on an old Core i7-4770K, a 4-core machine running at 3.5 GHz. When I tried the same code on an i7-8700K (6 cores at 3.7 GHz, or a single core at 4.7 GHz), all benchmarks finished in only 7-8 μs. Why is there such a huge difference between these machines? From the difference in single-core clock speeds I expected the newer machine to be only about 25% faster.