I wanted to test different ways of writing a simple summation; here's the code:
# series with slow convergence to pi
function pisumloop(n)
    s = 0.0
    for k = 1:n
        s += 1 / k^2
    end
    return sqrt(6*s)
end
# vectorization + broadcasting
pisumvec(n) = sqrt(6*sum(1 ./ (1:n).^2))
# array comprehension
pisumcomp(n) = sqrt(6*sum([1/k^2 for k=1:n]))
# generator expression
pisumgen(n) = sqrt(6*sum(1/k^2 for k=1:n))
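As a quick sanity check before benchmarking, all four variants should agree up to floating-point rounding and converge to π as n grows. Here is a small self-contained test (the definitions from above are repeated so the snippet runs on its own):

```julia
# Definitions from above, repeated so this snippet runs on its own
function pisumloop(n)
    s = 0.0
    for k = 1:n
        s += 1 / k^2
    end
    return sqrt(6*s)
end
pisumvec(n)  = sqrt(6*sum(1 ./ (1:n).^2))
pisumcomp(n) = sqrt(6*sum([1/k^2 for k=1:n]))
pisumgen(n)  = sqrt(6*sum(1/k^2 for k=1:n))

n = 10^4
vals = [pisumloop(n), pisumvec(n), pisumcomp(n), pisumgen(n)]
# All variants compute the same partial sum, so they should match closely
@assert all(v -> isapprox(v, vals[1]; rtol = 1e-12), vals)
# The truncation error of the series shrinks like O(1/n)
@assert abs(vals[1] - π) < 1e-3
```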
As you can see from the following benchmark results on Julia 0.6.4, pisumvec and pisumcomp are a bit slower, presumably because they allocate temporary arrays. This is pretty much what I expected, although I thought the difference would be a bit larger.
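One way to confirm the temporary-array hypothesis, beyond @btime's allocation counts, is Base's `@allocated` macro. The sketch below (with hypothetical helper names `f_vec` and `f_gen`) assumes the broadcasted expression materializes a length-n Float64 array of roughly n × 8 bytes, while the generator streams values into `sum` without one:

```julia
n = 10^4
f_vec(n) = sum(1 ./ (1:n).^2)        # materializes a temporary array
f_gen(n) = sum(1/k^2 for k = 1:n)    # streams values, no temporary array
f_vec(n); f_gen(n)                   # warm up so compilation isn't counted
temp = @allocated f_vec(n)           # roughly n * 8 bytes ≈ 78 KiB
lazy = @allocated f_gen(n)           # only a few bytes of overhead, if any
@assert temp > 10_000
@assert lazy < 1_000
```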
julia> using BenchmarkTools
julia> @btime pisumloop(10^4);
37.752 μs (0 allocations: 0 bytes)
julia> @btime pisumvec(10^4);
40.094 μs (2 allocations: 78.20 KiB)
julia> @btime pisumcomp(10^4);
40.094 μs (2 allocations: 78.20 KiB)
julia> @btime pisumgen(10^4);
37.752 μs (4 allocations: 96 bytes)
But here’s what I get on Julia 1.0:
julia> @btime pisumloop(10^4);
37.752 μs (0 allocations: 0 bytes)
julia> @btime pisumvec(10^4);
20.778 μs (3 allocations: 78.23 KiB)
julia> @btime pisumcomp(10^4);
20.778 μs (2 allocations: 78.20 KiB)
julia> @btime pisumgen(10^4);
37.752 μs (0 allocations: 0 bytes)
What’s going on here? How did pisumvec and pisumcomp become twice as fast in 1.0? And how can they stomp all over pisumloop and pisumgen, which (I thought) should be better?
Another weird thing. All the timings above were done on an old Core i7-4770K, a 4-core machine running at 3.5 GHz. When I tried the same code on an i7-8700K (6 cores at 3.7 GHz, or a single core at 4.7 GHz), all benchmarks finished in only 7-8 μs. Why is there such a huge difference between these machines? From the difference in single-core clock speeds, I expected the newer machine to be only about 25% faster.