Hi,
julia> N = 10000000; A = randn(N); @time sum(A); @time sum(A[i] for i in 1:N); @time sum(a for a in A)
0.007686 seconds (5 allocations: 176 bytes)
0.640131 seconds (30.02 M allocations: 458.566 MB, 13.64% gc time)
0.039558 seconds (16.76 k allocations: 713.857 KB)
in both 0.5 and 0.6rc. Am I wrong in thinking the second and third versions should also be fast, based on the paragraph “Generators” of Julia 0.5 Highlights?
You should not benchmark in global scope.
using BenchmarkTools
N = 10000000
A = randn(N)
sum1(A) = sum(A)
sum2(A) = sum(A[i] for i in 1:N)
sum3(A) = sum(a for a in A)
then
julia> @benchmark sum1($A)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.576 ms (0.00% GC)
median time: 5.744 ms (0.00% GC)
mean time: 5.672 ms (0.00% GC)
maximum time: 6.901 ms (0.00% GC)
--------------
samples: 881
evals/sample: 1
julia> @benchmark sum2($A)
BenchmarkTools.Trial:
memory estimate: 112 bytes
allocs estimate: 5
--------------
minimum time: 12.764 ms (0.00% GC)
median time: 12.885 ms (0.00% GC)
mean time: 12.972 ms (0.00% GC)
maximum time: 16.242 ms (0.00% GC)
--------------
samples: 386
evals/sample: 1
julia> @benchmark sum3($A)
BenchmarkTools.Trial:
memory estimate: 48 bytes
allocs estimate: 3
--------------
minimum time: 12.761 ms (0.00% GC)
median time: 12.874 ms (0.00% GC)
mean time: 12.941 ms (0.00% GC)
maximum time: 16.360 ms (0.00% GC)
--------------
samples: 387
evals/sample: 1
so the difference is 2-3x. Above using v0.6-rc1
.
1 Like
FWIW sum2
is still using global variables…
1 Like
That’s right, sorry about that! I was extracting the code from a function that had weird performance and forgot about the (very annoying) globals problem. So here’s a more interesting benchmark, closer to what I’m actually doing:
using BenchmarkTools
const N = 10000000
A = randn(N)
sum1(A) = sum(A.*A.+A)
sum2(A) = sum(A.*A+A)
sum3(A) = sum(A[i]*A[i]+A[i] for i in 1:N)
sum4(A) = sum(a*a+a for a in A)
function sum5(A)
res = 0.0
for i = 1:N
res += A[i]*A[i]+A[i]
end
end
BenchmarkTools.DEFAULT_PARAMETERS.samples = 10
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2
@btime sum1($A)
@btime sum2($A)
@btime sum3($A)
@btime sum4($A)
@btime sum5($A)
I get
36.951 ms (2 allocations: 76.29 MiB)
80.213 ms (4 allocations: 152.59 MiB)
12.607 ms (4 allocations: 80 bytes)
12.628 ms (3 allocations: 48 bytes)
4.011 ms (0 allocations: 0 bytes)
I understand the first two timings (they imply a vector allocation), but not the third and fourth. Is any of these timings considered a bug? Does it mean I’m condemned to boring loops?
How fast is your sum5
if you actually return the result?
This is maybe too specialized to generalize to your real problem but you can also compare
sum6(A) = sum(x -> x * x + x, A)
That’s a good point… then the speed difference vanishes. I also included an @inbounds @simd
version of the loop. The final benchmark is
using BenchmarkTools
const N = 10000000
A = randn(N)
sum1(A) = sum(A.*A.+A)
sum2(A) = sum(A.*A+A)
sum3(A) = sum(A[i]*A[i]+A[i] for i in 1:N)
sum4(A) = sum(a*a+a for a in A)
function sum5(A)
res = 0.0
for i = 1:N
res += A[i]*A[i]+A[i]
end
res
end
function sum6(A)
res = 0.0
@inbounds @simd for i = 1:N
res += A[i]*A[i]+A[i]
end
res
end
sum7(A) = sum(x -> x * x + x, A)
BenchmarkTools.DEFAULT_PARAMETERS.samples = 10
BenchmarkTools.DEFAULT_PARAMETERS.seconds = 2
@btime sum1($A)
@btime sum2($A)
@btime sum3($A)
@btime sum4($A)
@btime sum5($A)
@btime sum6($A)
@btime sum7($A)
with result (0.6 rc)
38.080 ms (2 allocations: 76.29 MiB)
84.583 ms (4 allocations: 152.59 MiB)
12.619 ms (4 allocations: 80 bytes)
12.606 ms (3 allocations: 48 bytes)
12.629 ms (0 allocations: 0 bytes)
7.230 ms (0 allocations: 0 bytes)
7.297 ms (0 allocations: 0 bytes)
(neither @inbounds
nor @simd
by itself is able to get the 7ms, both are needed)
Amazingly, sum(x → x * x + x, A) is able to SIMD automatically! I don’t understand how that can happen: looking at the implementation, it falls back to mapfoldl_impl, which is a simple while loop…