Don’t time in global scope; put the code in a function instead, and use @btime from BenchmarkTools.jl for more accurate benchmarking results. The above benchmarks are highly questionable: in global scope, variables are type-unstable, so you are timing the slow version of Julia. And if you run a function only once, you are also including the compilation time and allocations of the functions used in your code. Threads.@threads is also likely to beat all of the above approaches in this case. Here is a refined set of benchmarks with 8 threads:
using BenchmarkTools
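# Assumed setup, not shown in the original run: @parallel and pmap need worker
# processes, and Threads.@threads needs JULIA_NUM_THREADS=8 set before launching Julia.
addprocs(8)   # presumably; makes nworkers() == 8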
function simple_loop_sum()
    M = 1000000
    n = Vector{Float64}(M)   # uninitialized buffer (Julia 0.6 syntax; Vector{Float64}(undef, M) on 1.0+)
    for i = 1:M
        n[i] = log1p(i)
    end
    return sum(n)
end
function sharedarray_parallel_sum()
    M = 1000000
    a = SharedArray{Float64}(M)
    # fill in parallel across worker processes (@parallel became @distributed in 1.0)
    @sync @parallel for i = 1:M
        a[i] = log1p(i)
    end
    return sum(a)
end
function pmap_sum()
    M = 1000000
    # one large batch per worker to amortize the per-call messaging overhead
    r = pmap(log1p, 1:M, batch_size = ceil(Int, M / nworkers()))
    return sum(r)
end
function sharedarray_mapreduce()
    M = 1000000
    a = SharedArray{Float64}(M)
    # the (+) reducer sums the value of each loop body, i.e. log1p(i)
    s = @parallel (+) for i = 1:M
        a[i] = log1p(i)
    end
    return s
end
function threads_sum()
    M = 1000000
    a = Vector{Float64}(M)
    Threads.@threads for i = 1:M   # shared-memory threads, no worker processes
        a[i] = log1p(i)
    end
    return sum(a)
end
println("\nplain loop: ", simple_loop_sum())
println("\nsharedarray parallel: ", sharedarray_parallel_sum())
println( "\npmap: ", pmap_sum())
println("\nsharedarray reducer parallel: ", sharedarray_mapreduce())
println("\nthreads: ", threads_sum())
@btime simple_loop_sum()
#16.741 ms (2 allocations: 7.63 MiB)
@btime sharedarray_parallel_sum()
#8.571 ms (2384 allocations: 85.86 KiB)
@btime pmap_sum()
#4.120 s (7012363 allocations: 181.55 MiB)
@btime sharedarray_mapreduce()
#7.916 ms (1963 allocations: 122.11 KiB)
@btime threads_sum()
#4.039 ms (3 allocations: 7.63 MiB)
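Note that threads_sum still allocates the full 7.63 MiB buffer only to sum it afterwards. A common refinement is to accumulate per-thread partial sums instead; a minimal sketch of that idea (my own, not benchmarked here):

function threads_partial_sum()
    M = 1000000
    partials = zeros(Threads.nthreads())   # one accumulator per thread
    Threads.@threads for i = 1:M
        partials[Threads.threadid()] += log1p(i)
    end
    # summation order differs, so the result can deviate in the last few bits
    return sum(partials)
end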
The pmap version is absurdly slow, but I am not sure why; my guess is the inter-process communication and the serialization of results back from the workers, which would fit the 181 MiB of allocations.
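Finally, to see how much global scope and first-call compilation distort a naive timing, compare something like this (my own example; f is just a throwaway name):

x = rand(10^6)          # non-constant global

@time sum(log1p, x)     # first call: includes JIT compilation
@time sum(log1p, x)     # second call: one noisy sample, still in global scope

f(v) = sum(log1p, v)    # move the work into a function
@btime f($x)            # many samples; $ interpolates the global away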