Simple Parallel Examples for Embarrassingly Simple Problems

Don't time in global scope; put the code in a function instead, and use @btime from BenchmarkTools.jl for more accurate benchmarking results. The benchmarks above are highly questionable: in global scope, variables are type-unstable, so you are timing the slow path of Julia, and if you run a function only once you are also including the compilation time and the allocations of every function it calls. Threads.@threads is also likely to beat all of the above approaches in this case. Here is a refined set of benchmarks with 8 threads:

# Assumes Julia 1.0+, started with both worker processes and threads available,
# e.g. `JULIA_NUM_THREADS=8 julia -p 8`. (The old `@parallel` is now `@distributed`.)
using Distributed, BenchmarkTools
@everywhere using SharedArrays

function simple_loop_sum()
    M = 1_000_000
    n = Vector{Float64}(undef, M)
    for i in 1:M
        n[i] = log1p(i)
    end
    return sum(n)
end

function sharedarray_parallel_sum()
    M = 1_000_000
    a = SharedArray{Float64}(M)
    @sync @distributed for i in 1:M
        a[i] = log1p(i)
    end
    return sum(a)
end

function pmap_sum()
    M = 1_000_000
    r = pmap(log1p, 1:M, batch_size = ceil(Int, M / nworkers()))
    return sum(r)
end

function sharedarray_mapreduce()
    M = 1_000_000
    a = SharedArray{Float64}(M)
    # the last expression of the loop body (the assigned value) is what gets reduced
    s = @distributed (+) for i in 1:M
        a[i] = log1p(i)
    end
    return s
end

function threads_sum()
    M = 1_000_000
    a = Vector{Float64}(undef, M)
    Threads.@threads for i in 1:M
        a[i] = log1p(i)
    end
    return sum(a)
end

println("\nplain loop: ", simple_loop_sum())
println("\nsharedarray parallel: ", sharedarray_parallel_sum())
println("\npmap: ", pmap_sum())
println("\nsharedarray reducer parallel: ", sharedarray_mapreduce())
println("\nthreads: ", threads_sum())

@btime simple_loop_sum()
# 16.741 ms (2 allocations: 7.63 MiB)
@btime sharedarray_parallel_sum()
# 8.571 ms (2384 allocations: 85.86 KiB)
@btime pmap_sum()
# 4.120 s (7012363 allocations: 181.55 MiB)
@btime sharedarray_mapreduce()
# 7.916 ms (1963 allocations: 122.11 KiB)
@btime threads_sum()
# 4.039 ms (3 allocations: 7.63 MiB)
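To make the global-scope point concrete, here is a minimal sketch (the variable and function names here are my own, for illustration). Timing a loop that reads an untyped global exercises Julia's dynamic path; moving the same loop into a function, and interpolating the global into @btime with $, times only the actual work:

```julia
using BenchmarkTools

x = rand(10^6)          # a non-const global: its type is unknown to the compiler

function loop_sum(v)    # the identical loop inside a function: types are inferable
    s = 0.0
    for el in v
        s += el
    end
    return s
end

# slow path: every access to the global `x` is type-unstable
@btime begin
    s = 0.0
    for el in x
        s += el
    end
    s
end

# fast path: `$` interpolates the value, so only the loop itself is timed
@btime loop_sum($x)
```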

The pmap timing is absurd, and I am not sure exactly why; my guess is that for a function as cheap as log1p, the overhead of serializing arguments and results between processes completely dominates the actual work.
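One way to sidestep pmap's result traffic entirely is the reducer form of the distributed loop (spelled @distributed in current Julia, @parallel in 0.6), which ships back only one partial sum per worker instead of one value per element. A sketch, with a function name of my own choosing:

```julia
using Distributed

# Each worker sums its own chunk locally; only the per-worker partial
# sums travel back to the master process, where (+) combines them.
function distributed_reduce_sum(M)
    return @distributed (+) for i in 1:M
        log1p(i)    # the loop body's last expression is the reduced value
    end
end

distributed_reduce_sum(1_000_000)
```

With no workers added, the loop simply runs on the master process, so the function is still usable in a single-process session.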
