I would like to parallelize a function, but returning large objects makes pmap extremely slow.
using Distributed
using SharedArrays
addprocs(length(Sys.cpu_info())-1)
@everywhere using Random
@everywhere function foo(i)
    Random.seed!(i)
    big_array = rand(100,100)
    Tuple(big_array) # needs to be a bits type for SharedArrays
end
n = 10
big_shared_array = SharedArray{NTuple{100*100,Float64}}(n);
pmap(1:n) do i
    big_shared_array[i] = foo(i); # this is incredibly slow
end
For the first few minutes, the Julia processes (except for one) simply hang doing nothing, making the code extremely slow; see the figure below (CPU usage on the right).
Am I doing something wrong? If not, is there another option to parallelize my code? I would prefer not to use multi-threading because, except for this specific case, it is slower than pmap.
using BenchmarkTools
using Distributed
using SharedArrays
addprocs(length(Sys.cpu_info())÷2-1)
@everywhere using Random
@everywhere N = 100
@everywhere function foo(i)
    Random.seed!(i)
    rand(N,N)
end
n = 10
@btime array = pmap(1:n) do i
    foo(i)
end
array = Array{Matrix{Float64}}(undef, n);
@btime (for i in 1:n
    array[i] = foo(i)
end)
@btime (Threads.@threads for i in 1:n
    array[i] = foo(i)
end)
The function that I actually want to parallelize is much more demanding than the one in my example, so parallelizing it is faster than not doing so. Also, I tested my code both with threads and with pmap: for some reason, with multithreading the CPU is not fully utilized, resulting in slower code than with pmap (which uses 100% of the CPU).
In extending this function, I wanted to return a big array (like the one in the example). This makes pmap considerably slower than multithreading (possibly because of serialization overhead?). I know I can solve this problem simply by using threads, but I am interested in knowing whether there is a way to keep using multiple processes (which, again, use 100% of the CPU).
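As a rough way to see the transfer cost in isolation (just an idea, not a proper benchmark, using BenchmarkTools and the foo defined above), I could compare fetching a single result from one worker against computing it locally:

using BenchmarkTools, Distributed
@btime remotecall_fetch(foo, 2, 1);  # run foo(1) on worker 2 and ship the 100×100 matrix back
@btime foo(1);                       # run it locally, no serialization involved

If the first line is much slower than the second, the difference would be the cost of serializing the result and sending it between processes.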
Maybe distributed arrays are a better option than shared arrays in this case; however, I do not really understand how they work.
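From my reading of the DistributedArrays.jl README, DArray(init, dims) calls init on each worker with the tuple of index ranges that worker owns, and init returns that worker's local chunk, so each worker would keep its own results and nothing large would be sent back eagerly. The following is only an untested sketch of how I think it would look for my example:

using Distributed
addprocs(length(Sys.cpu_info())÷2-1)
@everywhere using DistributedArrays, Random
@everywhere N = 100
@everywhere function foo(i)
    Random.seed!(i)
    rand(N, N)
end
n = 10
# 1-D distributed array of length n whose elements are N×N matrices;
# each worker computes only the indices it owns and stores them locally.
results = DArray((n,)) do inds
    [foo(i) for i in inds[1]]
end
results[3]  # indexing from the master does move that one matrix over

Accessing individual elements from the master process still transfers them, but only on demand rather than all at once inside pmap.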
using BenchmarkTools
using Distributed
using SharedArrays
addprocs(length(Sys.cpu_info())÷2-1)
@everywhere using Random
@everywhere N = 100
@everywhere function foo(i)
    Random.seed!(i)
    NTuple{N*N,Float64}(rand(N,N))
end
n = 10
@btime (array = pmap(1:n) do i
    foo(i)
end)
println()
So, from what I understand, the large Tuples are the problem. I return a Tuple only because SharedArrays requires bits-type elements. Maybe I should look more into distributed arrays, as they do not require bits-type elements (if I am not mistaken). Thanks for your help.
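One more idea I might try (an untested sketch based on the SharedArray example in the Julia manual, with foo! being an in-place variant of foo that I made up here): keep the element type as plain Float64 and make the SharedArray itself three-dimensional, so each worker writes its slice directly into shared memory and no large Tuple ever has to be built or sent back:

using Distributed, SharedArrays
addprocs(length(Sys.cpu_info())÷2-1)
@everywhere using Random
@everywhere N = 100
@everywhere function foo!(S, i)
    Random.seed!(i)
    S[:, :, i] = rand(N, N)   # write the result straight into shared memory
    nothing
end
n = 10
big_shared_array = SharedArray{Float64}(N, N, n)
@sync @distributed for i in 1:n
    foo!(big_shared_array, i)
end

Since all local workers map the same memory segment, nothing needs to be serialized back to the master; big_shared_array[:, :, i] holds the result of iteration i afterwards.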