I have a perfectly parallelizable task, which I want to compute using several processes (threads are not applicable here). Say I want to pmap
a function f
, taking about 1-10 seconds over a vector A
of 10M elements. The problem is that f
requires to read from big read-only structures (say a huge Dict
of terms). The question is how to split A
into chunks to get the best compromise of balancing the workload and at the same time not communicating and copying data much.
So far, I have created f
as a closure over all structs needed and iterated over A
in chunks of 1k elements:
vcat(pmap(f, Base.Iterators.partition(A, 1024))
I wonder if there is a better solution for this as it seems that a lot of time is spent with copying the data and communication. Unfortunately, the section in docs is quite brief in that regard.
- Are the structures needed for
f
copied to the target process every timef
is called on another partition or is it done just once at the beginning? - Wouldn’t for example an approach of spawning several processes and sending data back and forth through channels more efficient?
Thanks!