I have a perfectly parallelizable task, which I want to compute using several processes (threads are not applicable here). Say I want to
pmap a function
f, taking about 1-10 seconds over a vector
A of 10M elements. The problem is that
f requires to read from big read-only structures (say a huge
Dict of terms). The question is how to split
A into chunks to get the best compromise of balancing the workload and at the same time not communicating and copying data much.
So far, I have created
f as a closure over all structs needed and iterated over
A in chunks of 1k elements:
vcat(pmap(f, Base.Iterators.partition(A, 1024))
I wonder if there is a better solution for this as it seems that a lot of time is spent with copying the data and communication. Unfortunately, the section in docs is quite brief in that regard.
- Are the structures needed for
fcopied to the target process every time
fis called on another partition or is it done just once at the beginning?
- Wouldn’t for example an approach of spawning several processes and sending data back and forth through channels more efficient?