I have a perfectly parallelizable task that I want to compute using several processes (threads are not applicable here). Say I want to pmap a function f, which takes about 1-10 seconds, over a vector A of 10M elements. The problem is that f needs to read from big read-only structures (say a huge Dict of terms). The question is how to split A into chunks to get the best compromise between balancing the workload and keeping communication and data copying to a minimum.
So far, I have created f as a closure over all structs needed and iterated over A in chunks of 1k elements:
vcat(pmap(f, Base.Iterators.partition(A, 1024))...)
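For reference, here is a self-contained toy version of that setup. It is only a sketch: the small Dict, the worker count, and the body of f are placeholders I made up for illustration, not my real data or function.

```julia
using Distributed

# Assumption: two workers, just for the demo.
nprocs() == 1 && addprocs(2)

# Toy stand-in for the big read-only structure (the real one is huge).
terms = Dict(i => 2i for i in 1:1_000)

# f is a closure over `terms` and processes one whole chunk at a time.
f = let terms = terms
    chunk -> [get(terms, x, 0) for x in chunk]
end

A = collect(1:10_000)

# pmap returns one result vector per chunk, in input order;
# reduce(vcat, ...) concatenates them without splatting many arguments.
result = reduce(vcat, pmap(f, Iterators.partition(A, 1024)))
```

reduce(vcat, ...) gives the same result as vcat(...) with splatting, but avoids splatting a very large number of arguments when there are many chunks.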
I wonder if there is a better solution for this, as it seems that a lot of time is spent on copying the data and on communication. Unfortunately, the relevant section in the docs is quite brief in that regard.
- Are the structures needed for f copied to the target process every time f is called on another partition, or is it done just once at the beginning?
- Wouldn’t, for example, an approach of spawning several processes and sending data back and forth through channels be more efficient?
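To make the second question concrete, this is roughly the channel-based layout I have in mind. It is a minimal sketch under assumptions: worker_loop, the toy Dict built inside it, and the channel sizes are all hypothetical, and I have not measured whether it actually beats pmap.

```julia
using Distributed

# Assumption: two workers, just for the demo.
nprocs() == 1 && addprocs(2)

# Hypothetical worker loop: each worker builds (or loads) its own copy of the
# read-only data once, then serves chunks until the jobs channel is closed.
@everywhere function worker_loop(jobs, results)
    terms = Dict(i => 2i for i in 1:1_000)  # toy stand-in for the huge Dict
    while true
        job = try
            take!(jobs)
        catch
            break  # take! throws once the channel is closed and drained
        end
        idx, chunk = job
        put!(results, (idx, [get(terms, x, 0) for x in chunk]))
    end
end

A = collect(1:10_000)
chunks = [collect(c) for c in Iterators.partition(A, 1024)]

# Both channels live on the master process; workers access them remotely.
jobs    = RemoteChannel(() -> Channel{Tuple{Int,Vector{Int}}}(length(chunks)))
results = RemoteChannel(() -> Channel{Tuple{Int,Vector{Int}}}(length(chunks)))

# Queue every chunk with its index, then close so workers know when to stop.
for (i, c) in enumerate(chunks)
    put!(jobs, (i, c))
end
close(jobs)

for p in workers()
    remote_do(worker_loop, p, jobs, results)
end

# Results arrive in arbitrary order; the index restores the original order.
out = Vector{Vector{Int}}(undef, length(chunks))
for _ in 1:length(chunks)
    i, r = take!(results)
    out[i] = r
end
result = reduce(vcat, out)
```

In this layout only the chunks and their results cross process boundaries, and workers pull new chunks as they finish, so the load balances itself; the cost is that the big structure has to be built or loaded separately on every worker.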
Thanks!