How can I make some data (e.g. an array or a dataframe) available to all workers?
I tried using ParallelDataTransfer.jl but that is not working (see my issue here: https://github.com/ChrisRackauckas/ParallelDataTransfer.jl/issues/16 )
In the example below, assume I want to perform an expensive calculation on vvvec (on several workers in parallel).
using Distributed
using ParallelDataTransfer
addprocs(3)
@everywhere using Random
vvvec=[2,3,1]
sendto(workers(),vvvec=vvvec)
@everywhere vvvec.^2
bump.
Is that a difficult thing to do, or is my question not well formulated?
In my view it is a pretty common use case, e.g. to work on a large piece of data (maybe read from a CSV, or generated by some piece of code) with different parameter settings (maybe model parameters). Thus I often want to share data created on the main worker with all other workers.
I think you are looking for DistributedArrays: https://juliaparallel.github.io/DistributedArrays.jl/latest/
using Distributed
addprocs(3)
@everywhere using DistributedArrays, Distributed
@everywhere f(x) = x * myid()
data = distribute([1,1,1,1])
julia> f.(data)
4-element DArray{Int64,1,Array{Int64,1}}:
2
2
3
4
As you can see, the first two elements are processed on worker 2, and the 3rd and 4th on workers 3 and 4.
Otherwise there’s also SharedArrays.
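A minimal sketch of the SharedArrays route applied to the vvvec example from the question (note SharedArrays only works for worker processes on the same machine, since the data lives in shared memory rather than being copied):

```julia
using Distributed
addprocs(3)

@everywhere using SharedArrays

# A SharedArray lives in shared memory visible to all local workers,
# so no per-worker copies are made:
vvvec = SharedArray{Int}(3)
vvvec .= [2, 3, 1]

# Every worker reads (and writes) the same underlying memory:
result = SharedArray{Int}(3)
@sync @distributed for i in eachindex(vvvec)
    result[i] = vvvec[i]^2
end

collect(result)  # → [4, 9, 1]
```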
I have found the interpolation syntax useful for this:
https://docs.julialang.org/en/v1/stdlib/Distributed/index.html#Distributed.@everywhere
foo = 1
@everywhere bar = $foo
so foo can point to data loaded on your main worker, maybe even unique to its filesystem, and then get loaded everywhere else under bar.
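For example, applying this to vvvec from the question (a minimal sketch; f is just an illustrative name):

```julia
using Distributed
addprocs(3)

# Data created on the main process, e.g. read from a CSV there:
vvvec = [2, 3, 1]

# $-interpolation splices the *value* into the expression sent to the
# workers, so each process ends up with its own global copy of vvvec:
@everywhere vvvec = $vvvec

# A function defined with @everywhere resolves vvvec to the copy on
# whichever worker it runs on:
@everywhere f() = sum(vvvec .^ 2)

[remotecall_fetch(f, w) for w in workers()]  # → [14, 14, 14]
```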