Distributed generation of a DArray

parallel

#1

I have generated arrays on separate processes, and I cannot collect them into a single array on the master process because the combined array does not fit in memory. Is there a way to generate the DArray directly, without first materializing it on the master process?


#2

It is described in https://github.com/JuliaParallel/DistributedArrays.jl#constructing-distributed-arrays. How did you construct the arrays on the workers?


#3

Minimal example of what I am trying, given my current understanding of the README:

addprocs()
@everywhere begin
  N = 10
  srand(myid())
  r = rand(N)
end
using DistributedArrays
dfill(r, ((N for i in 1:nprocs())...,))

But dfill doesn’t grab the local part from each process. And what are you supposed to do if the array doesn’t have the same name on every process?


#4

You should try to avoid defining variables on the workers with @everywhere. The best approach is to define the local parts with a closure, i.e. something like

DArray((N*nworkers(),)) do I
    srand(myid())
    rand(N) # or rand(length(I[1])), sizing the local part from its index range
end
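As a concrete sketch of this approach (in current Julia syntax, where Random.seed! replaces the older srand; the variable names are illustrative): the closure receives a tuple of index ranges I describing the block this worker owns, and it captures N from the master, which is exactly why no @everywhere definition is needed.

```julia
using Distributed
addprocs(4)
@everywhere using DistributedArrays, Random

N = 10  # captured by the closure and shipped to each worker automatically
d = DArray((N * nworkers(),)) do I
    # I is a tuple of index ranges, e.g. (1:10,) on the first worker;
    # each worker builds only its own block, never the full array
    Random.seed!(myid())
    rand(length(I[1]))
end

size(d)  # (40,)
```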

Alternatively, you can create the local parts and keep a reference to them with a Future and then construct a DArray like

rAs = [@spawnat p rand(N) for p in workers()]
4-element Array{Future,1}:
 Future(2, 1, 382, #NULL)
 Future(3, 1, 383, #NULL)
 Future(4, 1, 384, #NULL)
 Future(5, 1, 385, #NULL)

DArray(rAs)
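Put together, the Future-based pattern looks like this (again in current syntax; the per-worker seeding mirrors the original example, and only lightweight Future handles ever travel back to the master):

```julia
using Distributed
addprocs(4)
@everywhere using DistributedArrays, Random

N = 10
# each worker generates and keeps its own chunk; the master holds only Futures
rAs = [@spawnat p (Random.seed!(myid()); rand(N)) for p in workers()]
d = DArray(rAs)  # stitch the remote chunks into one distributed array
size(d)          # (40,)
```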

#5

Thanks. Those are definitely better ways of handling it.

Is the DArray constructor able to handle the same stuff as pmap? For example, does it need to use a CachingPool for large anonymous functions? Is it essentially a batch_size = N/nprocs() kind of pmap?


#6

Is it essentially a batch_size = N/nprocs() kind of pmap?

Yes, I think that is the right way to think of it. The closure is executed once on each of the workers specified (all of them if none are specified), so in that sense the batch size is N/nworkers(). Note that nworkers() == nprocs() - 1 once workers have been added; with no workers, nworkers() == nprocs() == 1.
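That relationship between nprocs() and nworkers() can be checked directly in a fresh session (a tiny sketch):

```julia
using Distributed

# fresh session: the master doubles as the lone worker
@assert nprocs() == 1 && nworkers() == 1

addprocs(2)

# with workers present, the master is no longer counted as a worker
@assert nprocs() == 3 && nworkers() == 2
```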