How to access the localpart of a distributed array?


I was trying to learn distributed array. And I can distribute the arrays using distribute, however I can’t work out how to get the localparts of arrays.

As a MWE here what I am trying to do

@everywhere using DistributedArrays

a = rand(1:5, 1_000_000)
b= rand(1_000_000)

@time da= distribute(a)
@time db= distribute(a)

sum(da) #these work fine

# I want to something like
@everywhere res = localpart(da) .* localpart(db)

But it says da and db are not defined on the workers, which makes sense, so how can define da and db in the workers?

My actual use cases uses much more complicated algorithms, but I would lilke to work with two or more distributed vectors.

Also just realised, wouldn’t it cause an issue where the vectors are distributed differently on different workers? I need finer control over how I distribute the vectors too then.


Two things:

1- if you just do res=da.*db, then res is a distributed array with its localparts corresponding to what I think you want.
2- the message that you are seeing has to do with @everywhere. from the help:

@everywhere bar=1

will define on all processes.

Unlike @spawn and @spawnat,@everywhere does not capture any local variables. Prefixing @everywhere with @eval allows us to broadcast local variables using interpolation :

  foo = 1
  @eval @everywhere bar=$foo

So if you really wanted res to be the same variable name everywhere, this will do:

@eval @everywhere res=localpart($da).*localpart($db)


The reason why distributed arrays are useful is because we can parallelize operations? Does res=da.*db auto-parallelize? The solution on how to do this is not clear.


Yes and yes, but it needs help.


Yes, as Chris said. Not everything is implemented or optimal but for simple operations it is.
The easiest way to check

@which da.*db

If it is not implemented, falls back on AbstractArray implementation (which can be slow).

In this case you will see that it is implemented and does what you want, efficiently.


Actually my use-case was alot more complicated than .*. I guess I have to learn alot more about distributed arrays before I can do what I want with it.


A more general want to code with them is as follows:

for ip in procs(da)
   @spawnat ip begin
        Do things with local parts or access other parts (entail communication)


How do I refer to da in each of the subprocesses?


If you use the spawnat macro, then just da.

@spawnat procs(da)[2] sum(localpart(da))

Would work, for example. @spawnat interpolates, unlike @everywhere.