Passing arrays on the cluster

Cross-posted from Slack:

Any ideas why the following MWE works locally but not on the cluster?

using ClusterManagers, Distributed
@everywhere using SharedArrays
addprocs(SlurmManager(3), t="00:5:00")

@everywhere function f(x, y, z)
    Nx = size(x, 1)
    Ny = size(y, 1)
    Nz = size(z, 1)
    A = SharedArray{Float64}(Nx, Ny, Nz)

    @sync begin
        @distributed for i = 1:Nx
            for j = 1:Ny
                for k = 1:Nz
                    A[i,j,k] = x[i]^3 + y[j]^3 + z[k]^3
                end
            end
        end
    end
    return A
end

x = SharedArray{Float64}(randn(40))
y = SharedArray{Float64}(randn(30))
z = SharedArray{Float64}(randn(20))

B = f(x, y, z)

I get the following error:

ERROR: LoadError: On worker 2:
BoundsError: attempt to access 0-element Array{Float64,1} at index [1]
getindex at ./array.jl:731 [inlined]
getindex at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/SharedArrays/src/SharedArrays.jl:498 [inlined]
macro expansion at /domus/h1/pmarg/Projects/Test/test.jl:32 [inlined]
#3 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/macros.jl:291
#170 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/macros.jl:43
#109 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/process_messages.jl:265
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/process_messages.jl:56
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/process_messages.jl:65
#102 at ./task.jl:262
#remotecall_fetch#149(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.Worker, ::Distributed.RRID, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:379
remotecall_fetch(::Function, ::Distributed.Worker, ::Distributed.RRID, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:371
#remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64, ::Distributed.RRID, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:392
call_on_owner at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:392 [inlined]
wait at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:486 [inlined]
_wait(::Future) at ./task.jl:196
sync_end(::Array{Any,1}) at ./task.jl:216
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/macros.jl:268 [inlined]
(::getfield(Distributed, Symbol("##169#171")){getfield(Main, Symbol("##3#4")){SharedArray{Float64,1},SharedArray{Float64,1},SharedArray{Float64,1},Int64,Int64,SharedArray{Float64,3}},UnitRange{Int64}})() at ./task.jl:247
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:229
 [2] macro expansion at /domus/h1/pmarg/Projects/Test/test.jl:29 [inlined]
 [3] macro expansion at ./task.jl:247 [inlined]
 [4] f(::SharedArray{Float64,1}, ::SharedArray{Float64,1}, ::SharedArray{Float64,1}) at /domus/h1/pmarg/Projects/Test/test.jl:28
 [5] top-level scope at none:0
 [6] include at ./boot.jl:317 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1038
 [8] include(::Module, ::String) at ./sysimg.jl:29
 [9] exec_options(::Base.JLOptions) at ./client.jl:239
 [10] _start() at ./client.jl:432
in expression starting at /domus/h1/pmarg/Projects/Test/test.jl:49
****************
Starting.......
****************
connecting to worker 1 out of 3
connecting to worker 2 out of 3
connecting to worker 3 out of 3
Job Completed

Slurm does not guarantee that it will allocate all the processes on the same machine! SharedArrays requires the processes to be on the same machine. Take a look at DistributedArrays.jl instead.
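For reference, here is a minimal sketch of what the same fill might look like with DistributedArrays.jl, which works even when the workers sit on different machines. This is untested pseud-MWE rather than a drop-in fix: it assumes the workers have already been added with addprocs, that DistributedArrays.jl is installed everywhere, and the name f_darray plus the plain-Vector inputs are illustrative.

using Distributed
@everywhere using DistributedArrays

function f_darray(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    Nx, Ny, Nz = length(x), length(y), length(z)
    # The init function receives the index ranges owned by each worker and
    # returns only that worker's local block; x, y, z are captured by the
    # closure and copied to the workers.
    A = DArray((Nx, Ny, Nz)) do inds
        block = Array{Float64}(undef, map(length, inds))
        for (li, i) in enumerate(inds[1]), (lj, j) in enumerate(inds[2]), (lk, k) in enumerate(inds[3])
            block[li, lj, lk] = x[i]^3 + y[j]^3 + z[k]^3
        end
        block
    end
    return A
end

B = f_darray(randn(40), randn(30), randn(20))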

Thank you for the reply! Just out of curiosity, does this mean that the code above sometimes works and sometimes doesn't, depending on the workload of the cluster?

Yes, that could happen. You can ask Slurm to only give you processes on a single machine, but you lose generality.
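Something along these lines might work (a hedged sketch: ClusterManagers.jl forwards extra keyword arguments to srun, so the exact flag spelling below is an assumption; check the ClusterManagers.jl docs and man srun):

using ClusterManagers, Distributed

# -N 1 asks srun to place all tasks on a single node, so SharedArrays keeps working.
addprocs(SlurmManager(3), t="00:5:00", N="1")

# Sanity check: print the host each worker actually landed on.
for w in workers()
    println("worker ", w, " is on ", remotecall_fetch(gethostname, w))
end

If the hostnames differ, SharedArrays will appear as 0-element arrays on the remote workers, which is exactly the BoundsError above.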