Passing arrays on the cluster

Cross-posted from Slack:

Any ideas why the following MWE works locally but not on the cluster?

using ClusterManagers, Distributed
@everywhere using SharedArrays
addprocs(SlurmManager(3), t="00:5:00")

@everywhere function f(x, y, z)
    Nx = size(x, 1)
    Ny = size(y, 1)
    Nz = size(z, 1)
    A = SharedArray{Float64}(Nx, Ny, Nz)

    @sync begin
        @distributed for i = 1:Nx
            for j = 1:Ny
                for k = 1:Nz
                    A[i,j,k] = x[i]^3 + y[j]^3 + z[k]^3
                end
            end
        end
    end
    return A
end

x = SharedArray{Float64}(randn(40))
y = SharedArray{Float64}(randn(30))
z = SharedArray{Float64}(randn(20))

B = f(x, y, z)

I get the following error:

ERROR: LoadError: On worker 2:
BoundsError: attempt to access 0-element Array{Float64,1} at index [1]
getindex at ./array.jl:731 [inlined]
getindex at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/SharedArrays/src/SharedArrays.jl:498 [inlined]
macro expansion at /domus/h1/pmarg/Projects/Test/test.jl:32 [inlined]
#3 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/macros.jl:291
#170 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/macros.jl:43
#109 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/process_messages.jl:265
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/process_messages.jl:56
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/process_messages.jl:65
#102 at ./task.jl:262
#remotecall_fetch#149(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.Worker, ::Distributed.RRID, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:379
remotecall_fetch(::Function, ::Distributed.Worker, ::Distributed.RRID, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:371
#remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64, ::Distributed.RRID, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:392
call_on_owner at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:392 [inlined]
wait at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/remotecall.jl:486 [inlined]
_wait(::Future) at ./task.jl:196
sync_end(::Array{Any,1}) at ./task.jl:216
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/Distributed/src/macros.jl:268 [inlined]
(::getfield(Distributed, Symbol("##169#171")){getfield(Main, Symbol("##3#4")){SharedArray{Float64,1},SharedArray{Float64,1},SharedArray{Float64,1},Int64,Int64,SharedArray{Float64,3}},UnitRange{Int64}})() at ./task.jl:247
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:229
 [2] macro expansion at /domus/h1/pmarg/Projects/Test/test.jl:29 [inlined]
 [3] macro expansion at ./task.jl:247 [inlined]
 [4] f(::SharedArray{Float64,1}, ::SharedArray{Float64,1}, ::SharedArray{Float64,1}) at /domus/h1/pmarg/Projects/Test/test.jl:28
 [5] top-level scope at none:0
 [6] include at ./boot.jl:317 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1038
 [8] include(::Module, ::String) at ./sysimg.jl:29
 [9] exec_options(::Base.JLOptions) at ./client.jl:239
 [10] _start() at ./client.jl:432
in expression starting at /domus/h1/pmarg/Projects/Test/test.jl:49
****************
Starting.......
****************
connecting to worker 1 out of 3
connecting to worker 2 out of 3
connecting to worker 3 out of 3
Job Completed

Slurm does not guarantee that it will allocate all the processes on the same machine! SharedArrays requires the processes to be on the same machine. Take a look at DistributedArrays.jl instead.
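For reference, here is a minimal sketch of what the same fill might look like with DistributedArrays.jl, which works even when the workers sit on different machines. This is untested pseud-MWE rather than a drop-in fix: it assumes the workers have already been added with addprocs, that DistributedArrays.jl is installed everywhere, and the name f_darray plus the plain-Vector inputs are illustrative.

using Distributed
@everywhere using DistributedArrays

function f_darray(x::Vector{Float64}, y::Vector{Float64}, z::Vector{Float64})
    Nx, Ny, Nz = length(x), length(y), length(z)
    # The init function receives the index ranges owned by each worker and
    # returns only that worker's local block; x, y, z are captured by the
    # closure and copied to the workers.
    A = DArray((Nx, Ny, Nz)) do inds
        block = Array{Float64}(undef, map(length, inds))
        for (li, i) in enumerate(inds[1]), (lj, j) in enumerate(inds[2]), (lk, k) in enumerate(inds[3])
            block[li, lj, lk] = x[i]^3 + y[j]^3 + z[k]^3
        end
        block
    end
    return A
end

B = f_darray(randn(40), randn(30), randn(20))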

Thank you for the reply! Just out of curiosity, does this mean that the code above sometimes works and sometimes doesn't, depending on the workload of the cluster?

Yes, that could happen. You can ask Slurm to only give you processes on a single machine, but you lose generality.
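Something along these lines might work (a hedged sketch: ClusterManagers.jl forwards extra keyword arguments to srun, so the exact flag spelling below is an assumption; check the ClusterManagers.jl docs and man srun):

using ClusterManagers, Distributed

# -N 1 asks srun to place all tasks on a single node, so SharedArrays keeps working.
addprocs(SlurmManager(3), t="00:5:00", N="1")

# Sanity check: print the host each worker actually landed on.
for w in workers()
    println("worker ", w, " is on ", remotecall_fetch(gethostname, w))
end

If the hostnames differ, SharedArrays will appear as 0-element arrays on the remote workers, which is exactly the BoundsError above.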