Memory problem when using SlurmClusterManager.jl to add workers

I am new to using Distributed.jl on a cluster. I am trying to run jobs on a cluster with multiple nodes. Even though I specify in the batch file that my Julia program should run on a single node, memory does not seem to be shared between the workers.
The problem only arises when I use SlurmClusterManager to add the workers.

Does anyone have an idea what the problem is?

Here is my batch file:

#!/bin/bash
#SBATCH --ntasks=5
#SBATCH --nodes=1
#SBATCH --nodelist=node_01
#SBATCH --cpus-per-task=1
#SBATCH --time=00:04:00
#SBATCH --output=output/example-par-job_%j.out

# Load the Julia module
module purge
module load Julia/1.10.2

# Run the Julia script
julia --threads 1 par_test_script.jl

This is the content of par_test_script.jl. It causes an error because the workers cannot find m and output in memory:

using Distributed, SharedArrays, SlurmClusterManager

# Add local workers
addprocs(SlurmManager())
println("Number of workers: ", nworkers())

@everywhere begin
    using SharedArrays
    
    m = SharedArray{Int}(2)
    m[1] = 1
    m[2] = 2

    function foo(a, m)
        println("Worker ID: $(myid())")
        println(gethostname())
        return sum(a .+ m)
    end
end

N = 10

# Shared array for result collection
output = SharedArray{Int}(N)

@sync @distributed for i in 1:N
    output[i] = foo(i, m)
end

display(output)
println("Finished Julia script")

Here is the result:

UNHANDLED TASK ERROR: On worker 2:
BoundsError: attempt to access 0-element Vector{Int64} at index [1]
Stacktrace:
  [1] setindex!
    @ ./array.jl:1021 [inlined]
  [2] setindex!
    @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/SharedArrays/src/SharedArrays.jl:512
  [3] macro expansion
    @ ~/test_par_prjct/par_test_script.jl:36 [inlined]
  [4] #1
    @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:303
  [5] #178
    @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:83
  [6] #invokelatest#2
    @ ./essentials.jl:892 [inlined]
  [7] invokelatest
    @ ./essentials.jl:889
  [8] #107
    @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:283
  [9] run_work_thunk
    @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
 [10] run_work_thunk
    @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:79
 [11] #100
    @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:88

...and 4 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:448
 [2] macro expansion
   @ ./task.jl:480 [inlined]
 [3] (::Distributed.var"#177#179"{var"#1#2", UnitRange{Int64}})()
   @ Distributed /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:278
ERROR: LoadError: TaskFailedException

    nested task error: On worker 2:
    BoundsError: attempt to access 0-element Vector{Int64} at index [1]
    Stacktrace:
      [1] setindex!
        @ ./array.jl:1021 [inlined]
      [2] setindex!
        @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/SharedArrays/src/SharedArrays.jl:512
      [3] macro expansion
        @ ~/test_par_prjct/par_test_script.jl:36 [inlined]
      [4] #1
        @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:303
      [5] #178
        @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:83
      [6] #invokelatest#2
        @ ./essentials.jl:892 [inlined]
      [7] invokelatest
        @ ./essentials.jl:889
      [8] #107
        @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:283
      [9] run_work_thunk
        @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
     [10] run_work_thunk
        @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:79
     [11] #100
        @ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:88
    
    ...and 4 more exceptions.
    
    Stacktrace:
     [1] sync_end(c::Channel{Any})
       @ Base ./task.jl:448
     [2] macro expansion
       @ ./task.jl:480 [inlined]
     [3] (::Distributed.var"#177#179"{var"#1#2", UnitRange{Int64}})()
       @ Distributed /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:278
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:448
 [2] macro expansion
   @ task.jl:480 [inlined]
 [3] top-level scope
   @ ~/test_par_prjct/par_test_script.jl:478
in expression starting at /test_par_prjct/par_test_script.jl:35
Number of workers: 5
Worker ID: 3
node_01
Worker ID: 2
Worker ID: 4
node_01
node_01
Worker ID: 5
node_01
Worker ID: 6
node_01


This Julia script does not cause an error. It is identical except that the workers are added manually instead of via SlurmManager():

using Distributed, SharedArrays, SlurmClusterManager

# Add local workers
addprocs(5)
println("Number of workers: ", nworkers())

@everywhere begin
    using SharedArrays
    
    m = SharedArray{Int}(2)
    m[1] = 1
    m[2] = 2

    function foo(a, m)
        println("Worker ID: $(myid())")
        println(gethostname())
        return sum(a .+ m)
    end
end

N = 10

# Shared array for result collection
output = SharedArray{Int}(N)

@sync @distributed for i in 1:N
    output[i] = foo(i, m)
end

display(output)
println("Finished Julia script")

Here is the output:

Number of workers: 5
      From worker 6:	Worker ID: 6
      From worker 6:	node_01
      From worker 6:	Worker ID: 6
      From worker 6:	node_01
      From worker 4:	Worker ID: 4
      From worker 4:	node_01
      From worker 4:	Worker ID: 4
      From worker 4:    node_01
      From worker 3:	Worker ID: 3
      From worker 3:	node_01
      From worker 3:	Worker ID: 3
      From worker 3:	node_01
      From worker 2:	Worker ID: 2
      From worker 2:	node_01
      From worker 2:	Worker ID: 2
      From worker 2:	node_01
      From worker 5:	Worker ID: 5
      From worker 5:	node_01
      From worker 5:	Worker ID: 5
      From worker 5:	node_01
10-element SharedVector{Int64}:
  5
  7
  9
 11
 13
 15
 17
 19
 21
 23
Finished Julia script

I can reproduce your exact error. I think the problem is here:

@sync @distributed for i in 1:N
    output[i] = foo(i, m)
end

Removing the output[i] assignment fixed the issue for me.

julia> @sync @distributed for i in 1:N
            foo(i, m)
       end
Worker ID: 3
Worker ID: 3
Worker ID: 4
Worker ID: 4
Worker ID: 6
Worker ID: 6
Worker ID: 2
Worker ID: 2
Worker ID: 5
Worker ID: 5
Task (done) @0x000077c930b7c010

but I am not really sure why.

It seems to be a bug in SharedArrays:

julia> fetch(@spawnat 1 output[1:end])
10-element Vector{Int64}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

julia> fetch(@spawnat 2 output[1:end]) # on the worker, shows 10 elements
10-element Vector{Int64}:
                   0
          8589934616
          4294967297
 1157425241673170952
                   1
          8589934608
          4294967297
  581245895626981384
                   1
                  32

julia> fetch(@spawnat 2 output[3])    # on the worker, but indexing fails. 
ERROR: On worker 2:
BoundsError: attempt to access 0-element Vector{Int64} at index [3]

I also found that the docs specifically say that all processes have to be on the same host, so I am not sure whether SharedArrays is compatible with a multi-node cluster.

Construct a `SharedArray` of a bits type `T` and size `dims` across the
processes specified by `pids` - all of which have to be on the same
host.  If `N` is specified by calling `SharedArray{T,N}(dims)`, then
`N` must match the length of `dims`.
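
A quick sanity check (just a sketch on my side) would be to ask every worker for its hostname and confirm they all report the same node before constructing the SharedArray:

for w in workers()
    # each worker reports its hostname back to the master process
    println(w, " => ", fetch(@spawnat w gethostname()))
end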

True, but I made Slurm run everything on a single node, so all CPUs should have access to the same memory. It should work unless there is some reason other than shared memory. It also works when the workers are created ‘manually’, just not with SlurmClusterManager.

Thanks! But what would the output be written into then? I could store the results inside foo in an output array, but it would be strange if that worked.
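
One thing I might try, if I understand @distributed correctly, is letting the loop collect the results itself with a reducer instead of writing into a shared buffer (just a sketch, not tested on the cluster yet):

results = @distributed (vcat) for i in 1:N
    foo(i, m)   # each iteration's return value is combined with vcat instead of being stored in output
end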

Thanks! Good to know that it is not the distributed loop itself. I will switch to pmap for now, even though SharedArrays would have given me much more flexibility.
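
For reference, the pmap version I have in mind looks roughly like this (a sketch, assuming a plain Vector for m instead of a SharedArray; not yet tested on the cluster):

using Distributed, SlurmClusterManager

addprocs(SlurmManager())

@everywhere function foo(a, m)
    println("Worker ID: $(myid())")
    println(gethostname())
    return sum(a .+ m)
end

m = [1, 2]    # plain array; it is serialized to the workers as part of the closure
N = 10

# pmap returns the results to the master process, so no SharedArray is needed
output = pmap(i -> foo(i, m), 1:N)

display(output)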