I am new to using Distributed.jl on a cluster with multiple nodes. Even though my batch file restricts the job to a single node, memory does not seem to be shared between the workers. The problem only arises when I use SlurmClusterManager to add the workers. Does anyone have an idea what is going wrong?
Here is my batch file:
#!/bin/bash
#SBATCH --ntasks=5
#SBATCH --nodes=1
#SBATCH --nodelist=node_01
#SBATCH --cpus-per-task=1
#SBATCH --time=00:04:00
#SBATCH --output=output/example-par-job_%j.out
# Load the Julia module
module purge
module load Julia/1.10.2
# Run the Julia script
julia --threads 1 par_test_script.jl
This is the content of par_test_script.jl. It errors because the workers apparently cannot access m and output in shared memory:
using Distributed, SharedArrays, SlurmClusterManager
# Add local workers
addprocs(SlurmManager())
println("Number of workers: ", nworkers())
@everywhere begin
    using SharedArrays
    m = SharedArray{Int}(2)
    m[1] = 1
    m[2] = 2
    function foo(a, m)
        println("Worker ID: $(myid())")
        println(gethostname())
        return sum(a .+ m)
    end
end
N = 10
# Shared array for result collection
output = SharedArray{Int}(N)
@sync @distributed for i in 1:N
    output[i] = foo(i, m)
end
display(output)
println("Finished Julia script")
Here is the result:
UNHANDLED TASK ERROR: On worker 2:
BoundsError: attempt to access 0-element Vector{Int64} at index [1]
Stacktrace:
[1] setindex!
@ ./array.jl:1021 [inlined]
[2] setindex!
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/SharedArrays/src/SharedArrays.jl:512
[3] macro expansion
@ ~/test_par_prjct/par_test_script.jl:36 [inlined]
[4] #1
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:303
[5] #178
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:83
[6] #invokelatest#2
@ ./essentials.jl:892 [inlined]
[7] invokelatest
@ ./essentials.jl:889
[8] #107
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:283
[9] run_work_thunk
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
[10] run_work_thunk
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:79
[11] #100
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:88
...and 4 more exceptions.
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:448
[2] macro expansion
@ ./task.jl:480 [inlined]
[3] (::Distributed.var"#177#179"{var"#1#2", UnitRange{Int64}})()
@ Distributed /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:278
ERROR: LoadError: TaskFailedException
nested task error: On worker 2:
BoundsError: attempt to access 0-element Vector{Int64} at index [1]
Stacktrace:
[1] setindex!
@ ./array.jl:1021 [inlined]
[2] setindex!
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/SharedArrays/src/SharedArrays.jl:512
[3] macro expansion
@ ~/test_par_prjct/par_test_script.jl:36 [inlined]
[4] #1
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:303
[5] #178
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:83
[6] #invokelatest#2
@ ./essentials.jl:892 [inlined]
[7] invokelatest
@ ./essentials.jl:889
[8] #107
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:283
[9] run_work_thunk
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:70
[10] run_work_thunk
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:79
[11] #100
@ /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/process_messages.jl:88
...and 4 more exceptions.
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:448
[2] macro expansion
@ ./task.jl:480 [inlined]
[3] (::Distributed.var"#177#179"{var"#1#2", UnitRange{Int64}})()
@ Distributed /software/spack/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-13.2.0/julia-1.10.2-4md6o2sitswrvm6wlfiaa4llylglc2rq/share/julia/stdlib/v1.10/Distributed/src/macros.jl:278
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:448
[2] macro expansion
@ task.jl:480 [inlined]
[3] top-level scope
@ ~/test_par_prjct/par_test_script.jl:478
in expression starting at /test_par_prjct/par_test_script.jl:35
Number of workers: 5
Worker ID: 3
node_01
Worker ID: 2
Worker ID: 4
node_01
node_01
Worker ID: 5
node_01
Worker ID: 6
node_01
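To narrow this down, here is a small diagnostic sketch I could run instead (a hypothetical test case, not my actual job script; it assumes `procs(::SharedArray)` and `sdata` behave as documented in SharedArrays). It checks whether each worker actually mapped the shared segment; a worker that failed to map it sees a 0-element vector, which matches the BoundsError in the trace above:

```julia
using Distributed, SharedArrays

# Add local workers so the mapping should succeed; with SlurmManager()
# the interesting question is whether this still holds.
addprocs(2)
@everywhere using SharedArrays

S = SharedArray{Int}(4)
println(procs(S))  # pids that have the shared segment mapped

# Ask each worker how long its locally mapped data is. A worker that
# did not map the segment would report 0 instead of 4.
for p in workers()
    println(p, " => ", remotecall_fetch(S -> length(sdata(S)), p, S))
end
```

If this reports 0 for the Slurm-launched workers, that would confirm the segment is not being shared even though all workers sit on node_01.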
This Julia script does not cause an error. It is identical except that the workers are added manually with addprocs(5) instead of SlurmManager():
using Distributed, SharedArrays, SlurmClusterManager
# Add local workers
addprocs(5)
println("Number of workers: ", nworkers())
@everywhere begin
    using SharedArrays
    m = SharedArray{Int}(2)
    m[1] = 1
    m[2] = 2
    function foo(a, m)
        println("Worker ID: $(myid())")
        println(gethostname())
        return sum(a .+ m)
    end
end
N = 10
# Shared array for result collection
output = SharedArray{Int}(N)
@sync @distributed for i in 1:N
    output[i] = foo(i, m)
end
display(output)
println("Finished Julia script")
Here is the output:
Number of workers: 5
From worker 6: Worker ID: 6
From worker 6: node_01
From worker 6: Worker ID: 6
From worker 6: node_01
From worker 4: Worker ID: 4
From worker 4: node_01
From worker 4: Worker ID: 4
From worker 4: node_01
From worker 3: Worker ID: 3
From worker 3: node_01
From worker 3: Worker ID: 3
From worker 3: node_01
From worker 2: Worker ID: 2
From worker 2: node_01
From worker 2: Worker ID: 2
From worker 2: node_01
From worker 5: Worker ID: 5
From worker 5: node_01
From worker 5: Worker ID: 5
From worker 5: node_01
10-element SharedVector{Int64}:
5
7
9
11
13
15
17
19
21
23
Finished Julia script