Troubleshooting an Unhandled Error in Parallel Computation on a Slurm Cluster


Hello Julia community! I’m having trouble distributing a computation across multiple nodes of a Slurm cluster. I’m using Distributed.jl, SlurmClusterManager.jl, and SharedArrays.jl to parallelize my code.

Part 1: Setting Up Environment and Parameters

# Load environment on main process
import Pkg
pkgdir = dirname(@__FILE__)
Pkg.activate(pkgdir * "/DefaultEnvironment/")
Pkg.instantiate()
Pkg.precompile()

using Distributed, SlurmClusterManager, SharedArrays, ProgressMeter, JLD

addprocs(SlurmManager()) # Adding worker processes (one per Slurm task in the allocation), not threads

# Load environment on worker processes
@everywhere import Pkg
@everywhere pkgdir = dirname(@__FILE__) # Embed package dir in expression  
@everywhere Pkg.activate(pkgdir * "/DefaultEnvironment/")

# Function for instantiating the environment on each worker; retry a limited number of times (to avoid a race condition)
@everywhere function Instantiating(; attempts::Integer=10)
    for attempt in 1:attempts
        try
            Pkg.instantiate()
            Pkg.precompile()
            return nothing
        catch e
            attempt == attempts && rethrow() # Give up loudly instead of retrying forever
            sleep(rand())
        end
    end
end

@everywhere Instantiating()

@everywhere include("GeneralMods.jl") # Loading the Modules that are used in the computation
@everywhere DataDir = "../../Data/" # Directory for storing the data

println("Number of workers: ", nworkers()) # Printing the number of workers

flush(stdout)


####################################### Defining the functions #######################################

@everywhere InvasionFitness(θ::AbstractVector;
    Sᵢ::AbstractFloat=0.8, b::AbstractFloat=0.4, r::AbstractFloat=12.0,
    β::Function=((t; λ::AbstractFloat=0.7, β✶::AbstractFloat=100.0) -> t % 1.0 < λ ? β✶ : 0.0),
    ModelODE::Function=ModelODE, Ncs::AbstractVector=Ncs) = begin # Calculating fitness of a monomorphic population
    Pars = GeneralSystem.Parameters(Sᵢ, b, r, β, θ, ModelODE)
    GeneralSystem.InvasionFitness(Ncs, Pars)
end

@everywhere function LogInFile(file::String, str::String) # Function for logging the output to a file
    flush(stdout) # Flushing the output
    open(file, "a") do f
        write(f, str)
        write(f, "\n")
    end
end

@everywhere function worker!(idx::CartesianIndex) # Define a function that will be called by each worker process
    i, j = Tuple(idx) # Unpacking the index
    @inbounds try
        InvasionMatrix[i, j] = InvasionFitness([θs[i], θs[j]]) # Calculating the invasion fitness
    catch e
        InvasionMatrix[i, j] = NaN
    end
end

################################ Calculating Invasion Matrix parallel ###############################

@everywhere Sᵢ, λ, b, β, r = 0.8, 0.7, 0.4, 100.0, 12.0 # Initial values of the parameters
@everywhere Ncs = [3, 3] # Vector of number of components
@everywhere ModelODEParse = GeneralSystem.ODECreator(Ncs) # Getting the ODE of the model as a string
@everywhere ModelODE = eval(ModelODEParse) # Evaluating the ODE of the model

@everywhere θs = LinRange(0.01, 1.5, 500) # Traits vector
InvasionMatrix = SharedArray{Float64}(length(θs), length(θs)) # SharedArray for storing the invasion matrix

This part runs without errors; I have already debugged it. The only unclear piece might be the GeneralMods.jl file. It contains a mutable struct that stores the parameters required by the computation, along with the functions that perform the numerical work. Previously I parallelized this with Base.Threads and everything worked, so I believe GeneralMods.jl itself is not the source of the problem.
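For context, the main difference I am aware of between Base.Threads and Distributed is that threads share memory, while each worker process gets its own copy of any variable the loop body captures. Below is a minimal sketch of that behaviour, independent of my code. It assumes two local workers standing in for the ones that SlurmManager() would add.

```julia
using Distributed

# Assumption: two local workers stand in for the Slurm-launched workers.
nprocs() == 1 && addprocs(2)

a = zeros(4)

# Each worker receives its own copy of `a`, so these writes
# never reach the array that lives on the main process.
@sync @distributed for i in 1:4
    a[i] = i
end

println(a) # the copy on the main process is still all zeros
```

This is why I switched to a SharedArray for the result matrix instead of the plain Array that I used with threads.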

Part 2: Parallel Loop with Unhandled Error

LogInFile("MonomorphicIF-LLC.log", "Starting the calculation of invasion matrix") 

@showprogress @distributed for idx ∈ CartesianIndices(InvasionMatrix) # Looping over the resident traits
    i, j = Tuple(idx) # Unpacking the index, needed for the log message below
    worker!(idx) # Calling the worker function
    flush(stdout)
    LogInFile("MonomorphicIF-LLC.log", "θ = [$(θs[i]), $(θs[j])], Invasion Fitness = $(InvasionMatrix[i, j])") # Logging the output to a file
end

InvasionMatrix = Array(InvasionMatrix) # Converting SharedArray to Array

LogInFile("MonomorphicIF-LLC.log", "Finished the calculation of invasion matrix")

Everything works up to this point. When execution reaches the loop, nothing happens at all: no error is raised, nothing is logged, and no progress bar appears. The job stays in this state until the process is terminated due to timeout.
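Two things I am unsure about, as far as I understand the documentation: `@distributed` without a reducer returns immediately unless it is prefixed with `@sync`, and a SharedArray is only shared between processes on the same host. An alternative I am considering is to return each value through `pmap` rather than writing into shared memory. The sketch below is not my real computation: `toy_fitness` and `safe_fitness` are hypothetical stand-ins for InvasionFitness and worker!, and two local workers replace the Slurm workers.

```julia
using Distributed

# Assumption: two local workers stand in for the Slurm-launched workers.
nprocs() == 1 && addprocs(2)

# Hypothetical stand-in for InvasionFitness; it throws for equal traits on purpose.
@everywhere function toy_fitness(x, y)
    x == y && error("degenerate trait pair")
    return x - y
end

# Same try/catch idea as worker!, but the value is returned instead of stored.
@everywhere function safe_fitness(idx::CartesianIndex{2}, θ)
    i, j = Tuple(idx)
    try
        return toy_fitness(θ[i], θ[j])
    catch
        return NaN
    end
end

θs = LinRange(0.01, 1.5, 4)
indices = CartesianIndices((length(θs), length(θs)))

# pmap blocks until every result has been sent back to the main process.
flat = let θ = θs
    pmap(idx -> safe_fitness(idx, θ), vec(collect(indices)))
end
toy_matrix = reshape(flat, size(indices))
```

Because the results travel back as return values, this pattern should not depend on which node a worker runs on. I have not yet confirmed whether it resolves the hang in my actual script.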

Part 3: Saving Computed Data

######################################## Saving the new data ########################################

MonomorphicData = Dict("InvasionMatrix" => InvasionMatrix, "Theta" => θs, "Compartments" => [3, 3]) # Creating a dictionary of the data
save(DataDir * "MonomorphicIF-LLC.jld", MonomorphicData) # Saving the data

LogInFile("MonomorphicIF-LLC.log", "Saved the data")

exit() # Exiting the Julia session

If anyone has experience with parallel computing on Slurm, or ideas on how to troubleshoot a parallel loop that neither progresses nor throws an error, I’d greatly appreciate your help.

Additional Details:

  • GeneralMods.jl contains a mutable struct and several functions used in the computation. It worked without problems in a previous implementation based on Base.Threads.
  • Working examples of Distributed.jl on a Slurm cluster would be especially helpful.

Feel free to share your thoughts, suggestions, or examples. Thank you!
