How to avoid recompiling when using a job array on a cluster

Hi all,
I am trying to run MCMC sampling for a model built with Turing.jl. Since different chains can be sampled in parallel using multiple processes, I want to use the following line:

# some lines in my code, model_sim.jl
sample_num = 1000
chain_num = 3
chain = sample(test_model, NUTS(0.65), MCMCDistributed(), sample_num, chain_num)

Since I want to do this for a lot of models that only differ in a few parameter values, I thought about doing this with job arrays on the cluster, with each model as a task, and with 3 CPUs assigned to each task.

# 
#!/bin/bash
#SBATCH --job-name=array-job     # create a short name for the job
#SBATCH --output=slurm-%A.%a.out # stdout file
#SBATCH --error=slurm-%A.%a.err  # stderr file
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=3        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --array=1-10%5            # job array with index values 1 to 10, but max 5 at once

julia model_sim.jl

Although the jobs are parallelized, the code is recompiled each time, thus greatly reducing the benefit of parallelization. Is there a way to avoid recompiling on each new job in the job array aside from pre-compiling my code into a relocatable app?

Use a combination of pmap and ClusterManagers.jl. I organize my code as follows:

# simulation.jl
module MySimulation
   export run_simulation
   function run_simulation(params)
      # long simulation code
   end
end

# run.jl
using Distributed, ClusterManagers
addprocs(SlurmManager(N))             # add other options if needed
@everywhere include("simulation.jl")  # load the module on every worker
@everywhere using .MySimulation
results = pmap(run_simulation, 1:N)   # where N is the number of sims

The first time run_simulation runs on a given worker, it is compiled. Every later call on the same worker reuses the compiled version.

Another benefit is that the results of your simulations are simply stored as an array in results. This makes post-processing much easier (and you can run the post-processing in parallel as well, since the worker processes are already launched).
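
As a minimal sketch of this pattern, the example below swaps SlurmManager for two local workers so it can be tried on a single machine. The body of run_simulation is a placeholder (it only squares its input), and the worker count is an assumption; on a cluster, addprocs(2) would be replaced by the SlurmManager call above.

```julia
using Distributed

# Assumption: two local workers stand in for the Slurm-launched workers.
addprocs(2)

# Define the placeholder simulation on the master and on every worker.
@everywhere function run_simulation(params)
    # the long simulation would go here; squaring is only a placeholder
    return params^2
end

# Each worker compiles run_simulation on its first call and reuses it afterwards.
results = pmap(run_simulation, 1:6)

rmprocs(workers())  # release the workers when done
```

Because pmap returns results in input order, results[i] corresponds to the i-th parameter set regardless of which worker ran it.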

Thank you for sharing your workflow! I think this would be optimal if I am simulating N independent chains.

If my run_simulation function randomly samples parameter values and runs 3 chains, what's a good way to build the dependency between these three chains into the pmap call? Is my only option to sample N parameter values, duplicate each one 3 times (once per chain), and store them for run_simulation to access before running pmap(run_simulation, 1:N)?

Maybe you want to move the pmap around and break up your run_simulation to something like

function run_simulation(N)
   for i = 1:3
      sample_value = rand()
      pmap(x -> run_chain(sample_value), 1:N) # runs N simulations on m cores
   end
end

function run_chain(sample_value)
   # complicated, long work here
end

Does that help? I have a feeling that this is not what you are looking for.

Not quite, thank you though!
What I’m looking for is closer to the following:

function run_simulation(N)
   for model_i in 1:N
      sample_value = rand()

      for chain_i in 1:3
         run_chain(sample_value)
      end
      # check if the 3 independent chains mixed well and other stuff
   end
end

It would be great if both the chain loop and the model loop could be parallelized somehow. I think pmap can't be nested, so that's the main challenge for me at the moment.

I usually use Iterators.product with pmap:

repeats = 3
n = 10
seeds = rand(n)

wrapped_fn(vals) = run_chain(vals[1])
chains = pmap(wrapped_fn, Iterators.product(seeds, 1:repeats))

# chains is n by 3
# can check if the 3 chains in each row are the same in parallel too, up to you

This really helps when doing a grid search over parameters, just like you would with a job array. Hopefully this is helpful.
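
Here is a self-contained sketch of the same idea that can be run locally. The run_chain below is a hypothetical stand-in that just returns its inputs, the local addprocs(2) replaces a Slurm allocation, and passing the chain index through to run_chain is my addition so the layout is visible. The explicit collect/vec/reshape is there only so the n-by-repeats shape does not depend on how pmap handles the product iterator.

```julia
using Distributed

addprocs(2)  # assumption: local workers instead of a Slurm allocation

# Hypothetical stand-in for a real chain: returns its inputs so the layout is visible.
@everywhere run_chain(seed, chain_id) = (seed = seed, chain_id = chain_id)
@everywhere wrapped_fn(vals) = run_chain(vals[1], vals[2])

repeats = 3
n = 4
seeds = rand(n)

# Every (seed, chain_id) pair becomes one independent pmap task.
jobs = collect(Iterators.product(seeds, 1:repeats))        # n-by-repeats matrix of tuples
chains = reshape(pmap(wrapped_fn, vec(jobs)), n, repeats)  # restore the grid shape

# All chains in row i share seeds[i], so per-model checks can go row by row.
same_seed = [all(c.seed == seeds[i] for c in chains[i, :]) for i in 1:n]

rmprocs(workers())
```

Flattening both loops into a single pmap call like this sidesteps the nesting problem: the model loop and the chain loop are parallelized together, and the per-model diagnostics run afterwards on each row.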

Also, you can consider creating a system image for your code (see "Compiling Sysimages" in the Julia in VS Code docs and the PackageCompiler documentation), so that it is already compiled before the job array runs.

Thanks for the tip, passing in seeds as a solution is unexpected but really helpful. Regarding the sysimage, I remember the PackageCompiler docs mentioning that a sysimage built on one machine can't be used on another. So would I need to create the system image on the cluster? Or can I make one on my computer and somehow configure it before passing it to the job array on the cluster?

Great! Glad to have helped.

I haven't used a sysimage on my HPC at all, but since most clusters are designed so that the nodes share the same architecture and a shared filesystem, building it on one node (with something like an interactive Slurm session) should be fine. When you connect to the other workers with addprocs, there is an exeflags keyword argument which may let you load a sysimage. This guide will hopefully have everything you need.

I may try it myself and see how it goes, as I imagine it would be very helpful.
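
For what it's worth, a rough, untested sketch of what that could look like is below. The file names turing_sysimage.so and precompile_script.jl are hypothetical, the package list is an assumption based on the original question, and I am assuming the image is built on a node with the same architecture as the compute nodes.

```julia
# build_sysimage.jl: run once on a cluster node, e.g. in an interactive session
using PackageCompiler

create_sysimage(
    ["Turing"];                                          # packages to bake into the image
    sysimage_path = "turing_sysimage.so",                # hypothetical output path
    precompile_execution_file = "precompile_script.jl",  # hypothetical short script that exercises the model code
)

# run.jl: start workers that load the custom image
using Distributed
addprocs(3; exeflags = "--sysimage=turing_sysimage.so")
```

For the job-array version, the equivalent would be to change the last line of the batch script to something like julia --sysimage turing_sysimage.so model_sim.jl, so each array task starts from the prebuilt image.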