Hi all,
I am trying to run MCMC sampling for a model built with Turing.jl. Since different chains can be sampled in parallel using multiple processes, I want to use the following line:
# some lines in my code, model_sim.jl
sample_num = 1000
chain_num = 3
chain = sample(test_model, NUTS(0.65), MCMCDistributed(), sample_num, chain_num)
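(For MCMCDistributed() to actually run the chains on separate processes, worker processes have to be added and Turing loaded on them before calling sample; the setup looks roughly like this, with the worker count purely illustrative:)
# sketch of the distributed setup MCMCDistributed() relies on; worker count is illustrative
using Distributed
addprocs(3)                # e.g. one worker per chain
@everywhere using Turing   # Turing (and the model definition) must be available on every worker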
Since I want to do this for a lot of models that only differ in a few parameter values, I thought about doing this with job arrays on the cluster, with each model as a task, and with 3 CPUs assigned to each task.
#!/bin/bash
#SBATCH --job-name=array-job # create a short name for the job
#SBATCH --output=slurm-%A.%a.out # stdout file
#SBATCH --error=slurm-%A.%a.err # stderr file
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-task=3 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --array=1-10%5 # job array with index values 1 to 10, but max 5 at once
julia model_sim.jl
Although the jobs are parallelized, the code is recompiled each time, thus greatly reducing the benefit of parallelization. Is there a way to avoid recompiling on each new job in the job array aside from pre-compiling my code into a relocatable app?
Use a combination of pmap and ClusterManagers.jl. The way I organize my code is as follows:
# simulation.jl
module MySimulation

export run_simulation

function run_simulation(params)
    # long simulation code
end

end
# run.jl
using Distributed, ClusterManagers

addprocs(SlurmManager(N))              # add other options if needed
@everywhere include("simulation.jl")   # make the simulation code available on every worker
@everywhere using .MySimulation

results = pmap(run_simulation, 1:N)    # where N is the number of sims
The first time run_simulation is called on a worker process, it will compile. The next time it runs on that same worker, it will use the already-compiled version.
Another benefit of this is that the results of your simulations are simply stored as an array in results. This makes post-processing much easier (and you can actually make the post-processing parallel as well, since the worker processes are already launched).
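As a sketch of that last point (summarize here is a hypothetical per-result statistic, not something defined above):
# post-processing in parallel, reusing the already-launched workers
@everywhere using Statistics
@everywhere summarize(result) = mean(result)   # hypothetical; replace with whatever statistic you need
summaries = pmap(summarize, results)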
Thank you for sharing your workflow! I think this would be optimal if I am simulating N independent chains.
If my run_simulation function randomly samples parameter values and runs 3 chains, what's a good way to build the dependency between these three chains into the pmap call? Is my only option to sample N parameter values, duplicate each one 3 times (once per chain), and store them for run_simulation to access later, prior to running pmap(run_simulation, 1:N)?
Maybe you want to move the pmap around and break up your run_simulation into something like
function run_simulation(N)
    for i in 1:3
        sample_value = rand()
        pmap(x -> run_chain(sample_value), 1:N)   # runs N simulations on the available workers
    end
end

function run_chain(sample_value)
    # complicated, long work here
end
Does that help? I have a feeling that this is not what you are looking for.
Not quite, thank you though!
What I’m looking for is closer to the following:
function run_simulation(N)
    for model_i in 1:N
        sample_value = rand()
        for chain_i in 1:3
            run_chain(sample_value)
        end
        # check if the 3 independent chains mixed well and other stuff
    end
end
It would be great if both the chain loop and the model loop could be parallelized somehow. I think pmap calls can't be nested, so that's the main challenge for me at the moment.
repeats = 3
n = 10
seeds = rand(n)
wrapped_fn(vals) = run_chain(vals[1])
chains = pmap(wrapped_fn, Iterators.product(seeds, 1:repeats))
# chains is n by 3
# can check if the 3 chains in each row are the same in parallel too, up to you
This also really helps for doing a grid search over parameters, just like you would with a job array. Hopefully this is helpful.
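For instance, a two-parameter grid could be set up the same way (alpha_vals, beta_vals, and the two-argument run_chain method here are purely illustrative):
# hypothetical parameter grid; names, ranges, and run_chain(alpha, beta) are illustrative
alpha_vals = 0.1:0.1:1.0
beta_vals = [1, 2, 5]
grid_fn(p) = run_chain(p...)
results = pmap(grid_fn, Iterators.product(alpha_vals, beta_vals))
# one entry per (alpha, beta) combination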
Thanks for the tip; passing in seeds as a solution is unexpected but really helpful. Regarding the sysimage, I remember the PackageCompiler.jl docs mention that a sysimage built on one machine can't be used on another? So would I need to create the system image on the cluster, or can I make one on my computer and somehow configure it before passing it to the job array on the cluster?
I haven't used a sysimage on my HPC at all, but since most clusters have the same architecture across nodes and a shared filesystem, building it on one node (e.g., in an interactive Slurm session) should be fine. When you connect to the other workers with addprocs, there is an exeflags keyword argument that should let you load the sysimage on them as well. This guide will hopefully have everything you need.
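Something along these lines (the sysimage path is just a placeholder):
using Distributed, ClusterManagers
# start the workers with the same custom sysimage; the path is a placeholder
addprocs(SlurmManager(N); exeflags = "--sysimage=/path/to/custom_sysimage.so")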
I may try it myself and see how it goes, as I imagine it would be very helpful.