I'm having some issues saving the output of Julia simulations (stochastic differential equations) to JSON files on a SLURM cluster. The cluster manager and I have not been able to figure this out. My original program runs many simulations and saves the output to JSON files in a specified directory. When that program failed to save the JSON files anywhere, I wrote a minimal working example of a program that saves JSON files to a directory on the SLURM cluster:
```julia
using Distributed

# Add some workers
addprocs(25);

@everywhere begin
    using JSON
    path = "/path/to/save/directory/";
end

# Save JSON files in a parallel for-loop
@distributed for i = 1:10
    # Data
    data = [1, 2, 3, 4];
    # Stringify
    string_data = JSON.json(data);
    # Save path
    new_path = path * "$i.json";
    # Write JSON
    open(new_path, "w") do f
        write(f, string_data);
    end
end

# Kill the workers
for j in workers()
    rmprocs(j);
end
```
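As a sanity check after the MWE runs, I list the save directory to confirm the files actually landed. Here is the idea, with a temporary directory standing in for the real save path (the path in the MWE is a placeholder anyway):

```julia
# Sanity check: count the JSON files in the save directory.
# A temporary directory stands in for the real save path here.
path = mktempdir() * "/"

# Stand-in for one of the MWE's output files
open(path * "1.json", "w") do f
    write(f, "[1,2,3,4]")
end

written = filter(f -> endswith(f, ".json"), readdir(path))
println(length(written), " JSON file(s) found")  # -> 1 JSON file(s) found
```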
When I run this in the cluster as a batch job, it saves the JSON files to the desired directory no problem. My actual program looks something like the one below, and as you can see, with a few additions for the simulation component, it is very similar to the minimal working example:
```julia
# File name: run_big_simulation.jl

using Distributed

# Add some workers
addprocs(25);

@everywhere begin
    using Random, Distributions
    using IterTools
    using JSON

    # -------------------------------- #
    # Define a bunch of functions here #
    # -------------------------------- #

    # -------------------------- #
    # Assign some variables here #
    # -------------------------- #

    # Cartesian product of intervals for each parameter
    # Defines the total parameter set to simulate
    param_sets = product(x1, x2, ...);

    # Path to directory to save JSON files to
    path = "/path/to/save/directory/";
end

# Run 1000 simulations for each parameter set
@distributed for param in param_sets
    data = zeros(1000);

    # 1000 simulations in multi-threaded loop
    Threads.@threads for k in 1:1000
        data[k] = run_simulation(param);
    end

    # Simulation info dictionary
    sim_info = Dict("parameters" => param, "data" => data);

    # Stringify
    string_data = JSON.json(sim_info);

    # Random ID
    # 62 possible chars, need X spaces for unique ID for Y param sets -> X = log_62(Y)
    par_ID = randstring(ceil(Int, log(62, length(param_sets))));

    # Make sure save path is appropriate
    if !endswith(path, "/")
        new_path = path * "/" * "sim_" * par_ID * ".json";
    else
        new_path = path * "sim_" * par_ID * ".json";
    end

    # Write JSON
    open(new_path, "w") do f
        write(f, string_data);
    end
end

# Kill the workers
for j in workers()
    rmprocs(j);
end
```
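As a side note on the random-ID scheme above: the ID length is chosen so that `62^X >= Y` for `Y` parameter sets, i.e. `X = ceil(log_62(Y))`. A quick standalone illustration (1000 is just a hypothetical parameter-set count, not the real one):

```julia
using Random  # for randstring

# Number of characters needed so that 62^X >= Y distinct IDs,
# i.e. X = ceil(log_62(Y)). Y = 1000 is an illustrative count.
n_param_sets = 1000
id_len = ceil(Int, log(62, n_param_sets))  # -> 2, since 62^2 = 3844 >= 1000
par_ID = randstring(id_len)                # e.g. a random 2-character ID
println(id_len, " ", par_ID)
```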
I have tested this code (without the `@distributed`) on my local machine: it runs without any errors and saves the JSON files appropriately, so I know there are no bugs in the script itself. But for some reason, when I run it on the cluster, the JSON files are not being saved (at least not in the expected directory). In case it matters, here's the SLURM script I'm using to run the batch job:
```bash
#!/bin/bash
# Submit this script with: sbatch thefilename

#SBATCH --time=168:00:00        # walltime
#SBATCH --ntasks=25             # number of processor cores (i.e. tasks)
#SBATCH --cpus-per-task=50
#SBATCH --mem-per-cpu=4G        # memory per CPU core
#SBATCH -J "big_simulation"     # job name
#SBATCH --mail-user=firstname.lastname@example.org   # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --gid=mygroup
#SBATCH --error=big_sim.error
#SBATCH --output=big_sim.out

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
export JULIA_NUM_THREADS=50
julia run_big_simulation.jl
```
When I run the batch job, it completes with exit code 0 and the error file for the job is empty, which makes it seem like nothing went wrong. However, the job finishes in less than a minute, and even with the large amount of resources I requested from SLURM, that should not be the case: the `ProgressMeter` package indicated that on my local machine, with a fraction of these resources, the same program would take about 4000 days. So I would expect the program to run on the cluster for at least a few hours or days.
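One thing I've been wondering about while debugging: the Julia manual notes that `@distributed` without a reducing function runs asynchronously and returns a `Task` immediately, so if the script never waits on that task, it could reach the `rmprocs` loop and exit before any iterations run. A minimal sketch of what I mean, with `@sync` forcing the main process to block until the loop finishes (the loop body here is just a placeholder for the real simulation and JSON write):

```julia
using Distributed
addprocs(2)

# @distributed without a reducer returns a Task immediately;
# @sync makes the calling task wait until all iterations finish.
@sync @distributed for i in 1:10
    # placeholder for the real simulation + JSON write
    println("iteration $i done on worker $(myid())")
end

# Only kill the workers after the loop has actually completed
rmprocs(workers())
```

Could the missing `@sync` explain both the missing files and the suspiciously fast completion, even though the small MWE happened to work?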
Thanks ahead of time for any help in this matter. I’m sure there is some obvious error I am missing or something simple I haven’t accounted for.