I’m having some issues saving the output of Julia simulations (stochastic differential equations) to JSON files on a SLURM cluster. The cluster manager and I have not been able to figure this out. My original program runs many simulations and saves the output to JSON files in a specified directory. When that program was failing to save the JSON files anywhere, I wrote a minimal working example of a program that saves JSON files to a directory on the SLURM cluster:
using Distributed
# Add some workers
addprocs(25);
@everywhere begin
    using JSON
    path="/path/to/save/directory/";
end
# Save JSON files in a parallel for-loop
@distributed for i=1:10
    # Data
    data = [1, 2, 3, 4];
    # Stringify
    string_data = JSON.json(data);
    # Save path
    new_path = path*"$i.json";
    # Write JSON
    open(new_path, "w") do f
        write(f, string_data);
    end
end
# Kill the workers
for j in workers()
    rmprocs(j);
end
When I run this on the cluster as a batch job, it saves the JSON files to the desired directory without any problem. My actual program looks something like the one below; as you can see, apart from a few additions for the simulation component, it is very similar to the minimal working example:
# File name:
# run_big_simulation.jl
using Distributed
# Add some workers
addprocs(25);
@everywhere begin
    using Random, Distributions
    using IterTools
    using JSON
    # -------------------------------- #
    # Define a bunch of functions here #
    # -------------------------------- #
    # -------------------------- #
    # Assign some variables here #
    # -------------------------- #
    # Cartesian product of intervals for each parameter
    # Defines the total parameter set to simulate
    param_sets = product(x1, x2, ...);
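    # (x1, x2, … are ranges of values for each parameter, e.g. a hypothetical x1 = 0.0:0.1:1.0;
    #  the real ranges are omitted here)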
    # Path to directory to save JSON files to
    path="/path/to/save/directory/";
end
# Run 1000 simulations for each parameter set
@distributed for param in param_sets
    data = zeros(1000);
    # 1000 simulations in multi-threaded loop
    Threads.@threads for k in 1:1000
        data[k] = run_simulation(param);
    end
    # Simulation info dictionary
    sim_info = Dict("parameters" => param, "data" => data);
    # Stringify
    string_data = JSON.json(sim_info);
    # Random ID
    # 62 possible chars, need X spaces for unique ID for Y param sets -> X = log_62(Y)
    par_ID = randstring(ceil(Int, log(62, length(param_sets))));
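    # e.g. with 10_000 parameter sets (an illustrative number, not the actual size of param_sets),
    # ceil(Int, log(62, 10_000)) == 3, so each random ID would be 3 characters long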
    # Make sure save path is appropriate
    if !endswith(path, "/")
        new_path = path*"/"*"sim_"*par_ID*".json";
    else
        new_path = path*"sim_"*par_ID*".json";
    end
    # Write JSON
    open(new_path, "w") do f
        write(f, string_data);
    end
end
# Kill the workers
for j in workers()
    rmprocs(j);
end
I have tested this code (without the @everywhere and @distributed) on my local machine, and it runs without any errors and saves the JSON files appropriately, so I know there are no bugs in the script itself. But for some reason, when I run it on the cluster, the JSON files are not being saved (at least not in the expected directory). In case it matters, here’s the SLURM script I’m using to run the batch job:
#!/bin/bash
#Submit this script with: sbatch thefilename
#SBATCH --time=168:00:00 # walltime
#SBATCH --ntasks=25 # number of processor cores (i.e. tasks)
#SBATCH --cpus-per-task=50
#SBATCH --mem-per-cpu=4G # memory per CPU core
#SBATCH -J "big_simulation" # job name
#SBATCH --mail-user=myemail@university.edu # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --gid=mygroup
#SBATCH --error=big_sim.error
#SBATCH --output=big_sim.out
# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
export JULIA_NUM_THREADS=50
julia run_big_simulation.jl
When I run the batch job, it completes with exit code 0 (no issues) and the error file for the job is empty, which makes it seem like no errors occurred. However, the job finishes in less than a minute, and even with the large amount of resources I requested from SLURM, that should not be the case. The ProgressMeter package indicated that on my local machine, with a fraction of these resources, the same program would take about 4000 days to run, so I would expect it to run for at least a few hours or days on the cluster.
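For reference, the serial version I timed locally (i.e. without @everywhere and @distributed) looked roughly like the sketch below; it reuses run_simulation, param_sets, and path from the script above, and the time estimate came from ProgressMeter’s @showprogress macro:
using Random, Distributions, IterTools, JSON, ProgressMeter
# (same function definitions, variables, param_sets, and path as in run_big_simulation.jl)
@showprogress for param in param_sets
    data = zeros(1000);
    # 1000 simulations in a plain serial loop
    for k in 1:1000
        data[k] = run_simulation(param);
    end
    # Collect and stringify the simulation info
    sim_info = Dict("parameters" => param, "data" => data);
    # Random file ID, same scheme as above
    par_ID = randstring(ceil(Int, log(62, length(param_sets))));
    # Write JSON
    open(path*"sim_"*par_ID*".json", "w") do f
        write(f, JSON.json(sim_info));
    end
end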
Thanks ahead of time for any help in this matter. I’m sure there is some obvious error I am missing or something simple I haven’t accounted for.