Writing Data to JSON Files on HPC Cluster

I'm having some issues saving the output of Julia simulations (stochastic differential equations) on a SLURM cluster to JSON files. The cluster manager and I have not been able to figure this out. My original program runs many simulations and saves the output to JSON files in a specified directory. When that program was failing to save the JSON files anywhere, I wrote a minimal working example of a program that saves JSON files to a directory on a SLURM cluster:

using Distributed

# Add some workers
addprocs(25);

@everywhere begin
    using JSON
    path="/path/to/save/directory/";
end

# Save JSON files in a parallel for-loop
@distributed for i=1:10
    # Data
    data = [1, 2, 3, 4];
    # Stringify
    string_data = JSON.json(data);
    # Save path
    new_path = path*"$i.json";
    # Write JSON
    open(new_path, "w") do f
        write(f, string_data);
    end
end

# Kill the workers
for j in workers()
    rmprocs(j);
end

When I run this in the cluster as a batch job, it saves the JSON files to the desired directory no problem. My actual program looks something like the one below, and as you can see, with a few additions for the simulation component, it is very similar to the minimal working example:

# File name:
# run_big_simulation.jl
using Distributed

# Add some workers
addprocs(25);

@everywhere begin
    using Random, Distributions
    using IterTools
    using JSON

    # -------------------------------- #
    # Define a bunch of functions here #
    # -------------------------------- #

    # -------------------------- #
    # Assign some variables here #
    # -------------------------- #
    
    # Cartesian product of intervals for each parameter
    # Defines the total parameter set to simulate
    param_sets = product(x1, x2, ...);
    
    # Path to directory to save JSON files to
    path="/path/to/save/directory/";
end

# Run 1000 simulations for each parameter set
@distributed for param in param_sets
    data = zeros(1000);
    # 1000 simulations in multi-threaded loop
    Threads.@threads for k in 1:1000
        data[k] = run_simulation(param);
    end

    # Simulation info dictionary
    sim_info = Dict("parameters" => param, "data" => data);

    # Stringify
    string_data = JSON.json(sim_info);

    # Random ID
    # 62 possible chars, need X spaces for unique ID for Y param sets -> X = log_62(Y)
    par_ID = randstring(ceil(Int, log(62, length(param_sets))));

    # Make sure save path is appropriate
    if !endswith(path, "/")
        new_path = path*"/"*"sim_"*par_ID*".json";
    else
        new_path = path*"sim_"*par_ID*".json";
    end

    # Write JSON
    open(new_path, "w") do f
        write(f, string_data);
    end
end

# Kill the workers
for j in workers()
    rmprocs(j);
end

I have tested this code (without the @everywhere and @distributed) on my local machine; it runs without any errors and saves the JSON files appropriately, so I know there are no bugs in the script. But for some reason, when I run it on the cluster, the JSON files are not being saved (at least not in the expected directory). In case it matters, here’s the SLURM script I’m using to run the batch job:

#!/bin/bash

#Submit this script with: sbatch thefilename

#SBATCH --time=168:00:00                  # walltime
#SBATCH --ntasks=25                       # number of processor cores (i.e. tasks)
#SBATCH --cpus-per-task=50
#SBATCH --mem-per-cpu=4G                  # memory per CPU core
#SBATCH -J "big_simulation"          # job name
#SBATCH --mail-user=myemail@university.edu   # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --gid=mygroup
#SBATCH --error=big_sim.error
#SBATCH --output=big_sim.out

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
export JULIA_NUM_THREADS=50

julia run_big_simulation.jl

When I run the batch job, it completes with exit code 0 (no issues) and the error file for the job is empty. This makes it seem like no bugs occurred. However, it runs in less than a minute, and I know that even with the large amount of resources I requested from SLURM, this should not be the case. The ProgressMeter package indicated that on my local machine with a fraction of these resources, the same program would take 4000 days to run. So I would expect the program to run for at least a few days or hours on the cluster.

Thanks ahead of time for any help in this matter. I’m sure there is some obvious error I am missing or something simple I haven’t accounted for.

Not sure if it's helpful, but I’ve had good luck using the BSON (binary JSON) package to write .bson files (from a Fortran PDE solver wrapped in Julia) on a SLURM cluster, which I then analyze on my local Linux box.

In my case, all the parallelization is in the Fortran code and Julia is just the driver/wrapper, so it's a bit different from your case. I haven't used Distributed myself, but I have done parallel programming, and this quick Google search seems to contain your answer (a rough sketch of that pattern follows the links below):

https://stackoverflow.com/questions/60373801/how-do-i-do-i-o-in-julia-distributed-for-loop-run-non-interactively

BSON.jl
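
To make that concrete, here is a rough sketch of the pattern I understand the linked answer to describe (I haven't run this on a cluster myself, and the path and worker count are placeholders): wrap the @distributed loop in @sync so the main process waits for the workers to finish and rethrows any error they hit, instead of moving on immediately.

using Distributed

addprocs(4);

@everywhere begin
    using JSON
    path = "/path/to/save/directory/";   # placeholder path
end

# @sync blocks here until every distributed task has finished,
# and rethrows any error raised on a worker instead of failing silently.
@sync @distributed for i = 1:10
    data = [1, 2, 3, 4];
    open(joinpath(path, "$i.json"), "w") do f
        write(f, JSON.json(data));
    end
end

# Kill the workers
rmprocs(workers());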


Somehow I missed that Stack Overflow post while searching around for answers on Google! That seems like it should solve it. I’ll do some tests later, and if everything works out, I’ll select your response as the solution. Thanks so much, this is really helpful.

Parallel programming is hard. I banged my head against synchronous/asynchronous issues in graduate school two decades ago… I was hoping that by now compilers would take care of this automagically… but at least Julia is a step in the right direction (compared to the nightmare of parallel Fortran/C++/OpenMP/MPI code). Glad I could help!


The @sync is helpful for sure, so I’m going to select this as the solution. But it's worth noting that the main reason the code in my original post does not work is that I’m using the @distributed macro on a for-loop over a product iterator; this fails because @distributed expects the iterator to implement getindex, which the product iterator does not. The program fails silently unless you include the @sync macro, which makes it report an error upon failure. A rough sketch of the resulting fix is after the issue links below.

https://github.com/JuliaLang/julia/issues/33998

https://github.com/JuliaLang/julia/issues/30343
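
For anyone who finds this later, here is a rough sketch of the kind of change that addresses both points (simplified, not my full script): collect the product iterator into an array so it supports getindex, and keep @sync on the distributed loop so any worker failure is actually reported.

# Materialize the lazy product iterator into an array, since @distributed
# partitions work by indexing into its iterator with getindex.
param_array = collect(param_sets);

# @sync waits for all distributed tasks to finish and surfaces worker errors.
@sync @distributed for param in param_array
    data = zeros(1000);
    # 1000 simulations in a multi-threaded loop
    Threads.@threads for k in 1:1000
        data[k] = run_simulation(param);
    end
    # ... build sim_info, stringify with JSON.json, and write the file as before ...
end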