Hello everyone! I'm using Julia for several things in my project, including a fairly simple simulation code that saves its data with HDF5.jl. The simulation can be started from a range of initial conditions, so I have a function that uses @threads
to run an independent simulation from each starting point for a given setup, each saving to its own HDF5 file.
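For context, the threaded saving pattern looks roughly like this (a minimal sketch, not my actual code; `run_sim` and the dataset names are hypothetical stand-ins):

```julia
using Base.Threads
using HDF5

# Hypothetical stand-in for one independent simulation run.
run_sim(x0) = (x0, x0^2)

function run_all(starting_points, outdir)
    @threads for i in eachindex(starting_points)
        x0 = starting_points[i]
        result = run_sim(x0)
        # Each iteration writes to its own file, so no two threads
        # ever share an HDF5 handle; the do-block closes the file.
        h5open(joinpath(outdir, "run_$i.h5"), "w") do f
            f["x0"] = result[1]
            f["value"] = result[2]
        end
    end
end
```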
To be specific, all of my code is here: the src/GraphModels
package is the main codebase (you can skip it, but it's there in case you need to look up types), src/GillespieSim
is the actual simulation, which does the HDF5 saving and contains the threaded function mentioned above (relatively short). Finally, the run itself is the cluster_env/runs/gil_firstrun_localtest/job.slurm
slurm/bash script, which uses the cluster_env/scripts/gmca_gil1.jl
script that loads the packages above. Perhaps the best function to start with would be this one.
I've run smaller variants of the script multiple times on my machine with no issues; everything runs and the resulting files are fine too. But running the full, long version has caused me lots of problems. I've been trying to run it on a Slurm cluster, where it has now failed twice after several hours. The Slurm output from the last run is:
echo echo testing -> 20 20
get_variables(ca) = Num[ε_b, cATP, cP]
[24547] signal (11.1): Segmentation fault
in expression starting at none:1
H5FL_blk_free at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
H5FL_seq_free at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
H5B__node_dest at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
H5B_create at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
unknown function (ip: 0x7f99fcf91e59)
unknown function (ip: 0x7f99fcf91b9b)
unknown function (ip: 0x7f99fcf91b9b)
H5B_insert at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
unknown function (ip: 0x7f99fcfc963d)
unknown function (ip: 0x7f99fcfd3a3d)
unknown function (ip: 0x7f99fcfd45ed)
unknown function (ip: 0x7f99fcfd4d5b)
unknown function (ip: 0x7f99fcfd8571)
H5D__write at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
H5VL__native_dataset_write at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
H5VL_dataset_write_direct at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
unknown function (ip: 0x7f99fcfc208a)
H5Dwrite at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[24547] signal (6.-6): Aborted
in expression starting at none:1
/var/spool/slurmd/job28419/slurm_script: line 9: 24547 Aborted julia -t $SLURM_CPUS_PER_TASK -L ../../scripts/gmca_gil1.jl -e 'run_extreme()'
which doesn't really help me. Most of the HDF5 files are fine, but some (presumably the ones that were open when the code failed) are corrupted. Frustrated with the cluster at this point, I also tried running the full long simulation on my machine, which interestingly got further than the cluster run but ultimately failed as well, with no error log at all: the process was simply killed. The system log, however, includes the line Out of memory: Killed process 586266 (julia) total-vm:33854260kB...
which checks out, as the system was struggling for RAM and the memory usage of the Julia process kept growing.
So I suspect there is a memory leak somewhere?
But I have no real idea what to do about it. I had assumed that with a garbage collector, and given that none of these simulations actually needs much memory, this would not be a problem. Is the problem coming from the HDF5 library doing something odd under the hood, or is it really a Julia-side memory leak?
Any tips or advice would be greatly appreciated, as this is being quite a problem for me at the moment!
One thing I have tried just now is adding a GC.gc()
call to the @threads
loop that runs each simulation and restarting the run. So far the memory usage seems to have flattened off at ~5 GB, so maybe this has helped. I'll report back once the job has finished or crashed.
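Concretely, the change looks roughly like this (again a sketch; `run_one` is a hypothetical stand-in for running one simulation and saving its HDF5 file):

```julia
using Base.Threads

# Hypothetical: run simulation i and save its HDF5 file.
run_one(i) = nothing

function run_all(n)
    @threads for i in 1:n
        run_one(i)
        # Force a full collection after each run, in case buffers or
        # finalizer-managed HDF5 handles are piling up between runs.
        GC.gc()
    end
end
```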