Debugging segfaults/running out of memory, maybe a memory leak?

Hello everyone! I am using Julia for a bunch of things in my project, including a pretty simple simulation code that saves its data using HDF5.jl. The simulation can be started from a range of starting conditions, so I have a function that uses @threads to run an independent simulation from each starting point for a given setup, with each run saving to its own HDF5 file.
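In outline, the driver looks roughly like this (a minimal sketch, not my actual code; `run_simulation` here stands in for the real GillespieSim entry point):

```julia
using Base.Threads, HDF5

function run_all(starting_points, outdir)
    @threads for i in eachindex(starting_points)
        # hypothetical: one independent simulation per starting point
        result = run_simulation(starting_points[i])
        h5open(joinpath(outdir, "run_$i.h5"), "w") do file
            file["trajectory"] = result   # each run writes to its own HDF5 file
        end
    end
end
```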

To be specific, all of my code is here: the src/GraphModels package is the main codebase (you can skip it, but it's there in case you need to look up types), and src/GillespieSim is the actual simulation code, which does the HDF5 saving and contains the threaded function mentioned above (it's relatively short). Finally, the run itself is the cluster_env/runs/gil_firstrun_localtest/job.slurm slurm/bash script, which uses the cluster_env/scripts/gmca_gil1.jl script that loads the packages above. Perhaps the best function to start with would be this one.

I've run smaller variants of the script multiple times on my machine with no issues; everything runs and the files come out fine too. But running the full, long version has caused me lots of problems. I've been trying to run it on a Slurm cluster, where it has now failed twice after several hours. The Slurm output from the last run is:

echo echo testing -> 20 20
get_variables(ca) = Num[Īµ_b, cATP, cP]

[24547] signal (11.1): Segmentation fault
in expression starting at none:1
H5FL_blk_free at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
H5FL_seq_free at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
H5B__node_dest at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
H5B_create at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
unknown function (ip: 0x7f99fcf91e59)
unknown function (ip: 0x7f99fcf91b9b)
unknown function (ip: 0x7f99fcf91b9b)
H5B_insert at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
unknown function (ip: 0x7f99fcfc963d)
unknown function (ip: 0x7f99fcfd3a3d)
unknown function (ip: 0x7f99fcfd45ed)
unknown function (ip: 0x7f99fcfd4d5b)
unknown function (ip: 0x7f99fcfd8571)
H5D__write at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
H5VL__native_dataset_write at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
H5VL_dataset_write_direct at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
unknown function (ip: 0x7f99fcfc208a)
H5Dwrite at /home/xucapjko/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so (unknown line)
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

[24547] signal (6.-6): Aborted
in expression starting at none:1
/var/spool/slurmd/job28419/slurm_script: line 9: 24547 Aborted                 julia -t $SLURM_CPUS_PER_TASK -L ../../scripts/gmca_gil1.jl -e 'run_extreme()'

which doesn't really help me. Most of the HDF5 files are fine, but some (presumably the ones open when the code failed) are corrupted. Frustrated with the cluster at this point, I also tried running the full long simulation on my machine, which interestingly got further than the cluster run but ultimately failed as well, with absolutely no error log; it just said the process was killed. That said, the system log includes the line Out of memory: Killed process 586266 (julia) total-vm:33854260kB..., which checks out, as the system was struggling with RAM and the memory usage of the Julia process kept growing.

So I suspect there is some memory leak?

But I have no real idea what to do about it. I assumed that with a garbage collector, and given that none of these sims actually need that much memory, this would not be a problem. Is the problem coming from the HDF5 library doing something weird, or is it really a Julia memory leak?

Any tips or advice would be greatly appreciated, as this is being quite a problem for me at the moment!

One thing I tried just now is adding a GC.gc() call to the @threads loop that runs each simulation (roughly the change sketched below) and starting the simulation again. So far the memory usage seems to have flattened off at ~5GB, so maybe this has helped. I'll report back once the job has finished or crashed.
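For context, the change is essentially this (a sketch; `run_and_save` is a placeholder for the real per-run code):

```julia
using Base.Threads

@threads for i in eachindex(starting_points)
    run_and_save(starting_points[i], outdir)   # run one simulation and write its HDF5 file
    GC.gc()                                    # force a full collection after each run
end
```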

I had a very similar issue in the past when trying to scale up simulations on a cluster. I ended up not using an @threads loop, but instead running a Julia script with "batch" and "continue" arguments to split the multiple independent simulations across different processes. I also first write the output and checkpoint files under a temporary name, and then atomically rename them after checking they have the correct content (see the sketch below). When a memory leak, a power outage, or the cluster time limit eventually kills the process, I can restart the simulation without affecting the results.
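A minimal sketch of that write-then-rename pattern (names like `is_valid_output` are made up here):

```julia
using HDF5

tmppath   = joinpath(outdir, "run_$i.h5.tmp")
finalpath = joinpath(outdir, "run_$i.h5")
h5open(tmppath, "w") do file
    file["trajectory"] = result
end
# rename is atomic on the same filesystem, so finalpath either doesn't exist or is complete
is_valid_output(tmppath) && mv(tmppath, finalpath; force=true)
```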

I made a package to help manage this: GitHub - medyan-dev/MEDYANSimRunner.jl (Manage long running MEDYAN.jl simulations).


Are they big enough to take at least a few seconds? If so, can you see the memory usage increasing in the task manager? If I were you, I'd try to confirm whether there is a memory leak with smaller variants that still finish safely.

If there is a leak, the proper tool to use here would be a memory profiler.
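For example, something along these lines (a sketch; it assumes Julia ≥ 1.8 for the built-in allocation profiler, PProf.jl for viewing the result, and `run_one_simulation` is a placeholder for a single small run):

```julia
using Profile
import PProf

Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=0.01 run_one_simulation()
PProf.Allocs.pprof()   # interactive view of where allocations happen
```

Note that this only tracks allocations made by Julia code, so memory held inside a C library like libhdf5 won't show up there.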


Hi, sorry for the delay!

@nhz2 Thanks for the tip. I am somewhat worried it could be the @threads call, but I don't quite think so at this point, beyond it maybe making memory management more complex. It is a fair point to just run the Julia script multiple times, but that both offloads more code into non-Julia territory, which I'd prefer not to do, and multiplies the memory/time overhead of having a Julia process running.

@OmarElrefaei They are, but at most a minute or so, and I don't see any real memory usage increase over that time. I should say the long run failed after ~5 hours, with memory first going up slowly and then seemingly stabilizing at about 8GB, which, together with my other processes, pretty much capped my RAM; I suspect the system only kept running thanks to swap. My laptop was notably slower, with memory usage reported consistently at ~97% for 2-3 hours.

I should also mention a perhaps very important update. After I added the GC.gc() call to the loop (so I suppose it gets called in each thread, idk how that works tbh), the process actually finished! And along with that, the memory usage did not climb nearly as high: it kept rising for about an hour but then capped at ~5GB instead of the ~8-9GB from before. Is it expected behavior or general advice to be manually calling the garbage collector? It seems to make a big difference here, but I was really quite surprised by that.

You are using HDF5.jl, which internally uses the HDF5 C library, and that library is not generally thread-safe. I also don't see any documentation in HDF5.jl saying it is safe to use with multiple threads.
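If the threaded loop does end up calling into HDF5 concurrently, one conservative workaround is to keep the simulations parallel but serialize all HDF5 calls behind a single lock. A sketch (with the same hypothetical `run_simulation` name as above):

```julia
using Base.Threads, HDF5

const HDF5_LOCK = ReentrantLock()

@threads for i in eachindex(starting_points)
    result = run_simulation(starting_points[i])   # simulations still run in parallel
    lock(HDF5_LOCK) do
        h5open(joinpath(outdir, "run_$i.h5"), "w") do file
            file["trajectory"] = result            # only one thread inside libhdf5 at a time
        end
    end
end
```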

The HDF5 C library also has its own memory management (internal free lists) which doesn't always play nicely with Julia's GC. Julia only automatically runs GC when it is running low on memory, but if HDF5 is internally allocating memory without telling Julia, you will need to call GC.gc() manually, or manually tell HDF5 to free its unused memory.
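For instance, something like the following after each run. The C library's call is H5garbage_collect; I'm assuming here that HDF5.jl exposes it under HDF5.API as shown, so check the binding name for your HDF5.jl version:

```julia
using HDF5

function free_unused_memory()
    GC.gc()                           # run finalizers so stale HDF5 handles actually get closed
    HDF5.API.h5_garbage_collect()     # assumed binding for the C library's H5garbage_collect
end
```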


A memory leak has been discovered in the GC of Julia 1.11 and 1.10.

You might want to try it and see whether it matches what you are hitting.