How to debug memory leak / GC issue

How do I debug a (supposed) memory leak or GC issue? I would like to understand which memory allocations are leaking or are not being freed by the GC after use.
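To make the question concrete, here is a minimal sketch of the kind of measurement I have in mind: force a full collection, record the live heap size, run a piece of code, collect again, and compare. The helper name `retained_bytes` is made up for illustration. `Base.gc_live_bytes` is an unexported internal, so treat the numbers as a rough indication rather than an exact accounting.

```julia
# Hypothetical helper (not from any package): estimate how many bytes stay
# live on the GC heap after calling `f` and forcing a full collection.
function retained_bytes(f)
    GC.gc(true)                      # full collection for a clean baseline
    before = Base.gc_live_bytes()    # live heap bytes as tracked by the GC
    f()                              # run the code under suspicion, discard its result
    GC.gc(true)                      # collect everything that is now unreachable
    after = Base.gc_live_bytes()
    return after - before            # > 0 suggests something is still referenced
end

# Example: a temporary array should be collectable, a global push should not.
const KEEP = Vector{Vector{Float64}}()           # stand-in for an accidental cache
temp_bytes = retained_bytes(() -> sum(rand(10^6)))
kept_bytes = retained_bytes(() -> push!(KEEP, rand(10^6)))
println("temporary: ", temp_bytes, " bytes, cached: ", kept_bytes, " bytes")
println("size of KEEP: ", Base.summarysize(KEEP), " bytes")
```

Wrapping individual stages (initialization only, one optimization run, one gradient call) in such a helper would at least show which stage leaves memory behind. It would not say which object holds the reference.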

Context:
I’m experiencing a problem in which memory use grows without bound until Julia is killed by the OS. I am having difficulty debugging the problem or coming up with an MWE. It seems to be related to the use of Zygote and/or Optim, but I can’t be sure. I suspect them because I’m training a lot of models in parallel, and when I simply initialize the parameters instead of optimizing them, the issue disappears. I don’t know how to get any further.

The final vector of all trained models is much smaller than my computer’s total memory: running the loop sequentially does not exhibit the problem, and the final struct containing all models is only 6 GB in size, while I have 96 GB of RAM in total. Apparently, running the loop in parallel somehow prevents the GC from being triggered, or there is a bug somewhere deeper that leaks memory. Putting safepoints in the function that each thread executes did not help. Running the GC manually whenever RAM usage reaches a given percentage works only in the early stages of the loop; later on, each collection frees less memory, so the GC ends up being triggered all the time and execution speed suffers greatly.
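For reference, the manual-GC workaround looks roughly like the sketch below. This is not my actual code: `maybe_collect`, `fit_model`, and `train_all` are placeholder names, `fit_model` is a stand-in for the real Zygote/Optim training step, and the 0.8 threshold is an arbitrary choice.

```julia
using Base.Threads

# Placeholder helper: run a full collection when the used fraction of
# system RAM exceeds `threshold`. Returns true if a collection was triggered.
function maybe_collect(threshold = 0.8)
    used = 1 - Sys.free_memory() / Sys.total_memory()
    if used > threshold
        GC.gc(true)
        return true
    end
    return false
end

# Stand-in for training one model; the real code calls the optimizer here.
fit_model(i) = sum(abs2, rand(10_000))

function train_all(n; threshold = 0.8)
    results = Vector{Float64}(undef, n)
    @threads for i in 1:n
        results[i] = fit_model(i)    # each iteration writes only its own slot
        GC.safepoint()               # let a collection requested by another thread proceed
        maybe_collect(threshold)     # manual GC once memory pressure is high
    end
    return results
end

results = train_all(16)
println("trained ", length(results), " models on ", nthreads(), " thread(s)")
```

Once live memory itself sits above the threshold, `maybe_collect` fires on every iteration without freeing much, which matches the slowdown described above.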


Hi @simsurace,

Did you have any luck with this?

The problem disappeared through some refactoring, but I never figured out why.