I’ve been using Slurm with Julia for a while. I’ve only recently been running into issues where I’m running out of memory for long-running processes (around 30 hours). The jobs are highly regular, effectively calling the same function over and over on the same model (i.e. a reinforcement learning agent). When I’m testing locally I don’t see any obvious memory leaks. What I’ve hypothesized is that on the cluster Julia doesn’t know about the boundary Slurm is putting up around memory, and thus isn’t garbage collecting as aggressively as I need. Does anyone know if Julia respects the memory boundaries Slurm enforces? And if not, does anyone know how to get Julia to respect them?
The current fix proposed by someone on the Julia Slack is to add calls to the garbage collector every hour or so (which has solved the issue afaict). I’ve done this in my current experiments, but this seems like a temporary fix. Any other ideas?
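For concreteness, the hourly-GC workaround looks roughly like this sketch (the step function, step count, and one-hour interval are placeholders, not my actual experiment code):

```julia
# Force a full collection at a fixed wall-clock interval inside a long loop.
function run_with_periodic_gc(step!, n_steps; gc_interval = 3600.0)
    last_gc = time()
    for i in 1:n_steps
        step!(i)                          # one unit of experiment work
        if time() - last_gc > gc_interval
            GC.gc(true)                   # full (non-incremental) collection
            last_gc = time()
        end
    end
end
```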
This has been raised before in “Should memory in worker be freed after fetching result?”, and a solution to add the ability to set the max memory was discussed quite a while ago in “command line flag to limit heap memory usage? · Issue #17987 · JuliaLang/julia · GitHub”.
That’s strange; I am also using Julia on a Slurm cluster and I have never encountered such an issue. On the contrary, when I lower the amount of memory per CPU, I typically get close to full memory efficiency while GC time increases.
Unfortunately, I cannot directly help with this problem as I am no expert, but maybe the difference in our use cases helps to identify the issue?
My particular use involves using Threads over all cores of a node and distributing equivalent jobs over workers via pmap. As a result, I do not have that many distinct workers. Do you know if your problem somehow depends on the number of workers, i.e. is there still a memory leak if you only use one worker?
In any case, I hope you find a solution!
Edit: To clarify, I am only using one instance of pmap() in a program, so if the issue is freeing memory after a worker is completely finished, this might be why it works in my case.
I think the difference is down to this:

“My particular use involves using Threads over all cores of a node and distributing equivalent jobs over workers via pmap.”
My use case is likely much less sophisticated. I’m a machine learning and reinforcement learning researcher, and I do lots of empirical work. So I’ve built a system that takes an iterable of parameters and runs many experiments all in parallel. I don’t need to have this run on a single node! Technically, it can be spread throughout the entire cluster (so I don’t have to wait for entire nodes to be free to get through the queue), which has been my setup up until now. I’ve since moved to scheduling full nodes, as it makes this issue go away.
Another issue is likely the use of Flux/Zygote + BPTT. I’ve noticed Zygote can allocate a lot of data, which might not get cleaned up very quickly, leading to the issue building up over time. It is possible that this is only problematic when I’m using Flux/Zygote! I’m not too sure.
It would be nice to go back to my previous workflow at some point, so I can schedule experiments more quickly, but it is a bit onerous given I would need to call the GC in my experiment at unknown intervals.
I’ve found a similar-sounding topic that has not been linked yet: Memory leak for lazy worker to worker connections · Issue #28887 · JuliaLang/julia · GitHub. I am not entirely sure what the consequence of calling addprocs with the lazy = false argument is, but maybe it’s something easy to try out.
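If you want to try it, the keyword just goes into the addprocs call (the worker count here is arbitrary):

```julia
using Distributed

# lazy=false sets up all worker-to-worker connections eagerly at startup
# instead of on first use, which is the behavior issue #28887 is about.
addprocs(4; lazy = false)
```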
I assume explicitly calling the GC at specific points in your function, rather than time-based, does not work in your case, but maybe you could check for free memory with Sys.free_memory() as in the example above to find out if a GC call is necessary?
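Something along these lines, where the 2 GiB threshold is a made-up number you would have to tune (and note that Sys.free_memory() reports host-wide free RAM, which may not reflect a Slurm cgroup limit):

```julia
# Collect only when system free memory looks low.
const FREE_MEM_THRESHOLD = UInt64(2) * 2^30   # 2 GiB, illustrative

if Sys.free_memory() < FREE_MEM_THRESHOLD
    GC.gc()
end
```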
This is definitely something to try after my deadline. I’m not sure what the lazy flag does, tbh.
Sys.free_memory(): I think that gives total free system memory (which might be what the GC uses to know when to run?). I’m not sure Slurm changes what is returned by that function (as the boundaries between jobs are soft rather than hard limits, as far as I can tell). Definitely something to test and see.
Thank you for the suggestions!
I just want to chime in that I am experiencing similar behavior as OP in a similar environment. (I use a SLURM job array task to parallelize my computation.) My exact application is solving a value function via backward recursion. The parallelization is across different observations in my data (i.e. each observation has its own value function) with N=2300. Each observation is structured the same (i.e. differs only with regards to values of its variables), so the memory requirements shouldn’t differ across the SLURM job array.
When I run this for any given observation of data, there is no memory leak. But when I try to run the full set simultaneously, I receive an out-of-memory error for a large fraction (25%-75%) of the observations. The problem persists even if I raise the memory allocation per task. It also occurs in Julia 1.4.2 and 1.6.1.
I am not sure what the performance implications of calling GC are, but I’ll give it a shot.
Why not just call GC before each model run? Typically GC takes 50-100 ms, and I assume your model runs take … a lot longer (seconds). So the overhead of doing this kind of thing might be very acceptable:

for parameter in myparams
    GC.gc(true)  # minimal overhead if the next step takes seconds
    modelrun(parameter)
end
Maybe I’m not understanding what you are saying. It is more like this: modelrun(parameter) can take hours to days, so the GC issue comes not from running lots of different settings but from the allocations (of, say, Flux/Zygote) inside modelrun(parameter). And if I took your advice and put GC.gc(true) on every iteration of my optimization cycle, something that takes say 2 seconds can balloon to 30 minutes. That kind of runtime hit is not feasible. So I’m again left with “at what frequency should I call GC.gc inside my experiment” to manage runtime and growing memory.

I think your way of doing things could work when the modelrun function does not take on the order of hours/days, and indeed I do this for all my experiments. The problem is when modelrun itself is on that order.
We had a similar issue and solved it by periodically invoking GC.gc() manually every few iterations. Not an ideal solution, but it worked.
Needless to say, our ML jobs suffer from garbage collection taking a lot of time.
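The every-few-iterations pattern is essentially the following (the interval of 100 is illustrative; it is whatever you end up tuning):

```julia
# Invoke the GC every `gc_every` iterations of a training loop.
function train_with_periodic_gc(step!, n_iters; gc_every = 100)
    for i in 1:n_iters
        step!(i)                      # one training iteration
        i % gc_every == 0 && GC.gc()
    end
end
```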
Gotcha, that context is more informative, thanks.
Perhaps instead of running it on every iteration of your optimization cycle, you could just do it a tunable fraction of the time at random:

rand(Uniform(0,1)) < thr && GC.gc()  # Uniform requires `using Distributions`
Offtopic: How does rand(Uniform(0,1)) differ from just rand()?
Yeah, I guess it doesn’t; I was just being explicit.
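So the dependency-free version is just the following (thr still has to be tuned):

```julia
# rand() already samples uniformly from [0, 1), so no Distributions needed.
thr = 0.01                  # illustrative: GC on about 1% of iterations
rand() < thr && GC.gc()
```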
This looks like a promising strategy. Thanks for sharing!
This still leaves it up to me to tune, but it works. I think what would be nice is to have a way to set a max memory for a Julia process, but I’m not sure this has buy-in from the core Julia team right now.
This specific change is unlikely, since multi-threaded GC is a bigger issue that needs attention anyway. I think it is very likely that Julia’s GC will get a lot of work in the near future, which hopefully will make hacks like this unnecessary.
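For what it’s worth, something along these lines did eventually land: Julia 1.9 added a --heap-size-hint option that makes the GC collect more aggressively as heap usage approaches the given limit (the size and script name below are just examples):

```shell
julia --heap-size-hint=8G myscript.jl
```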
You could try using Base.gc_bytes() to determine whether to run GC or not. Also, @timed will give you stats on what happened during the GC. You could use these things to “autotune”. I realize it’d be nice not to have to do that, but it wouldn’t be crazy hard. It might be as easy as keeping track of the exponentially weighted moving average GC time per loop and running a GC if it gets too low:
budgettime = 0.005  # around 5 ms per loop should be spent in GC on average
avgtime = 0.0
for i in 1:myiters
    thistime = (avgtime < budgettime) ? (@timed GC.gc()).time : 0.0
    avgtime = 0.75 * avgtime + 0.25 * thistime
    # ... do the actual work for iteration i ...
end
It looks like Base.gc_bytes() is a monotonic counter of total allocated bytes. I’m not sure if that’s correct, but if it is… something like:
gcafter = 5e9  # GC after 5 GB have been allocated
lastgc = Base.gc_bytes()
for i in 1:myiters
    # ... do the actual work for iteration i ...
    nowbytes = Base.gc_bytes()
    if nowbytes - lastgc > gcafter
        GC.gc(true)
        lastgc = Base.gc_bytes()
    end
end
Here’s a piece of code that increases Base.gc_bytes() but never triggers any GC runs. So at least in this situation, it seems inappropriate to manually invoke a GC run just because Base.gc_bytes() has increased by a large amount.

const a = Int[]
for i in 1:10^7
    resize!(a, i)  # grow the buffer...
    resize!(a, 0)  # ...and shrink it back, without involving the GC
end
You can check with the @time macro that no time is ever spent on GC, no matter how many times you run the code above, but Base.gc_bytes() increases every time you run it. Presumably resize!() directly calls free() of the underlying C runtime, and does not need GC.
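A quick sanity check of that counter interpretation (the allocation size is arbitrary; GC.enable is used only to make sure no collection happens between the two reads):

```julia
GC.enable(false)                 # temporarily disable collections
before = Base.gc_bytes()
v = rand(10^6)                   # allocate roughly 8 MB of Float64s
after = Base.gc_bytes()
GC.enable(true)
```

On the “monotonic counter” reading, `after` should come out larger than `before`.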
It’s certainly not a strategy for general use, but in the OP’s case it might work: there the problem is already known to be that GC needs to be called; the question is just when. I kinda like the ewma time-budget version, though.