CuArrays: Call reclaim from user code


I’m getting hit pretty hard with what I think is this issue:

When using CuArrays 1.2.1 I could keep things afloat fairly well by calling CuArrays.reclaim(true) at certain points, but it seems that in 1.3.0+ this functionality is moved inside the BinnedPool module.

On my system it seems like it is accessible through something like CuArrays.pool.x.reclaim(true), but this is a bit too many delimiters for my taste and I’m worried that the code might fail randomly on some other system due to CuArrays selecting another memory strategy (does it do that btw?).

Simple question is thus: Is there a “safe” way to call this method from user code?

Unfortunately I don’t have an MWE which I consider feasible for someone else to run (it takes about half a day to get into the bad situation, so it would take quite a long time to distill it into a readable MWE).

The gist of what I’m doing is something like this:

for epoch in 1:nepochs
    for (mi, model) in enumerate(models)
        model_gpu = model |> gpu
        for data in dataset
            train!(model_gpu, data |> gpu)
        end
        models[mi] = model_gpu |> cpu
        model_gpu = nothing # Not sure this would work, "real" code does not do it in this way
        # This is a good point to tell CuArrays to release memory
    end

    models = updatemodels!(models)
end


Where models is an array of Flux models. Note that I wrote the code above “on the fly” and it might not work. The “real” code is tested and it works, but it is a bit more involved and indirect than the above snippet.

In case it is not obvious, keeping the whole dataset and all models in GPU memory leads to an OOM, which is why I tried this strategy.

I understand this might be a bit unorthodox, but on the other hand, I don’t see why it should not be feasible to do assuming one is ok with the overhead from transferring models back and forth (which should be small compared to the training time).

It’s curious that calling reclaim manually improves performance, since this function is called anyway when CuArrays runs out of memory (that is, before calling into GC.gc, which is the main source of slowness). Furthermore, it only reclaims objects which the Julia GC has already freed, so it’s not really giving up all the memory that it could at the point you’re invoking the function (since the GC likely hasn’t run at that moment).

Anyhow, to answer your questions.

You can call CuArrays.BinnedPool.reclaim(true), no need to peek into the Ref there.

At this point CuArrays doesn’t select another memory strategy on its own; it’s just a mechanism to be able to experiment with new allocators that may be made the default at some point.

At this point there isn’t a “safe” way, but it might make sense to make this an API that every allocator implements (to give up and release all cached memory). If you’re interested in that, please open an issue.


Thanks a lot for the prompt reply @maleadt

I forgot to say that I also call GC.gc() right before reclaim.
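A minimal sketch of that GC + reclaim step, assuming CuArrays 1.3.0 where CuArrays.BinnedPool.reclaim(true) is reachable, with a fallback to CuArrays.reclaim(true) as in 1.2.1. The helper name release_gpu_memory is made up for illustration and is not part of CuArrays; other versions may need a different call.

```julia
using CuArrays

# Hypothetical helper: run the Julia GC first so that dead CuArrays are
# actually freed, then ask the pool to give its cached memory back.
function release_gpu_memory()
    GC.gc()
    if isdefined(CuArrays, :BinnedPool)
        # Location of reclaim in CuArrays 1.3.0 (per this thread)
        CuArrays.BinnedPool.reclaim(true)
    else
        # Location of reclaim in CuArrays 1.2.1 (per this thread)
        CuArrays.reclaim(true)
    end
    return nothing
end
```

In the loop from the first post, this would be called at the "good point to tell CuArrays to release memory" comment.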

I originally did so to debug what turned out to be a kinda high level memory leak with stateful optimizers (Flux stores the state in a lazy dict using the weights arrays as key, but the gpu <-> cpu transfer creates new arrays so the state dict just kept growing).
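To illustrate the kind of leak I mean, here is a small sketch in plain Julia. It assumes the optimizer state lives in an IdDict keyed by the weight arrays; getstate! is a made-up stand-in, not the Flux function, and copy stands in for the gpu <-> cpu round trip.

```julia
# State keyed by object identity, one entry per distinct weight array.
state = IdDict{Any,Any}()

# Hypothetical stand-in for an optimizer looking up (or creating) its state.
getstate!(state, w) = get!(() -> zero(w), state, w)

w = rand(Float32, 3)
for round in 1:3
    # Each transfer produces a new array object with the same values,
    # so the lookup misses and a fresh entry is created.
    w_moved = copy(w)
    getstate!(state, w_moved)
end

println(length(state))  # one entry per round, even though the values never changed
```

Since the old keys are still referenced by the dict, neither the old weights nor their state can be collected.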

After having addressed the issue it seemed that removing the GC + reclaim made the GC situation worse, but it is a bit hard to benchmark. I guess my naive theory is that since I can see how the GPU memory utilization really drops to zero when called “between models”, this creates more of a “clean slate” for the next round (I understand this is probably not how it works though).

Anyway, thanks for the advice. I will apply it and do some more testing to see if it really helps or not.

Thanks a lot @maleadt

I was thinking I would wait with opening the issue until I had a chance to try out the GC improvements you made (although I guess it will take some time before I can try them out).

In case it is of any value, below are some examples of what I see with “manual” reclaim vs “no manual” reclaim. This is in no way meant as criticism or “fix this now!”, just an honest attempt to be helpful. I’m very thankful for the work you have done and are doing with CuArrays!

If it would be useful to you, I can of course do some more thorough experiments (e.g. keeping more things equal and storing all timing data in a file).

Maybe worth noting is that I run the function in a loop from the shell, and the function exits if the GC time has increased by more than 10% (ish) compared to the first round. GC times are always back to normal with each restart.
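Roughly, the exit check looks like the sketch below. The names gc_fraction and should_restart and the exact threshold rule are made up for illustration; the real code is more involved.

```julia
# Fraction of wall time spent in GC while running f().
# @timed returns (value, time, bytes, gctime, ...); positions 2 and 4 are
# the elapsed time and the GC time in seconds.
function gc_fraction(f)
    stats = @timed f()
    elapsed, gctime = stats[2], stats[4]
    return gctime / elapsed
end

# Hypothetical exit rule: restart once the GC fraction is more than
# `tol` (relative) above the first round's value.
should_restart(baseline, current; tol = 0.10) = current > baseline * (1 + tol)

baseline = gc_fraction(() -> sum(rand(10^6)))
current = gc_fraction(() -> sum(rand(10^6)))
should_restart(baseline, current) && println("GC time crept up, exiting")
```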

At first GC times are low and creep up slowly regardless of whether memory is “manually” reclaimed or not:

[ Info:         Train candidate 6 with 91 vertices
 36.360417 seconds (71.58 M allocations: 3.862 GiB, 8.80% gc time)
[ Info:         Train candidate 7 with 100 vertices
 49.108530 seconds (102.81 M allocations: 5.466 GiB, 10.16% gc time)
[ Info:         Train candidate 8 with 91 vertices
 41.673470 seconds (84.63 M allocations: 4.530 GiB, 10.98% gc time)
[ Info:         Train candidate 9 with 94 vertices
 27.739090 seconds (55.41 M allocations: 2.950 GiB, 10.79% gc time)
[ Info:         Train candidate 10 with 85 vertices
 45.485526 seconds (93.96 M allocations: 5.001 GiB, 12.03% gc time)

Then after about 20000 batches the “jump happens”.

Here there is a difference between using reclaim:

 45.427205 seconds (69.42 M allocations: 3.724 GiB, 22.84% gc time)
[ Info:         Train candidate 4 with 95 vertices
 43.208266 seconds (64.15 M allocations: 3.422 GiB, 22.83% gc time)
[ Info:         Train candidate 5 with 82 vertices
 36.491686 seconds (53.64 M allocations: 2.957 GiB, 23.66% gc time)
[ Info:         Train candidate 6 with 93 vertices
 60.582049 seconds (69.02 M allocations: 3.866 GiB, 42.05% gc time)
[ Info:         Train candidate 7 with 94 vertices
 63.754524 seconds (64.26 M allocations: 3.483 GiB, 47.92% gc time)
[ Info:         Train candidate 8 with 74 vertices
 60.069469 seconds (57.79 M allocations: 3.180 GiB, 51.73% gc time)
[ Info:         Train candidate 9 with 86 vertices
 62.742877 seconds (60.73 M allocations: 3.323 GiB, 49.91% gc time)
[ Info:         Train candidate 10 with 82 vertices
 60.139157 seconds (53.68 M allocations: 2.959 GiB, 52.78% gc time)

… and not using reclaim:

 30.346514 seconds (44.91 M allocations: 2.554 GiB, 24.12% gc time)
[ Info:         Train candidate 19 with 78 vertices
 37.947110 seconds (58.04 M allocations: 3.206 GiB, 24.83% gc time)
[ Info:         Train candidate 20 with 69 vertices
 37.558483 seconds (53.17 M allocations: 2.960 GiB, 25.86% gc time)
[ Info:         Train candidate 21 with 65 vertices
 93.943946 seconds (43.44 M allocations: 2.614 GiB, 76.08% gc time)
[ Info:         Train candidate 22 with 61 vertices
123.936988 seconds (38.71 M allocations: 2.203 GiB, 83.00% gc time)
[ Info:         Train candidate 23 with 78 vertices
136.634461 seconds (50.56 M allocations: 2.831 GiB, 81.34% gc time)

What’s worse is that when not using reclaim, every once in a while things like this happen:

[ Info:         Train candidate 36 with 75 vertices
1133.816895 seconds (48.94 M allocations: 2.750 GiB, 97.70% gc time)
[ Info:         Train candidate 37 with 74 vertices
273.130093 seconds (51.19 M allocations: 2.860 GiB, 90.36% gc time)
[ Info:         Train candidate 38 with 84 vertices
 83.438119 seconds (54.65 M allocations: 2.965 GiB, 64.49% gc time)
[ Info:         Train candidate 39 with 96 vertices
795.909398 seconds (78.17 M allocations: 4.172 GiB, 95.17% gc time)

This I never see with “manual” GC + reclaim.

It might just as well be my code which causes this.