I’m getting hit pretty hard with what I think is this issue: https://github.com/JuliaGPU/CuArrays.jl/issues/323
When using CuArrays 1.2.1 I could keep things afloat fairly well by calling
CuArrays.reclaim(true) at certain points, but in 1.3.0+ this functionality seems to have moved into the BinnedPool module.
On my system it seems to be accessible through something like
CuArrays.pool.x.reclaim(true), but that is a few too many delimiters for my taste, and I'm worried the code might fail randomly on some other system if CuArrays selects a different memory strategy (does it do that, btw?).
Simple question is thus: Is there a “safe” way to call this method from user code?
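For reference, the workaround I'm experimenting with looks roughly like the sketch below. To be clear, this is guesswork on my part: the `pool.x` path is just what I found by poking around the internals on my system, and I don't know whether either accessor is a stable API.

```julia
using CuArrays

# Hypothetical helper: try the accessors I know about, in order.
# CuArrays.reclaim(true) worked for me on 1.2.1; on 1.3.0+ the function
# appears to live on the active pool object (CuArrays.pool.x here).
function try_reclaim()
    if isdefined(CuArrays, :reclaim)
        CuArrays.reclaim(true)           # 1.2.x-style top-level API
    elseif isdefined(CuArrays, :pool)
        CuArrays.pool.x.reclaim(true)    # what happens to work on my 1.3.0+ setup
    else
        @warn "Don't know how to reclaim GPU memory on this CuArrays version"
    end
end
```

This at least avoids hard-coding the internal path at every call site, but it obviously still depends on internals, which is why I'm asking whether a supported entry point exists.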
Unfortunately I don’t have an MWE that I consider feasible for someone else to run (it takes about half a day to get into the bad situation, so it would take quite a while to distill it into a readable MWE).
The gist of what I’m doing is something like this:
```julia
for epoch in 1:nepochs
    for (mi, model) in enumerate(models)
        model_gpu = model |> gpu
        for data in dataset
            train!(model_gpu, data |> gpu)
        end
        models[mi] = model_gpu |> cpu
        model_gpu = nothing # Not sure this would work, "real" code does not do it in this way
        # This is a good point to tell CuArrays to release memory
    end
    models = updatemodels!(models)
end
```
`models` is an array of Flux models. Note that I wrote the code above on the fly and it might not run as-is. The "real" code is tested and works, but it is a bit more involved and indirect than the snippet.
In case it is not obvious: keeping the whole dataset and all models in GPU memory leads to an OOM, which is why I tried this strategy.
I understand this might be a bit unorthodox, but on the other hand I don't see why it shouldn't be feasible, assuming one is OK with the overhead of transferring models back and forth (which should be small compared to the training time).