I have been using Flux.jl to create a simple model. I do not need to take gradients; I only need to evaluate a loss for a given set of parameters. Currently I use a setup similar to this:
using Flux, BenchmarkTools

features, labels = generate_data()      # data generation elided
model = Flux.Chain(....)                # model definition elided
ps, re_fn = Flux.destructure(model)     # flat parameter vector + reconstructor

function loss_fn(parameters)
    mdl = re_fn(parameters)             # rebuild the model from the flat vector
    return Flux.Losses.logitcrossentropy(mdl(features), labels)
end
If I benchmark on a GPU for a given dataset:
@btime loss_fn($ps)
I get around 1 ms, but it allocates a lot of memory. When I run this loss function in parallel (on the same GPU), the allocations eventually fill up the VRAM and the program crashes once the GPU runs out of memory. I am not sure whether this is a bug in CUDA.jl or a consequence of the way Flux.destructure works, which seems to prevent CUDA.jl from properly garbage-collecting the GPU memory.
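For example, as far as I can tell, every call to re_fn builds a fresh set of parameter arrays (the indexing below assumes the first layer is a Dense, just for illustration):

m1 = re_fn(ps)
m2 = re_fn(ps)
m1[1].weight === m2[1].weight   # false: new arrays are allocated on every reconstruction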
Because of this, I was looking at alternative frameworks that do not allocate and treat the parameters as a flat vector. I wrote up an example with SimpleChains.jl and it runs much faster than the Flux.jl model on the CPU (around 20 ms compared to 220 ms for Flux on the CPU) and barely allocates any memory, which is really great, but I am not sure that SimpleChains.jl supports the GPU.
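For reference, the SimpleChains version looks roughly like this (a minimal sketch; the layer sizes, n_features and n_classes are placeholders, and LogitCrossEntropyLoss expects the labels as integer class indices):

using SimpleChains

sc_model = SimpleChain(
    static(n_features),                 # statically sized input
    TurboDense(tanh, 32),
    TurboDense(identity, n_classes),
)
sc_loss = SimpleChains.add_loss(sc_model, LogitCrossEntropyLoss(labels))

p = SimpleChains.init_params(sc_model)
sc_loss(features, p)   # evaluates the loss for a flat parameter vector with almost no allocation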
Are there any DL libraries which allow me to evaluate a loss for a given set of parameters, without allocating memory?
Can I ask why you’re using destructure and what “running in parallel” means? The reason I ask is that you should never use destructure unless you absolutely have to (e.g. compatibility with SciML libraries, hypernetworks). As for running in parallel, if that means multithreading, I would suspect either the Julia GC not being able to keep up or CUDA.jl not handling memory pressure from multiple threads well.
How large is your model and input data? If SimpleChains is running much faster, I suspect you’ll gain little from using a GPU. Flux and Lux allocate a lot because they don’t (actually can’t) mutate any inputs or outputs that could be differentiated. Future compiler optimizations like WIP: ImmutableArrays (using EA in Base) by ianatol · Pull Request #44381 · JuliaLang/julia · GitHub seek to ease that burden, but I assume you don’t want to wait that long!
Not really. Lux is actually worse than Flux when it comes to allocation. I tried adding a cached version of the layers around v0.2 (I think), but the gains were relatively minimal in the cases I was playing with.
Overall, I think the long-term plan is to leverage compiler optimizations, especially in the case of Lux, where we know the functions are pure (for the most part, aside from IO side effects).
I need to use destructure for my research, as I need the parameters of the model as a flat vector that I can manipulate. Previously, I wrote a wrapper to index the parameters themselves within the Flux layers, but this was extremely slow.
In terms of parallel, I do mean multithreading. As a pseudocode example:
function run()
    # generate state
    for i in 1:epochs
        mutate!(state)
        l = loss_fn(state)
        # etc.
    end
    # save results etc.
end

function run_all()
    Threads.@threads for i in eachindex(samples)
        results[i] = run()
    end
end
I cannot use multiprocessing (Distributed) on the same GPU, since it runs out of memory and errors much faster. I explored this a while ago and was told to use threads instead, which has the same issue but doesn’t run into it as quickly. The main reason I want to avoid allocation is not really speed; it is so that I can run in parallel without pressuring the GC.
SimpleChains is only faster on the CPU; Flux on the GPU is 20-30 times faster. The only issue is that the GPU is only running a small model (a few thousand parameters) and would have the capacity to run several at the same time if the GC weren’t such an issue.
Although Lux won’t help with allocations from running the model as mentioned above, it seems like a better fit for this because you can store all the parameters in a ComponentArray without de/restructuring. I couldn’t find a definitive docs link for this, but I’m sure @avikpal has one.
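Roughly, the pattern is something like this (untested sketch; the layer sizes are placeholders, and I’m reusing your features, labels and Flux loss from above):

using Lux, ComponentArrays, Random
using Flux                              # only for Flux.Losses.logitcrossentropy

rng = Random.default_rng()
model = Lux.Chain(Lux.Dense(4 => 32, tanh), Lux.Dense(32 => 2))
ps, st = Lux.setup(rng, model)          # ps is a nested NamedTuple of parameters

ps_flat = ComponentArray(ps)            # flat vector storage, still usable as Lux parameters

function loss_fn(parameters)
    y, _ = model(features, parameters, st)
    return Flux.Losses.logitcrossentropy(y, labels)
end

loss_fn(ps_flat)                        # ps_flat can be mutated/copied like any other vector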
On the GPU memory management side, one thing you could try is setting JULIA_CUDA_MEMORY_POOL=none and/or the other tricks mentioned in Memory management · CUDA.jl. Running full GC + CUDA.reclaim() regularly is not exactly fast, but if it’s the difference between your model running or not it may be worth it.
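For example, something like this inside your inner loop (the interval is arbitrary and worth tuning):

if i % 100 == 0
    GC.gc()          # full collection so unreferenced GPU buffers can be freed
    CUDA.reclaim()   # hand cached memory back to the driver
end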
Lastly, a crazy idea if all else fails: Most operations in Flux models call a corresponding mutating function in NNlib under the hood. Since you’re not taking gradients, you could call those directly. It would be a little like using torch.nn.functional or JAX without a layers library in that parameter + output arrays would have to be passed in manually, but it would also give you almost complete control over allocations.
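For a dense-only model that could look something like this rough sketch (not tested; the buffer sizes are placeholders, and the dense path is just mul! plus an in-place broadcast, whereas conv layers would call NNlib.conv! with a preallocated output):

using LinearAlgebra, CUDA

# Preallocate one set of buffers per thread (sizes are placeholders).
h   = CUDA.zeros(Float32, 32, n_samples)
out = CUDA.zeros(Float32, n_classes, n_samples)

function forward!(out, h, W1, b1, W2, b2, x)
    mul!(h, W1, x)          # h = W1 * x, written into the preallocated buffer
    h .= tanh.(h .+ b1)     # fused in-place broadcast for bias + activation
    mul!(out, W2, h)
    out .= out .+ b2
    return out
end

The loss itself would still allocate a little (logitcrossentropy goes through logsoftmax) unless you also write an in-place version of that, but the bulk of the per-call allocations goes away.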
nested task error: Out of GPU memory trying to allocate 84.500 MiB
Effective GPU memory usage: 99.56% (23.592 GiB/23.697 GiB)
No memory pool is in use.
But this error happens a lot quicker than when using the memory pool.
Also, experiment with the number of threads. Have you done any analysis of what the worst-case memory consumption is as a function of the number of threads?
Note that reading memory utilization stats from the outside can be a bit misleading, as both Julia and CUDA do their own internal memory management.
Thanks for the suggestion! Unfortunately, I’ve tried it and still no luck, even when varying the number of threads and modifying the threshold.
On a single thread, given enough iterations (i.e. evaluations of that loss_fn), the card’s memory fills up, but it never errors because the GC seems able to keep up. As an estimate, I think that if the code were non-allocating, I would need less than 1 GB of memory per thread to run this.