Which library supports a non-allocating neural network model?

Hi all,

I have been using Flux.jl to create a simple model. I do not need to take gradients; I only need to evaluate a loss for a given set of parameters. Currently I use a setup similar to this:

using Flux
using BenchmarkTools

features, labels = generate_data()   # data generation elided
model = Flux.Chain(....)             # layers elided
ps, re_fn = Flux.destructure(model)  # ps: flat parameter vector, re_fn: rebuilds the model
function loss_fn(parameters)
    mdl = re_fn(parameters)
    return Flux.Losses.logitcrossentropy(mdl(features), labels)
end

If I benchmark on a GPU for a given dataset:

@btime loss_fn($ps)

I get around 1 ms, but it allocates a lot of memory. When I run this loss function in parallel (on the same GPU), the allocations eventually fill up the VRAM and crash the program once the GPU is out of memory. I am not sure whether this is a bug in CUDA.jl or a product of the way Flux.destructure works that prevents CUDA.jl from properly garbage collecting the GPU memory.

Because of this, I have been looking at alternative frameworks that do not allocate and treat the parameters as a flat vector. I wrote up an example with SimpleChains.jl: it runs much faster than the Flux.jl CPU model (around 20 ms compared to 220 ms for Flux on the CPU) and barely allocates any memory, which is great, but I do not think SimpleChains.jl supports the GPU.
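
Roughly, the SimpleChains version looks like this (a sketch with placeholder layer sizes, not my exact model; here features is assumed to be a 4 × N Float32 matrix and labels a length-N vector of integer class indices):

using SimpleChains

mlp = SimpleChain(
    static(4),                           # input dimension
    TurboDense(SimpleChains.relu, 16),   # hidden layer
    TurboDense(identity, 2),             # output layer (logits)
)
mlp_loss = SimpleChains.add_loss(mlp, LogitCrossEntropyLoss(labels))

p = SimpleChains.init_params(mlp)        # flat parameter vector
loss = mlp_loss(features, p)             # evaluates the loss with essentially no allocation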

Are there any DL libraries which allow me to evaluate a loss for a given set of parameters, without allocating memory?

Also interested in the answer! I’m wondering if something like this is in the works for Lux.jl, @avikpal?

Can I ask why you’re using destructure and what “running in parallel” means? I ask because you should never use destructure unless you absolutely have to (e.g. for compatibility with SciML libraries, or for hypernetworks). As for running in parallel: if that means multithreading, I would suspect either the Julia GC not being able to keep up or CUDA.jl not handling memory pressure from multiple threads well.

How large are your model and input data? If SimpleChains is running much faster, I suspect you’ll gain little from using a GPU. Flux and Lux allocate a lot because they don’t (actually, can’t) mutate any inputs or outputs that could be differentiated. Future compiler optimizations like WIP: ImmutableArrays (using EA in Base) by ianatol · Pull Request #44381 · JuliaLang/julia · GitHub seek to ease that burden, but I assume you don’t want to wait that long!

Not really. Lux is actually worse than Flux when it comes to allocations. I tried adding a cached version of the layers around v0.2 (I think), but the gains were relatively minimal in the cases I was playing with.

Overall, I think the long-term plan is to leverage compiler optimizations, especially in the case of Lux, where we know the functions are pure (for the most part, aside from IO side effects).

I need to use destructure for my research, since I need the model’s parameters as a flat vector that I can manipulate. Previously, I wrote a wrapper that indexed the parameters directly within the Flux layers, but it was extremely slow.

By “parallel” I do mean multithreading. As a pseudocode example:

function run()
    # generate state (including the flat parameter vector)
    for i in 1:epochs
        mutate!(state)        # perturb the parameters
        l = loss_fn(state)    # evaluate the loss on the GPU
        # etc
    end
    # save results etc
end

function run_all()
    Threads.@threads for i in eachindex(samples)
        results[i] = run()
    end
end

I cannot use multiprocessing (Distributed) on the same GPU, since it runs out of memory and errors much faster. I explored this a while ago and was told to use threads instead, which has the same issue but doesn’t run into it as quickly. The main reason I want to avoid allocation is not really speed; it is so that I can run in parallel without pressuring the GC.

SimpleChains is only faster on the CPU; Flux on the GPU is 20-30 times faster. The only issue is that the GPU is only running a small model (a few thousand parameters) and would have the capacity to run several at the same time if the GC weren’t such an issue.

Although Lux won’t help with the allocations from running the model, as mentioned above, it seems like a better fit here because you can store all the parameters in a ComponentArray without destructuring/restructuring. I couldn’t find a definitive docs link for this, but I’m sure @avikpal has one.
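
For reference, the pattern I mean looks roughly like this (a minimal sketch with made-up layer sizes and a dummy batch, not your model). The forward pass itself still allocates as discussed above, but the parameters live in one flat vector with no destructure/restructure step:

using Lux, ComponentArrays, Random

rng = Random.default_rng()
model = Lux.Chain(Lux.Dense(4 => 16, relu), Lux.Dense(16 => 2))
ps, st = Lux.setup(rng, model)   # ps is a NamedTuple of parameters, st is the layer state
ps_ca = ComponentArray(ps)       # flat AbstractVector view with named components

x = randn(Float32, 4, 32)        # dummy input batch
y, _ = model(x, ps_ca, st)       # Lux models are called as model(x, ps, st)

ps_ca .+= 0.01f0 .* randn(Float32, length(ps_ca))   # manipulate the parameters as a flat vector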

On the GPU memory management side, one thing you could try is setting JULIA_CUDA_MEMORY_POOL=none and/or the other tricks mentioned in Memory management · CUDA.jl. Running a full GC + CUDA.reclaim() regularly is not exactly fast, but if it’s the difference between your model running or not, it may be worth it.
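
For example, something like this inside the per-thread loop (a sketch; epochs, mutate!, state, and loss_fn refer to your pseudocode above, and the cleanup interval of 100 iterations is an arbitrary choice to tune):

using CUDA

for i in 1:epochs
    mutate!(state)
    l = loss_fn(state)
    if i % 100 == 0
        GC.gc(true)      # full collection, frees Julia objects holding GPU buffers
        CUDA.reclaim()   # return cached, now-unused device memory to the driver
    end
end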

Lastly, a crazy idea if all else fails: most operations in Flux models call a corresponding mutating function in NNlib under the hood. Since you’re not taking gradients, you could call those directly. It would be a bit like using torch.nn.functional or JAX without a layers library, in that parameter and output arrays would have to be passed in manually, but it would also give you almost complete control over allocations.
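
To make that concrete, here is a rough sketch of a hand-rolled, preallocated forward pass for a small MLP (my own illustration with placeholder sizes, not what Flux does internally). For dense layers, LinearAlgebra.mul! plus fused in-place broadcasts are enough; NNlib has analogous in-place kernels (e.g. conv!) for other layer types, and the same calls should in principle work with CuArrays:

using LinearAlgebra, NNlib

# Preallocated output buffers for a 2-layer MLP
struct MLPBuffers{M<:AbstractMatrix}
    h::M      # hidden activations (hidden_dim × batch)
    out::M    # output logits (out_dim × batch)
end

function forward!(buf::MLPBuffers, W1, b1, W2, b2, x)
    mul!(buf.h, W1, x)                  # h = W1 * x, written into the buffer
    buf.h .= NNlib.relu.(buf.h .+ b1)   # fused in-place bias + activation
    mul!(buf.out, W2, buf.h)            # out = W2 * h
    buf.out .+= b2                      # in-place bias
    return buf.out
end

# Usage with dummy Float32 weights and a 32-sample batch of 4 features
W1, b1 = randn(Float32, 16, 4), zeros(Float32, 16)
W2, b2 = randn(Float32, 2, 16), zeros(Float32, 2)
buf = MLPBuffers(zeros(Float32, 16, 32), zeros(Float32, 2, 32))
x = randn(Float32, 4, 32)
logits = forward!(buf, W1, b1, W2, b2, x)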

Thanks! I’ll update this post with the results once I have tried JULIA_CUDA_MEMORY_POOL=none.

Unfortunately this didn’t work; I still get:

 nested task error: Out of GPU memory trying to allocate 84.500 MiB
    Effective GPU memory usage: 99.56% (23.592 GiB/23.697 GiB)
    No memory pool is in use.

But this error happens a lot quicker than when using the memory pool.

A shot from the hip: maybe you can tune the threshold for when memory is freed so that it works:

attribute!(memory_pool(device()), CUDA.MEMPOOL_ATTR_RELEASE_THRESHOLD, UInt64(9_000_000_000))  # release threshold in bytes (~9 GB here)

Also experiment with the number of threads. Have you done any analysis of the worst-case memory consumption as a function of the number of threads?

Note that reading memory utilization stats from the outside can be a bit misleading, as both Julia and CUDA do their own internal memory management.

Thanks for the suggestion! Unfortunately, I’ve tried it and still no luck, even when varying the number of threads and modifying the threshold.

On a single thread, given enough iterations (i.e. evaluations of that loss_fn), the memory of the card still fills up, but it never errors, since the GC seems to be able to keep up. As an estimate, I think that if the code were non-allocating, I would need less than 1 GB of memory per thread to run this.

Have you tried any of the suggestions listed after that? Full GC + reclamation in particular will force the GC to clean up whether it wants to or not.

Have you tried using https://github.com/oxinabox/AutoPreallocation.jl ?

AutoPreallocation doesn’t work with GPU arrays, and last I checked it can allocate more on Julia 1.7+…

I tried this a while ago and unfortunately it did not work.