Flux runs out of memory

A couple of months ago, I started training a network (the generator of a GAN) that caused GPU OOMs. It has ResNet blocks, long skip connections, and squeeze-and-excitation blocks, totaling more than 400M parameters. After some searching (CUDA.jl docs, issue 137, issue 149, PR 427, PR 33448, and this very discussion), this post suggested setting the environment variable ENV["JULIA_CUDA_MEMORY_POOL"] = "none" before loading CUDA.jl. In my case, this was the only way to get past the first batch (of 64 samples). Testing other configurations, I found that also calling GC.gc(); CUDA.reclaim() between batches gave a ~15% reduction in epoch time.
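For reference, here is a minimal sketch of that setup, using a hypothetical toy model and training loop rather than my actual GAN. The key points are that the memory-pool variable is set before CUDA.jl is loaded, and that GC.gc(); CUDA.reclaim() run between batches:

```julia
# The memory-pool setting must be in place before CUDA.jl is loaded,
# so it goes at the very top of the script.
ENV["JULIA_CUDA_MEMORY_POOL"] = "none"

using CUDA, Flux

# Hypothetical small model and random data, just to illustrate the loop structure.
model = Chain(Dense(128 => 256, relu), Dense(256 => 1)) |> gpu
opt_state = Flux.setup(Adam(1f-4), model)
batches = [(CUDA.rand(Float32, 128, 64), CUDA.rand(Float32, 1, 64)) for _ in 1:10]

for (x, y) in batches
    grads = Flux.gradient(model) do m
        Flux.mse(m(x), y)
    end
    Flux.update!(opt_state, model, grads[1])

    # Free unreferenced GPU buffers between batches; in my case this combination
    # is what also gave the ~15% reduction in epoch time.
    GC.gc()
    CUDA.reclaim()
end
```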

That’s the setup I’ve been using since. As a side note, I recently replicated my NN in PyTorch to see whether performance would be better. In the forward pass through this NN, more than 10 GB get allocated on my GPU (RTX 2060, 12 GB) and I get a GPU OOM. I tried the same changes I used with Flux.jl (forcing GC and disabling GPU caching), but there was no improvement.

Other adjustments to my PyTorch implementation don’t seem trivial enough to be worth it, considering I can train my model in Flux. Still, it would have been helpful to have all of this GPU OOM discussion (and news about it) concentrated in one place. The interactions between Julia’s GC, CUDA.jl, and Flux.jl are discussed from time to time, but as a user I’d like to know the common pitfalls and best practices when I first encounter this issue.
