Memory challenges for Flux on Resnet

I’m attempting to reproduce Resnet (from Metalhead.jl) performance on ImageNet through simple demos on a single-GPU setup (I’m on an RTX A4000, 16GB, ~6000 CUDA cores) and stumbled upon two memory-related challenges.

The first concern is that the batch size I can train with is limited compared to other frameworks. For example, in this Gluon/MXNet tutorial, Resnet50 is trained with a batch size of 64 per GPU (256 split across 4 GPUs) with 12GB each. In the following smaller reproducible example, resnet-base-curand.jl, the batch size had to be limited to 20, which is over 3X smaller despite having 16GB of memory rather than 12GB. Although I’m aware that Julia isn’t currently very aggressive about reclaiming memory, are such limitations also expected for GPU code, where the bulk of the work should be CUDNN wrappers around convolution operators?

Another issue is the build-up of CPU RAM during training. This happens only when the program performs both the image read and transform and the Flux gradient pass. For example, there’s no RAM issue when performing only the image loading step, as in test-loader.jl, nor in the Flux-only steps above. I could thus only reproduce it with resnet-base.jl, which requires ImageNet to be available on the machine.
CPU RAM usage grows steadily across batches, starting from around 12GB and reaching the full 64GB available after roughly 1000 batches.

I’m unclear whether this memory leak is more likely to come from DataLoaders or from Zygote. On one hand, the model gradient pass should not involve much CPU work; on the other, the test running only the DataLoaders without any Flux model works fine. Also, CPU memory usage doesn’t seem to build up when launching with a single thread, as opposed to the 6-8 threads needed for decent training speed. The build-up can be avoided by adding batch % 200 == 0 && GC.gc(true) within the loop for (batch, (x, y)) in enumerate(CuIterator(dtrain)), though having to add such a step seems like an anomaly. Could this issue be related to some of the GC topics recently discussed at JuliaCon?
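For readers who want to try the workaround, here is a minimal sketch of the periodic-GC pattern in plain Julia. The helper name foreach_with_periodic_gc and its step callback are hypothetical and only for illustration; in the setup described above, the iterator would be CuIterator(dtrain) and the callback would run the gradient update.

```julia
# Hypothetical helper: call `step(batch, item)` on each element of `iter`
# and force a full garbage collection every `every` batches.
function foreach_with_periodic_gc(step, iter; every::Integer = 200)
    n_collections = 0
    for (batch, item) in enumerate(iter)
        step(batch, item)
        if batch % every == 0
            GC.gc(true)          # full collection, same call as the workaround
            n_collections += 1
        end
    end
    return n_collections         # how many forced collections were triggered
end

# Stand-in usage with dummy data instead of a real training step.
n = foreach_with_periodic_gc((batch, x) -> nothing, 1:1000)
println("forced collections: ", n)
```

The every keyword makes it easy to check whether a longer interval than 200 is still enough to keep RAM flat.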

Finally, I’ve yet to match published results on Resnet34 and 50, although I got fairly close with 66%-68% top-1 accuracy. I’d be curious to know if anyone has had success doing so on a single-GPU setup and would have a reproducible script. Such model fitting feels like a 101 for a DL framework, so I’d be interested to see a reproducible Flux recipe for it.

For info, 1 epoch for Resnet34 takes 3600 sec, which I consider good, though it climbs to 7200 sec with Resnet50, which is likely due to the limited batch size.


Using ENV["JULIA_CUDA_MEMORY_POOL"] = "none" makes it possible to train a Resnet50 model with a larger batch size. I could actually go up to 64 on a 16GB A4000 (compared to a max batch size of 20 with the default memory pool). However, disabling the CUDA memory pool results in a significant 3X slowdown.

With memory pool - resnet50 - batch size 20
70.152556 seconds (37.26 M allocations: 2.238 GiB, 1.74% gc time)

Without memory pool - resnet50 - batch size 20
213.827493 seconds (37.20 M allocations: 2.235 GiB, 0.65% gc time)

(code for the experiment is here).
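As a quick sanity check of the "3X" figure, the ratio can be computed directly from the two timings above. This is only arithmetic on the reported numbers; the variable names are made up for illustration. Note that the allocation counts in both runs are nearly identical, which would be consistent with them being host-side allocations as reported by @time rather than GPU memory.

```julia
# Wall-clock times reported above, in seconds (Resnet50, batch size 20).
t_pool    = 70.152556     # default CUDA memory pool
t_no_pool = 213.827493    # JULIA_CUDA_MEMORY_POOL = "none"

# Slowdown factor from disabling the pool.
slowdown = t_no_pool / t_pool
println("slowdown without pool: ", round(slowdown; digits = 2), "x")
```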

Is there a way to get the best of both worlds: accommodate large batch sizes, as disabling the memory pool allows, while retaining the performance associated with the memory pool? @maleadt


FWIW it is also my experience that PyTorch GPU memory management beats Julia by 2x on big convnet workloads. Personally my solution is to use PyTorch in such situations. But I would love to come back to Julia for these tasks. See also here.


I think there are some missed opportunities to save memory by using things like conv_bias_act! instead of separate broadcasts. At most a factor of 2, IIRC. Needs someone to push it along.

Regarding conv_bias_act!, it looks like that is the forward pass defined for CUDNN in NNlibCUDA. It states that only relu activation is supported, so I’m assuming it gets picked for Resnet.

However, I’m unclear about how the backward pass is handled, as I couldn’t find any sort of ∇conv_bias_act!. Would the remaining work consist of defining an rrule for conv_bias_act!?

Sorry I should have put a link: #346.

The dense half of that PR is disappointing. It still saves memory but not time, and better broadcasting can get you half the saving I think anyway. Diffractor + CR broadcasting:

julia> @btime gradient((w,b) -> sum(bias_act!(relu, w, b)), $w, $b);
  min 17.250 μs, mean 27.097 μs (57 allocations, 80.49 KiB)

julia> @btime gradient((w,b) -> sum(relu.(w .+ b)), $w, $b);
  min 12.375 μs, mean 28.489 μs (45 allocations, 119.42 KiB)

That’s on the CPU, not sure I did serious GPU timing, maybe things would be different.
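To make the memory argument concrete, here is a rough CPU-only sketch of the idea behind a fused bias plus relu, assuming nothing about how the PR actually implements it. The names bias_relu and fused_bias_relu! are hypothetical, and relu is defined locally so the snippet has no dependencies. The point is that relu's derivative can be recovered from the output alone, so the input can be overwritten in place without keeping a separate copy for the backward pass.

```julia
# Local definition so the sketch does not depend on NNlib.
relu(x) = max(x, zero(x))

# Unfused reference: allocates a fresh array for the result.
bias_relu(x, b) = relu.(x .+ b)

# Hypothetical fused variant: overwrites `x` and returns it with a pullback.
function fused_bias_relu!(x::AbstractMatrix, b::AbstractVector)
    x .= relu.(x .+ b)               # one fused broadcast, written into `x`
    y = x                            # the output aliases the input buffer
    function pullback(Δ)
        dx = Δ .* (y .> 0)           # relu gradient mask taken from the output
        db = vec(sum(dx; dims = 2))  # bias gradient: sum over the batch dimension
        return dx, db
    end
    return y, pullback
end

x = [1.0 -2.0; -3.0 4.0]
b = [0.5, -0.5]
y, back = fused_bias_relu!(copy(x), b)
dx, db = back(ones(2, 2))
```

Whether this saves time as well as memory is exactly what the benchmark above is probing; the sketch only illustrates where the memory saving comes from.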

I think this isn’t used at all by Flux right now, for the lack of gradient handling.

This is also done in that PR, and I thought it worked, but haven’t looked again. IIRC conv_bias_act! forwards made quite a difference on GPU. The PR aims to use it as much as possible.


Thanks for the clarifications! Sorry about the tangential question, but is Diffractor now meant to be put to use? I thought Enzyme was where everyone was headed, but I also heard Jeff mentioning Diffractor at JuliaCon, so I’m confused here!

It’s good to get stuff like conv_bias_act in, but I think it’s more important to figure out the source of these OOMs so that we’re not just kicking the can down the road for larger models. Is it fragmentation in the memory pool? CUDA libraries allocating outside the purview of the pool? Something else entirely?

After discussing with @maleadt and @MirekKratochvil on Slack, another idea was to tune the memory pool release threshold to try to find a happy medium. Can you try the following?

# ... device!(0) ...
release_threshold = 0
attribute!(memory_pool(device()), CUDA.MEMPOOL_ATTR_RELEASE_THRESHOLD, UInt(release_threshold))

I would also try release thresholds of 1MB, 100MB and 1GB if you have the chance.
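A small sketch of how such a sweep could be organised, assuming binary units (1 MB taken as 2^20 bytes) and a caller-supplied benchmark. The function sweep_release_thresholds and its two callback arguments are hypothetical; the setter is passed in so the helper itself does not depend on a GPU being present.

```julia
# Candidate release thresholds in bytes (assumption: 1 MB = 2^20 bytes).
const MiB = 2^20
const RELEASE_THRESHOLDS = UInt[0, 1 * MiB, 100 * MiB, 1024 * MiB]

# Hypothetical helper: for each threshold, call `set_threshold!(bytes)`,
# then time `run_benchmark()` and record the elapsed seconds.
function sweep_release_thresholds(run_benchmark, set_threshold!;
                                  thresholds = RELEASE_THRESHOLDS)
    results = Pair{UInt,Float64}[]
    for t in thresholds
        set_threshold!(t)
        elapsed = @elapsed run_benchmark()
        push!(results, t => elapsed)
    end
    return results
end

# Dry run with stand-ins, just to show the shape of the result.
results = sweep_release_thresholds(() -> sum(rand(1000)), t -> nothing)
```

With CUDA.jl loaded, the setter could be t -> attribute!(memory_pool(device()), CUDA.MEMPOOL_ATTR_RELEASE_THRESHOLD, t), i.e. the call from the snippet above. It may be safer to run each setting in a fresh Julia session so that earlier pool state does not affect later timings.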
