Memory usage increasing with each epoch

I’m having a problem where memory usage gradually increases with each epoch when training large neural networks with Flux (v0.14.22) and CUDA (v5.5.2). At the same time, training appears to get progressively slower as the memory usage grows, and when this slowdown occurs my GPU is used much less effectively: at the start of training I see a fairly constant utilisation of around 70%, which drops to 40% after a few epochs.

Upon further investigation, I found that this problem only seems to affect convolutional networks, while vision transformers are largely unaffected. In an attempt to track down the issue, I created the following MWE:

using Flux, Metalhead, Random, Statistics, CUDA, cuDNN, Match
using Pipe: @pipe

function build_model(config::Symbol)
	@match config begin
		:ResNet => Flux.Chain(
			Metalhead.ResNet(18).layers[1], 
			Flux.GlobalMeanPool(),
			Flux.MLUtils.flatten, 
			Flux.Dense(512 => 1, sigmoid))
		:MobileNet => Flux.Chain(
			Metalhead.MobileNetv3(:small).layers[1], 
			Flux.GlobalMeanPool(),
			Flux.MLUtils.flatten, 
			Flux.Dense(576 => 1024, hardswish),
			Flux.Dropout(0.2),
			Flux.Dense(1024 => 1, sigmoid))
		:ViT => Flux.Chain(
			Metalhead.ViT(:tiny, pretrain=false, patch_size=(16,16)).layers[1], 
			Flux.LayerNorm(192),
			Flux.Dense(192 => 1, Flux.sigmoid))
	end
end

loss(model, x, y) = @pipe model(x) |> Flux.binarycrossentropy(_, y)

free_memory() = round(Sys.free_memory() / 2^30, digits=2)

imgs = rand(Float32, 224, 224, 3, 10000)
labels = rand([0.0f0, 1.0f0], 1, 10000)

μ = mean(imgs, dims=(1, 2, 4))
σ = std(imgs, dims=(1, 2, 4))
norm_imgs = (imgs .- μ) ./ σ

data = Flux.DataLoader((norm_imgs, labels), batchsize=16, shuffle=true, buffer=true)

model = build_model(:ViT) |> Flux.gpu
	
opt = Flux.Optimisers.Adam()
opt_state = Flux.Optimisers.setup(opt, model)

for epoch in 1:10
	for (x, y) in CUDA.CuIterator(data)
		grads = Flux.gradient(m -> loss(m, x, y), model)
		Flux.Optimisers.update!(opt_state, model, grads[1])
	end
	@info free_memory()
end

Here are the results for each of the model architectures (free system memory in GiB, as reported by free_memory(), logged once per epoch):

ViT:
46.99
47.0
46.96
46.93
46.91
46.88
46.89
46.94
46.86
46.7

ResNet18:
42.13
41.74
41.44
41.08
40.7
40.38
40.04
39.66
39.36
38.97

MobileNet:
37.89
37.02
36.28
35.42
34.62
33.88
33.04
32.25
31.45
30.65

As we can see, memory usage under ViT remains fairly constant throughout the training process, while ResNet18 and MobileNet both show a significant increase as training progresses. It looks like I’m not the first person to report this issue, but none of the proposed solutions appear to be working for me. Does anyone have any ideas what could be causing this?

This is a recurring problem with Flux and Julia in general. You can see similar topics here:

There can be a few problems:

  • If you have type instability, it consumes memory.
  • Sometimes you keep Julia compiling new versions of the same function. This usually happens when you do something like cat(x...) where x is an array or tuple whose length varies between calls.
  • Then, there might not be enough pressure on the GC. Try running GC.gc() after every iteration. It will be slow, but if the memory stops growing, you will know that this is the problem (see the sketch after this list).
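
A minimal sketch of that diagnostic, assuming the MWE’s definitions above (loss, data, free_memory, model, opt_state) are already loaded; the only change is the per-iteration GC.gc() call:

# Same training loop as in the MWE, with a full collection forced every step.
# This is purely diagnostic: if memory stops growing, GC pressure is the cause.
for epoch in 1:10
	for (x, y) in CUDA.CuIterator(data)
		grads = Flux.gradient(m -> loss(m, x, y), model)
		Flux.Optimisers.update!(opt_state, model, grads[1])
		GC.gc()   # slow, but rules GC pressure in or out
	end
	@info free_memory()
end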

Might be: Unrelated try-catch causes CUDA arrays to not be freed · Issue #52533 · JuliaLang/julia · GitHub

There might not be enough pressure on GC. Try to run GC.gc() after every iteration. It will be slow, but the memory will not grow, you will see that this solves the problem.

I tried running GC.gc(); CUDA.reclaim() after every 100 steps, but it makes no difference. Memory continues to grow at the same rate as before.

Might be: Unrelated try-catch causes CUDA arrays to not be freed · Issue #52533 · JuliaLang/julia · GitHub

I also tried removing logging, with the same result. As I said, the problem only occurs with CNNs, not with transformers, whereas I would expect that issue to affect both.

Setting environment variables like this

ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "3.0GiB"

or

ENV["JULIA_CUDA_SOFT_MEMORY_LIMIT"] = "3.0GiB"

can mitigate the issue.
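
For context, these limits are read when CUDA.jl initializes, so (as far as I know) they have to be set before CUDA.jl is loaded. A hypothetical sketch, using the "3.0GiB" value from above:

# Set the memory limit before CUDA.jl is loaded so it takes effect at initialization.
# "3.0GiB" is just the example value from above; adjust to your GPU.
ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "3.0GiB"   # or "JULIA_CUDA_SOFT_MEMORY_LIMIT"

using Flux, Metalhead, CUDA, cuDNN
# ... then build the model and run the training loop exactly as in the MWE above ...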

I tried again with a soft memory limit, but the results are the same as before. Here’s what happens to free memory with ResNet 18:

46.01
45.76
45.36
44.96
44.71
44.09
43.81
43.51
43.15
42.82

As we can see, the memory usage has grown by over 3GB after 6,250 steps, which is essentially identical to the previous result.

If I extrapolate these results to the dataset I’m trying to use, which contains around 1,000,000 samples, this works out to a growth of around 30GB every epoch (roughly 0.5MB per step, and at batch size 16 one epoch is 62,500 steps).

Depending on what’s causing the memory growth, I don’t think it can be extrapolated in such a way. For example, this could be CUDNN cache locking prevents finalizers resulting in OOMs · Issue #1461 · JuliaGPU/CUDA.jl · GitHub resurfacing.

For next steps, I would double-check that adding a hard memory limit plus GC.gc(false) every 10 or so steps doesn’t help memory usage. Another thing to try would be putting the training loop in its own function, to rule out problems caused by the global variables and by the loss-function closure being re-created on every step. If none of the above combined makes a difference, then there’s a deeper issue.

I’ve tried your suggestions by running Julia with julia --heap-size-hint=0.1G and modifying the training loop like so:

function train!(model, opt_state, data)
	for (i, (x, y)) in enumerate(CUDA.CuIterator(data))
		# Update Step
		step!(model, opt_state, x, y)
		
		# Free Memory
		if (i % 10) == 0
			GC.gc(false) #; CUDA.reclaim()
		end
	end
end

function step!(model, opt_state, x, y)
	grads = Flux.gradient(m -> Flux.binarycrossentropy(m(x), y), model)
	Flux.Optimisers.update!(opt_state, model, grads[1])
end

function train()
	# Construct Model
	model = build_model(:ResNet) |> Flux.gpu;
	
	# Initialize Optimizer
	opt = Flux.Optimisers.Adam()
	opt_state = Flux.Optimisers.setup(opt, model)

	# Train For 10 Epochs
	for epoch in 1:10
		train!(model, opt_state, data) # Train Model
		@info free_memory() # Log Free Memory
	end
end

train()

However, the results are exactly the same.

I also tried profiling the allocations in the forward pass for ConvNeXt after the slowdown occurs:

0.106326 seconds (51.98 k CPU allocations: 2.554 MiB, 51.95% gc time) (574 GPU allocations: 3.951 GiB, 80.80% memmgmt time)

It appears that after running for a certain number of steps, the forward pass spends the majority of its time managing memory. The result is that GPU utilisation drops from around 70% at the start of training to 30% after the first epoch.
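
For reference, that output has the shape produced by CUDA.@time; here is a sketch of how such a measurement can be reproduced on a single batch (the ConvNeXt model itself isn’t in the MWE’s build_model, so substitute whatever model you are testing):

# Sketch: time a single forward pass on one GPU batch. CUDA.@time reports CPU and
# GPU allocations as well as the share of time spent in GC and memory management.
x, y = first(CUDA.CuIterator(data))
CUDA.@time model(x)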

I’m still confused about why this doesn’t seem to happen with ViT or SWIN, but occurs with every CNN architecture that I’ve tried.

To clarify, I meant setting the JULIA_CUDA_HARD_MEMORY_LIMIT and JULIA_CUDA_SOFT_MEMORY_LIMIT environment variables as mentioned in Memory usage increasing with each epoch - #5 by CarloLucibello.

CNNs use conv functionality from cuDNN(.jl) which is internally quite a bit more complex because the CUDA API it wraps is more complex. Transformer models mostly do not rely on cuDNN.
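
One way to see that difference in isolation (my own sketch, not from the thread): time a lone Conv layer against a Dense layer on the GPU. The convolution dispatches to cuDNN (with its algorithm selection and workspace allocations), while the dense layer is a plain CUBLAS matrix multiply. The layer sizes below are arbitrary.

using Flux, CUDA, cuDNN

# A single convolution goes through cuDNN; a dense layer is a plain GEMM via CUBLAS.
conv  = Flux.Conv((3, 3), 3 => 64, pad=1) |> Flux.gpu
dense = Flux.Dense(224 * 224 * 3 => 64) |> Flux.gpu

x = CUDA.rand(Float32, 224, 224, 3, 16)

CUDA.@time conv(x)                    # cuDNN convolution path
CUDA.@time dense(reshape(x, :, 16))   # CUBLAS matmul path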

Setting ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "6.0GiB" doesn’t change anything. I’m running an RTX 3080 Ti, so I can use up to 12GB of VRAM.

Ok, this clears up some confusion. Based on the first few posts I thought you were measuring GPU memory consumption increases over time, but it turns out this is about host memory? I think others assumed the same.

Instead of just printing out system memory (which wouldn’t include VRAM usage), can you also show the output of CUDA.pool_status() over time? Or better yet, observe GPU memory usage more frequently with a program of your choice (the default system resource monitor, nvtop, etc.).
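
A sketch of what that logging could look like, assuming the train! and free_memory definitions from earlier in the thread:

# Log free host memory and the CUDA memory pool status once per epoch,
# to separate system-RAM growth from GPU (VRAM) pool growth.
for epoch in 1:10
	train!(model, opt_state, data)
	@info "epoch $epoch" free_memory()
	CUDA.pool_status()   # prints the state of CUDA.jl's GPU memory pool
end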
