Memory usage increasing with each epoch

I’m having a problem where memory usage is gradually increasing with each epoch when training large neural networks with Flux (v0.14.22) and CUDA (v5.5.2). At the same time, training appears to get progressively slower as the memory usage grows. When this slowdown occurs, I observe that my GPU is being used much less effectively. For example, at the start of training I will see a constant usage of around 70%, which drops to 40% after a few epochs.

Upon further investigation, I found that this problem only seems to affect convolutional networks, while vision transformers are largely unaffected. In an attempt to track down the issue, I created the following MWE:

using Flux, Metalhead, Random, Statistics, CUDA, cuDNN, Match
using Pipe: @pipe

function build_model(config::Symbol)
	@match config begin
		:ResNet => Flux.Chain(
			Metalhead.ResNet(18).layers[1], 
			Flux.GlobalMeanPool(),
			Flux.MLUtils.flatten, 
			Flux.Dense(512 => 1, sigmoid))
		:MobileNet => Flux.Chain(
			Metalhead.MobileNetv3(:small).layers[1], 
			Flux.GlobalMeanPool(),
			Flux.MLUtils.flatten, 
			Flux.Dense(576 => 1024, hardswish),
			Flux.Dropout(0.2),
			Flux.Dense(1024 => 1, sigmoid))
		:ViT => Flux.Chain(
			Metalhead.ViT(:tiny, pretrain=false, patch_size=(16,16)).layers[1], 
			Flux.LayerNorm(192),
			Flux.Dense(192 => 1, Flux.sigmoid))
	end
end

loss(model, x, y) = @pipe model(x) |> Flux.binarycrossentropy(_, y)

free_memory() = round(Sys.free_memory() / 2^30, digits=2)

imgs = rand(Float32, 224, 224, 3, 10000)
labels = rand([0.0f0, 1.0f0], 1, 10000)

μ = mean(imgs, dims=(1, 2, 4))
σ = std(imgs, dims=(1, 2, 4))
norm_imgs = (imgs .- μ) ./ σ

data = Flux.DataLoader((norm_imgs, labels), batchsize=16, shuffle=true, buffer=true)

model = build_model(:ViT) |> Flux.gpu
	
opt = Flux.Optimisers.Adam()
opt_state = Flux.Optimisers.setup(opt, model)

for epoch in 1:10
	for (x, y) in CUDA.CuIterator(data)
		grads = Flux.gradient(m -> loss(m, x, y), model)
		Flux.Optimisers.update!(opt_state, model, grads[1])
	end
	@info free_memory()
end

Here are the results for each of the different model architectures:

ViT:
46.99
47.0
46.96
46.93
46.91
46.88
46.89
46.94
46.86
46.7

ResNet18:
42.13
41.74
41.44
41.08
40.7
40.38
40.04
39.66
39.36
38.97

MobileNet:
37.89
37.02
36.28
35.42
34.62
33.88
33.04
32.25
31.45
30.65

As we can see, memory usage under ViT remains fairly constant throughout the training process, while ResNet18 and MobileNet both show a significant increase as training progresses. It looks like I’m not the first person to report this issue, but none of the proposed solutions appear to be working for me. Does anyone have any ideas what could be causing this?

This is a recurring problem with Flux and Julia in general; you can find several similar topics on this forum.

There can be a few problems:

  • If you have type instability, it consumes memory.
  • Sometimes you keep Julia compiling new versions of the same function. This usually happens when you do something like cat(x...) where x is an array or tuple whose length changes between calls.
  • Then, there might not be enough pressure on the GC. Try running GC.gc() after every iteration. It will be slow, but the memory will not grow; you will see that this solves the problem (see the sketch below).
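
For the last point, here is a minimal sketch of what that would look like in the inner loop of your MWE (only the GC.gc() call is new):

for (x, y) in CUDA.CuIterator(data)
	grads = Flux.gradient(m -> loss(m, x, y), model)
	Flux.Optimisers.update!(opt_state, model, grads[1])
	GC.gc() # force a full collection after every step; slow, but rules out insufficient GC pressure
end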

This might be related: Unrelated try-catch causes CUDA arrays to not be freed · Issue #52533 · JuliaLang/julia · GitHub

There might not be enough pressure on the GC. Try running GC.gc() after every iteration. It will be slow, but the memory will not grow; you will see that this solves the problem.

I tried running GC.gc(); CUDA.reclaim() after every 100 steps, but it makes no difference. Memory continues to grow at the same rate as before.

This might be related: Unrelated try-catch causes CUDA arrays to not be freed · Issue #52533 · JuliaLang/julia · GitHub

I also tried removing logging, with the same result. As I said, the problem only occurs with CNNs, not with transformers, whereas I would expect that issue to affect both.

Setting environment variables like this

ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "3.0GiB"

or

ENV["JULIA_CUDA_SOFT_MEMORY_LIMIT"] = "3.0GiB"

can mitigate the issue.
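
These variables are read by CUDA.jl itself, so setting them before using CUDA (or exporting them in the shell before launching Julia) guarantees they are in effect by the time the memory pool is configured. A minimal sketch; the 3 GiB value is just the example from above:

# limit CUDA.jl's memory pool; set before `using CUDA` so it is guaranteed to be picked up
ENV["JULIA_CUDA_SOFT_MEMORY_LIMIT"] = "3.0GiB"

using CUDA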

I tried again with a soft memory limit, but the results are the same as before. Here’s what happens to free memory with ResNet18:

46.01
45.76
45.36
44.96
44.71
44.09
43.81
43.51
43.15
42.82

As we can see, the memory usage has grown by over 3GB after 6,250 steps, which is essentially identical to the previous result.

If I extrapolate these results to the dataset I’m trying to use, which contains around 1,000,000 samples, this results in a growth of around 30GB every epoch.

Depending on what’s causing the memory growth, I don’t think it can be extrapolated in such a way. For example, this could be “CUDNN cache locking prevents finalizers resulting in OOMs” (Issue #1461 · JuliaGPU/CUDA.jl · GitHub) resurfacing.

For next steps, I would double-check that adding a hard memory limit plus GC.gc(false) every 10 or so steps doesn’t help memory usage. Another thing to try would be putting the training loop in its own function, to make sure that the global variables and the re-created loss-function closure are not causing problems. If none of the above combined makes a difference, then there’s a deeper issue.

I’ve tried your suggestions by running Julia with julia --heap-size-hint=0.1G and modifying the training loop like so:

function train!(model, opt_state, data)
	for (i, (x, y)) in enumerate(CUDA.CuIterator(data))
		# Update Step
		step!(model, opt_state, x, y)
		
		# Free Memory
		if (i % 10) == 0
			GC.gc(false)#; CUDA.reclaim()
		end
	end
end

function step!(model, opt_state, x, y)
	grads = Flux.gradient(m -> Flux.binarycrossentropy(m(x), y), model)
	Flux.Optimisers.update!(opt_state, model, grads[1])
end

function train()
	# Construct Model
	model = build_model(:ResNet) |> Flux.gpu;
	
	# Initialize Optimizer
	opt = Flux.Optimisers.Adam()
	opt_state = Flux.Optimisers.setup(opt, model)

	# Train For 10 Epochs
	for epoch in 1:10
		train!(model, opt_state, data) # Train Model
		@info free_memory() # Log Free Memory
	end
end

train()

However, the results are exactly the same.

I also tried profiling the allocations in the forward pass for ConvNeXt after the slow-down occurs:

0.106326 seconds (51.98 k CPU allocations: 2.554 MiB, 51.95% gc time) (574 GPU allocations: 3.951 GiB, 80.80% memmgmt time)

It appears that after running for a certain number of steps, the forward pass spends the majority of its time managing memory. The result is that GPU utilisation drops from around 70% at the start of training to 30% after the first epoch.
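
For reference, a measurement like the one above can be taken with CUDA.@time around a single forward pass. A minimal sketch, where x stands for one batch that is already on the GPU:

# time one forward pass; reports CPU/GPU allocations, GC time and memory-management time
CUDA.@time model(x);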

I’m still confused about why this doesn’t seem to happen with ViT or SWIN, but occurs with every CNN architecture that I’ve tried.

To clarify, I meant setting the JULIA_CUDA_HARD_MEMORY_LIMIT and JULIA_CUDA_SOFT_MEMORY_LIMIT environment variables as mentioned in Memory usage increasing with each epoch - #5 by CarloLucibello.

CNNs use conv functionality from cuDNN(.jl), which is internally quite a bit more complex because the CUDA API it wraps is more complex. Transformer models mostly do not rely on cuDNN.

Setting ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "6.0GiB" doesn’t change anything. I’m running an RTX 3080 Ti, so I can use up to 12GB of VRAM.

Ok, this clears up some confusion. Based on the first few posts I thought you were measuring GPU memory consumption increases over time, but it turns out this is about host memory? I think others assumed the same.

Instead of just printing out system memory (which wouldn’t include VRAM usage), can you also show the output of CUDA.pool_status() over time? Or better yet, observe GPU memory usage more frequently using a program of your choice (default system resource monitor, nvtop, etc.).
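
For example, a small tweak to the epoch loop from the original MWE (only the CUDA.pool_status() call is new) would show both host memory and the GPU memory pool over time:

for epoch in 1:10
	for (x, y) in CUDA.CuIterator(data)
		grads = Flux.gradient(m -> loss(m, x, y), model)
		Flux.Optimisers.update!(opt_state, model, grads[1])
	end
	@info free_memory() # host memory, as before
	CUDA.pool_status()  # GPU memory pool usage and reserved memory
end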


Yes, I was talking about host memory.

So after several weeks of experiments, I’ve observed the following:

  1. Increasing the batch size seems to alleviate the problem for smaller networks. For example, increasing the batch size from 16 to 256 for ResNet18 results in memory usage periodically decreasing after gradually increasing for multiple epochs. Maybe large batches put more pressure on the GC? Unfortunately, this won’t work for larger models like ResNet50, where the largest batch size I can use is 32 before getting an OOM error.
  2. After training for 50 epochs on ImageNet, I found that transformers are affected, but to a much smaller degree than convolutional models. For example, memory consumption with ViT-B only increases by a few GB after 10 epochs, compared to ResNet50, which grows by over 40GB in the same time period.
  3. VRAM usage remains constant from start to finish, while host memory gradually grows. At a certain point, GPU usage declines significantly and CUDA.@time shows that memory management is taking up a significant chunk of resources. Unless I use a very large batch size, memory usage will continue to grow until the system is forced to use swap memory. Note that the slowdown occurs before this point, so it isn’t due to slow disk-reads.

Here’s the result of calling CUDA.@time on a forward pass at the start of training:

0.039036 seconds (50.43 k CPU allocations: 4.101 MiB, 10.63% gc time) (633 GPU allocations: 4.283 GiB, 5.60% memmgmt time)
0.041121 seconds (50.43 k CPU allocations: 4.098 MiB, 16.88% gc time) (633 GPU allocations: 4.283 GiB, 5.21% memmgmt time)
0.041562 seconds (50.43 k CPU allocations: 4.095 MiB, 16.48% gc time) (633 GPU allocations: 4.283 GiB, 5.11% memmgmt time)
0.039033 seconds (50.43 k CPU allocations: 4.089 MiB, 9.66% gc time) (633 GPU allocations: 4.283 GiB, 5.12% memmgmt time)
0.041621 seconds (50.43 k CPU allocations: 4.101 MiB, 16.08% gc time) (633 GPU allocations: 4.283 GiB, 5.13% memmgmt time)
0.041593 seconds (50.43 k CPU allocations: 4.098 MiB, 16.70% gc time) (633 GPU allocations: 4.283 GiB, 4.79% memmgmt time)
0.038278 seconds (50.43 k CPU allocations: 4.095 MiB, 9.60% gc time) (633 GPU allocations: 4.283 GiB, 5.24% memmgmt time)
0.041335 seconds (50.43 k CPU allocations: 4.083 MiB, 16.41% gc time) (633 GPU allocations: 4.283 GiB, 4.77% memmgmt time)

And this is after 10 epochs:

0.511448 seconds (50.63 k CPU allocations: 4.139 MiB, 91.59% gc time) (633 GPU allocations: 4.282 GiB, 94.36% memmgmt time)
0.200058 seconds (50.43 k CPU allocations: 4.123 MiB, 79.52% gc time) (633 GPU allocations: 4.282 GiB, 1.73% memmgmt time)
0.436915 seconds (51.15 k CPU allocations: 4.150 MiB, 91.40% gc time) (633 GPU allocations: 4.282 GiB, 62.98% memmgmt time)
0.200881 seconds (50.77 k CPU allocations: 4.134 MiB, 79.05% gc time) (633 GPU allocations: 4.282 GiB, 82.68% memmgmt time)
0.229609 seconds (50.76 k CPU allocations: 4.131 MiB, 82.70% gc time) (633 GPU allocations: 4.282 GiB, 1.25% memmgmt time)
0.179896 seconds (50.77 k CPU allocations: 4.130 MiB, 77.77% gc time) (633 GPU allocations: 4.282 GiB, 79.87% memmgmt time)
0.172918 seconds (50.75 k CPU allocations: 4.125 MiB, 77.78% gc time) (633 GPU allocations: 4.282 GiB, 1.52% memmgmt time)

Here’s the output of nvtop at the start of training and after 10 epochs on ImageNet with ViT-B:


As you can see, GPU memory is the same, but host memory has grown from 3579MB to 10254MB. The results are much worse with convolutional models, which require me to restart the process every couple of epochs due to memory growth and training slow-down.

What does the code run to get these results look like? If there are any modifications from the original code, please make sure they’re reflected in the new code snippet here.

Another thing to look at would be taking a heap snapshot after 10+ epochs: Profiling · The Julia Language. VS Code and Chromium-based browsers can both view them. Clearly something is sticking around and Julia is refusing to GC it.
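
A minimal sketch of taking one on a recent Julia version (the file name is arbitrary):

using Profile

# write a heap snapshot to disk; open the resulting file in VS Code or a
# Chromium-based browser's memory profiler to see what is being retained
Profile.take_heap_snapshot("after_10_epochs.heapsnapshot")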


After analysing the heap, I think I’ve finally figured out the problem.

Here’s a screenshot from the heap snapshot:

As we can see, a large number of debug logs are being stored, adding up to 341 MB of memory.

It seems that these logs are being produced by cuDNN and are then being output by PlutoLogger:

These logs are reproduced for every forward pass, which quickly adds up to a substantial amount of memory. I believe the reason ViT is less affected is that it only contains a single convolutional layer for the patch embedding.

To test this theory, I tried converting the notebook to a standard Julia program that I can execute in the terminal. After running for 10 epochs, I can report that memory usage remains stable throughout the training process as seen in the following logs:

[ Info: 36.74
Effective GPU memory usage: 73.48% (8.566 GiB/11.658 GiB)
Memory pool usage: 1.530 GiB (7.906 GiB reserved)
[ Info: 37.03
Effective GPU memory usage: 73.40% (8.557 GiB/11.658 GiB)
Memory pool usage: 1.530 GiB (7.906 GiB reserved)
[ Info: 37.02
Effective GPU memory usage: 73.46% (8.564 GiB/11.658 GiB)
Memory pool usage: 1.530 GiB (7.906 GiB reserved)
[ Info: 37.02
Effective GPU memory usage: 73.92% (8.617 GiB/11.658 GiB)
Memory pool usage: 1.530 GiB (7.906 GiB reserved)
[ Info: 37.02
Effective GPU memory usage: 73.75% (8.598 GiB/11.658 GiB)
Memory pool usage: 1.530 GiB (7.906 GiB reserved)
[ Info: 36.99
Effective GPU memory usage: 73.74% (8.597 GiB/11.658 GiB)
Memory pool usage: 1.530 GiB (7.906 GiB reserved)
[ Info: 37.0
Effective GPU memory usage: 73.71% (8.593 GiB/11.658 GiB)
Memory pool usage: 1.530 GiB (7.906 GiB reserved)
[ Info: 36.98
Effective GPU memory usage: 73.72% (8.595 GiB/11.658 GiB)
Memory pool usage: 1.530 GiB (7.906 GiB reserved)
[ Info: 36.97
Effective GPU memory usage: 73.60% (8.581 GiB/11.658 GiB)
Memory pool usage: 1.530 GiB (7.906 GiB reserved)
[ Info: 36.96
Effective GPU memory usage: 73.58% (8.578 GiB/11.658 GiB)
Memory pool usage: 1.530 GiB (7.906 GiB reserved)

Now that we know what’s causing the issue, how can I go about resolving it? It looks like the current master branch of cuDNN.jl registers a callback to log debug messages in the __init__ method:

function __init__()
    precompiling = ccall(:jl_generating_output, Cint, ()) != 0

    CUDA.functional() || return

    # find the library
    global libcudnn
    if CUDA.local_toolkit
        dirs = CUDA_Runtime_Discovery.find_toolkit()
        path = CUDA_Runtime_Discovery.get_library(dirs, "cudnn"; optional=true)
        if path === nothing
            precompiling || @error "cuDNN is not available on your system (looked in $(join(dirs, ", ")))"
            return
        end
        libcudnn = path
    else
        if !CUDNN_jll.is_available()
            precompiling || @error "cuDNN is not available for your platform ($(Base.BinaryPlatforms.triplet(CUDNN_jll.host_platform)))"
            return
        end
        libcudnn = CUDNN_jll.libcudnn
    end

    # register a log callback
    if !precompiling && (isdebug(:init, cuDNN) || Base.JLOptions().debug_level >= 2)
        log_cond[] = Base.AsyncCondition() do async_cond
            message = Base.@lock log_lock popfirst!(log_messages)
            _log_message(message...)
        end

        callback = @cfunction(log_message, Nothing,
                              (cudnnSeverity_t, Ptr{Cvoid}, Ptr{cudnnDebug_t}, Ptr{UInt8}))
        cudnnSetCallback(typemax(UInt32), log_cond[], callback)
    end

    _initialized[] = true
end

This seems to be an unusual design choice, and it isn’t obvious to me how one would go about suppressing such messages from within Pluto.
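
One untested idea, based on the gate in the snippet above (isdebug(:init, cuDNN) || Base.JLOptions().debug_level >= 2): make sure the active logger does not accept Debug-level messages at the time cuDNN’s __init__ runs, so the callback is never registered. A sketch, with the caveat that it may not help if Pluto installs its logger per-task rather than globally:

using Logging

# replace the global logger with one that drops Debug-level messages *before*
# cuDNN (or anything that loads it, e.g. Flux/Metalhead) is imported
global_logger(ConsoleLogger(stderr, Logging.Info))

using cuDNN # loaded only after the logger has been swapped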


Great investigation! Debug logging should be off by default, so if you’re not enabling it manually then something else must be. I’m not familiar enough with Pluto to know if it does, but that may be worth looking into.
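
If it helps, here is a quick check one could run from inside the notebook to see whether the active logger (PlutoLogger in this case) accepts Debug-level records, which is roughly what the isdebug check in __init__ keys off:

using Logging

# true means Debug-level records are enabled for the active logger, which would
# explain why cuDNN registers its log callback under Pluto
Logging.min_enabled_level(Logging.current_logger()) <= Logging.Debug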

Just a wild guess, but isn’t Pluto perhaps messing with the usual assumptions for precompilation by doing its own pkg thing?