Flux runs out of memory

I’m training a conv net on a 2000x2000 pixel image with Flux/Zygote on an RTX 3070 (mobile) with 8GB of VRAM, and I’m getting an out-of-memory exception when the gradients are calculated:

ERROR: LoadError: Out of GPU memory trying to allocate 641.462 MiB
Effective GPU memory usage: 96.70% (7.735 GiB/8.000 GiB)
Memory pool usage: 5.872 GiB (6.000 GiB reserved)

My MWE:

using CUDA
using Flux

img = rand(Float32, 2000, 2000, 1, 1) |> gpu

model = Chain(
    Conv((5, 5), 1=>16, sigmoid),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu)
) |> gpu

ps = params(model)
loss = Flux.mse
opt = ADAM(0.001)

test = model(img)

target = rand(Float32, size(test)...) |> gpu

for epoch in 1:1
    gs = Flux.gradient(ps) do
            loss(model(img), target)
        end

    Flux.Optimise.update!(opt, ps, gs)
end

Am I doing something stupid? I would expect this NN architecture to fit into memory without trouble.

The model itself comfortably fits into memory, but I wonder if the activations and intermediate allocations might be pushing things over the edge.

# img = rand(Float32, 25, 25, 1, 1) |> gpu; test = model(img); target = rand(Float32, size(test)...) |> gpu;
julia> CUDA.@time gradient(() -> Flux.mse(model(img), target), ps);
 33.297392 seconds (86.89 M CPU allocations: 4.516 GiB, 4.16% gc time) (60 GPU allocations: 13.179 MiB, 0.00% memmgmt time)

julia> CUDA.@time gradient(() -> Flux.mse(model(img), target), ps);
  0.716580 seconds (68.97 k CPU allocations: 3.692 MiB) (48 GPU allocations: 444.008 KiB, 0.01% memmgmt time)

Switching to the large input still OOMs after this, but I can make things work by forcing a full GC and telling CUDA to clear cached memory.

julia> GC.gc(true); CUDA.reclaim();

julia> CUDA.@time gradient(() -> Flux.mse(model(img), target), ps);
  2.497996 seconds (136.12 k CPU allocations: 7.278 MiB, 31.56% gc time) (71 GPU allocations: 12.529 GiB, 10.02% memmgmt time)

julia> CUDA.@time gradient(() -> Flux.mse(model(img), target), ps);
  1.724317 seconds (80.52 k CPU allocations: 4.334 MiB, 44.30% gc time) (67 GPU allocations: 9.652 GiB, 4.16% memmgmt time)

This is on an RTX 2070, also with 8GB of VRAM.

So the first pass allocates quite a bit more (perhaps because it’s compiling the gradient machinery as it goes). It’s not a great workaround, but warming up with a small input, aggressively collecting memory and then using the large input works.
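
For reference, the warm-up sequence is just the following (a rough sketch using the model and ps from the original post; sizes assume that 6-layer model, where each 5x5 conv trims 4 pixels off each spatial dimension, so 2000 -> 1976):

# compile pass on a tiny input
img_small = CUDA.rand(Float32, 25, 25, 1, 1)
target_small = CUDA.rand(Float32, 1, 1, 16, 1)
gradient(() -> Flux.mse(model(img_small), target_small), ps)

# hand cached GPU memory back before the big run
GC.gc(true); CUDA.reclaim()

# now run the full-size input
img = CUDA.rand(Float32, 2000, 2000, 1, 1)
target = CUDA.rand(Float32, 1976, 1976, 16, 1)
gradient(() -> Flux.mse(model(img), target), ps)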

The bigger question is where all these allocations come from. Assuming optimal memory (re)use, we have (2000, 2000, 1, 1) -> Conv((5, 5), 1=>16) -> (1996, 1996, 16, 1) for the first layer. Each subsequent layer is (W, W, 16, 1) -> Conv((5, 5), 16=>16) -> (W-4, W-4, 16, 1). In total, that’s 6 activation arrays with ~378 million elements. At 4 bytes each (Float32), that’s around 1.4 GiB. Model params and optimizer state are negligible in comparison (<10 MiB, I believe). I can’t account for all of the allocations either (back-of-the-napkin math says 48 for forward, backward and parameters), so CuDNN may be doing quite a bit of it behind the scenes (+18 CuDNN calls for forward and backward would be 66, much closer to the reported 67-71).
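
As a quick sanity check on that arithmetic (rough numbers only; this ignores whatever workspace CuDNN requests on top):

widths = [2000 - 4i for i in 1:6]                 # 1996, 1992, ..., 1976
activations = sum(w^2 * 16 for w in widths)       # ~378 million Float32 elements
activations * 4 / 2^30                            # ~1.41 GiB of activations

nparams = (5*5*1*16 + 16) + 5 * (5*5*16*16 + 16)  # ~32.5k parameters
nparams * 4 / 2^20                                # ~0.12 MiB, i.e. negligible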

If you are feeling adventurous, you could try out NNlib#346 on this, which rearranges how Conv works to save just under half of the memory.

After playing around a bit more, one culprit for poor memory reclamation may be the use of globals. This version does not require any calls to GC or CUDA.reclaim():

using CUDA
using Flux

cpu_model = Chain(
    Conv((5, 5), 1=>16, sigmoid),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu)
)
model = gpu(cpu_model)
ps = params(model)

grad(model, ps, img, target) = gradient(() -> Flux.mse(model(img), target), ps)

# warm up: compile the gradient code on a tiny input first
let img_small = CUDA.rand(Float32, 25, 25, 1, 1), target_small = CUDA.rand(Float32, 1, 1, 16, 1)
    grad(model, ps, img_small, target_small)
end

let
    img = CUDA.rand(Float32, 2000, 2000, 1, 1)
    target = CUDA.rand(Float32, Flux.outputsize(cpu_model, size(img))...)
    # NVTX.@range labels each iteration as a named span in the Nsight Systems timeline
    CUDA.@profile for i in 1:5
        NVTX.@range "iter $i" begin
            grad(model, ps, img, target)
        end
    end
end

The @profile and @range macros are there because I looked at the results in Nsight Systems, and you can too! As expected, the GC occasionally kicks in and brings memory usage back in line.

Thanks @ToucheSir. I’ve tried both solutions, copying the code verbatim, but both always resulted in an OOM. I have also checked the behaviour on my other machine with an RTX 2080 (desktop), on both Win10 and Linux. Still no luck.

BTW, my Manifest.toml looks like this: Manifest.toml - Pastebin.com
CUDA version is:

julia> CUDA.version()
v"11.6.0"

I’d like to note that, for me, not even a single iteration of the loop containing the grad(...) call will execute. It’s not that I need 5 iterations (as in your example) to fill up the memory and trigger an OOM because the GC kicks in too late.

I guess if I don’t find a better solution I’ll need to try to make the model smaller by increasing the stride and/or adding pooling.
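
Something along these lines, maybe (just a hypothetical variant, untested; the stride and pool sizes are arbitrary):

model = Chain(
    Conv((5, 5), 1=>16, sigmoid; stride=2),  # halves the spatial dimensions up front
    MaxPool((2, 2)),                         # halves them again
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu)
) |> gpu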

Can you confirm that the model works on the small input? If not, that’s likely a CUDA issue.

Another thing to keep in mind is that your system is consuming VRAM as well. I ran my tests on a machine that doesn’t usually have any kind of desktop session active. If nvidia-smi reports more than a few hundred MB being used when Julia is not running, then you may just not have enough overall.
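
You can also check this from inside Julia before loading anything (a quick query; the numbers will include whatever the desktop session has already claimed):

using CUDA

CUDA.memory_status()               # prints current pool and device memory usage
CUDA.available_memory() / 2^30     # free device memory, in GiB
CUDA.total_memory() / 2^30         # total device memory, in GiB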