Memory is not freed with CUDA and two REPLs

Hey,

I observed that on Julia 1.6.1 with CUDA 3.1.0 (and 2.6.3), I’m accumulating memory (with pointwise multiplications) that I can only free by closing the REPL.

I also track usage with nvidia-smi; initially it was 409MiB / 7959MiB.

The size of x is 1.250GiB and my GPU has ~8GiB.
I do the following in a first REPL:

julia> using CUDA

julia> x = CUDA.rand(Float32, 4096, 4096, 20);

julia> b = x .* x;

julia> b = x .* x;

julia> b = x .* x;

julia> b = x .* x;

julia> b = x .* x;

julia> b = x .* x;

julia> b = nothing; x=nothing;

julia> GC.gc(true);

julia> CUDA.memory_status()
Effective GPU memory usage: 71.08% (5.526 GiB/7.773 GiB)  # initially 409MiB /  7959MiB
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)

I would have expected that in a second REPL (while keeping the first open) I can do the following again, but:

julia> using CUDA

julia> x = CUDA.rand(Float32, 4096, 4096, 20);

julia> b = x .* x;
ERROR: Out of GPU memory trying to allocate 1.250 GiB
Effective GPU memory usage: 91.40% (7.105 GiB/7.773 GiB)
CUDA allocator usage: 1.250 GiB
Memory pool usage: 1.250 GiB (1.250 GiB allocated, 0 bytes cached)

Stacktrace:
  [1] #alloc#244
    @ ~/.julia/packages/CUDA/k52QH/src/pool.jl:286 [inlined]
  [2] alloc
    @ ~/.julia/packages/CUDA/k52QH/src/pool.jl:278 [inlined]
  [3] CuArray{Float32, 3}(#unused#::UndefInitializer, dims::Tuple{Int64, Int64, Int64})
    @ CUDA ~/.julia/packages/CUDA/k52QH/src/array.jl:20
  [4] CuArray
    @ ~/.julia/packages/CUDA/k52QH/src/array.jl:101 [inlined]
  [5] similar
    @ ./abstractarray.jl:785 [inlined]
  [6] similar
    @ ./abstractarray.jl:784 [inlined]
  [7] similar
    @ ~/.julia/packages/CUDA/k52QH/src/broadcast.jl:11 [inlined]
  [8] copy
    @ ~/.julia/packages/GPUArrays/4n0iS/src/host/broadcast.jl:47 [inlined]
  [9] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{3}, Nothing, typeof(*), Tuple{CuArray{Float32, 3}, CuArray{Float32, 3}}})
    @ Base.Broadcast ./broadcast.jl:883
 [10] top-level scope
    @ REPL[3]:1
 [11] top-level scope
    @ ~/.julia/packages/CUDA/k52QH/src/initialization.jl:81

It looks like the first REPL is still reserving the memory. Is that true?
How can I avoid that?
In the documentation I could only find the tips with GC.gc(true) etc.

I would be very happy to know how to fix this.

Thanks a lot,

Felix

Memory is being cached by the CUDA stream-ordered allocator for future reuse. This doesn’t work well when multiple Julia instances share the same GPU. If you really need the memory, you can run with JULIA_CUDA_MEMORY_POOL=none, but this is obviously going to hurt performance.
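
As a minimal sketch, this is how the setting could be applied from inside Julia. It assumes the variable is read when CUDA.jl initializes, so it has to be set before `using CUDA`; exporting it in the shell before starting Julia should work just as well.

```julia
# Assumption: CUDA.jl reads JULIA_CUDA_MEMORY_POOL during initialization,
# so it must be set before the package is loaded.
ENV["JULIA_CUDA_MEMORY_POOL"] = "none"

using CUDA

x = CUDA.rand(Float32, 4096, 4096, 20)
b = x .* x

# Drop the references and run a full collection so the buffers get freed.
b = nothing; x = nothing
GC.gc(true)

# With the pool disabled, freed buffers should be handed back to the driver;
# compare this against nvidia-smi in another terminal.
CUDA.memory_status()
```

Every allocation then goes straight to the driver, which is where the performance cost comes from.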

So the advice is to keep a single Julia instance (single IJulia kernel) running with access to the GPU?

Thanks a lot for your quick help!

Yeah, or you’d need to disable the pool. It’s possible to wipe the pool programmatically, too, but that depends on the active pool, and you’d need to run the commands frequently, so I don’t think you should bother.
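
For reference, a rough sketch of what wiping the pool by hand might look like. It assumes `CUDA.reclaim()` is available in the installed CUDA.jl version; how much memory it actually hands back depends on the active pool, as said above.

```julia
using CUDA

# Run a full collection first so finalizers of unreachable CuArrays
# return their buffers to the pool.
GC.gc(true)

# Ask CUDA.jl to release memory cached by the pool back to the driver.
# Assumption: the active pool supports this; the effect may vary.
CUDA.reclaim()

# Check what the driver reports afterwards.
CUDA.memory_status()
```

This would have to run after every batch of GPU work in each Julia instance, which is why it is rarely worth the effort.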

Hm, I see.

The following still happens to me, and I’m not sure how to prevent it.
After some time it looks like memory is leaked, and I can’t compute the FFT without a manual GC.gc(true) call in between.

Could there be a memory leak in cuFFT as used from Julia?

julia> using CUDA, FFTW

julia> x = CUDA.rand(ComplexF32, (512, 512, 512)); # 1GiB memory

julia> CUDA.memory_status()
Effective GPU memory usage: 24.61% (1.913 GiB/7.773 GiB)
CUDA allocator usage: 1.000 GiB
Memory pool usage: 1.000 GiB (1.000 GiB allocated, 0 bytes cached)

julia> CUDA.@time y = fft(x);
  0.995551 seconds (2.68 M CPU allocations: 147.661 MiB, 2.32% gc time) (2 GPU allocations: 2.000 GiB, 4.73% gc time of which 9.55% spent allocating)

julia> CUDA.memory_status()
Effective GPU memory usage: 50.96% (3.962 GiB/7.773 GiB)
CUDA allocator usage: 2.000 GiB
Memory pool usage: 2.000 GiB (2.000 GiB allocated, 0 bytes cached)

julia> GC.gc(true)

julia> CUDA.memory_status()
Effective GPU memory usage: 63.83% (4.962 GiB/7.773 GiB)
CUDA allocator usage: 2.000 GiB
Memory pool usage: 2.000 GiB (2.000 GiB allocated, 0 bytes cached)

julia> CUDA.@time y = fft(x);
  0.027007 seconds (329.92 k CPU allocations: 5.035 MiB) (2 GPU allocations: 2.000 GiB, 0.05% gc time of which 61.26% spent allocating)

julia> CUDA.@time y = fft(x);
  0.083744 seconds (251.18 k CPU allocations: 3.833 MiB, 4.40% gc time) (2 GPU allocations: 2.000 GiB, 68.15% gc time of which 85.60% spent allocating)

julia> CUDA.@time y = fft(x);
  0.037591 seconds (373.56 k CPU allocations: 5.701 MiB, 7.31% gc time) (2 GPU allocations: 2.000 GiB, 18.91% gc time of which 0.13% spent allocating)

julia> CUDA.@time y = fft(x);
  0.037605 seconds (371.99 k CPU allocations: 5.677 MiB, 7.70% gc time) (2 GPU allocations: 2.000 GiB, 19.19% gc time of which 0.14% spent allocating)

julia> CUDA.@time y = fft(x);
  0.038422 seconds (365.12 k CPU allocations: 5.573 MiB, 8.85% gc time) (2 GPU allocations: 2.000 GiB, 20.55% gc time of which 0.16% spent allocating)

julia> CUDA.@time y = fft(x);
  0.038034 seconds (362.32 k CPU allocations: 5.529 MiB, 8.32% gc time) (2 GPU allocations: 2.000 GiB, 20.03% gc time of which 0.15% spent allocating)

julia> GC.gc(true)

julia> CUDA.memory_status()
Effective GPU memory usage: 76.69% (5.962 GiB/7.773 GiB)
CUDA allocator usage: 2.000 GiB
Memory pool usage: 2.000 GiB (2.000 GiB allocated, 0 bytes cached)

julia> CUDA.@time y = fft(x);
  0.027007 seconds (322.19 k CPU allocations: 4.917 MiB) (2 GPU allocations: 2.000 GiB, 0.05% gc time of which 66.57% spent allocating)

julia> CUDA.@time y = fft(x);
ERROR: CUFFTError: driver or internal cuFFT library error (code 5, CUFFT_INTERNAL_ERROR)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUFFT.cufftResult_t)
    @ CUDA.CUFFT ~/.julia/packages/CUDA/k52QH/lib/cufft/error.jl:64
  [2] macro expansion
    @ ~/.julia/packages/CUDA/k52QH/lib/cufft/error.jl:81 [inlined]
  [3] cufftMakePlan3d(plan::Int32, nx::Int64, ny::Int64, nz::Int64, type::CUDA.CUFFT.cufftType_t, workSize::Base.RefValue{UInt64})
    @ CUDA.CUFFT ~/.julia/packages/CUDA/k52QH/lib/utils/call.jl:26
  [4] create_plan(xtype::CUDA.CUFFT.cufftType_t, xdims::Tuple{Int64, Int64, Int64}, region::UnitRange{Int64})
    @ CUDA.CUFFT ~/.julia/packages/CUDA/k52QH/lib/cufft/fft.jl:137
  [5] plan_fft
    @ ~/.julia/packages/CUDA/k52QH/lib/cufft/fft.jl:293 [inlined]
  [6] #plan_fft#10
    @ ~/.julia/packages/FFTW/Iu2GG/src/fft.jl:693 [inlined]
  [7] plan_fft
    @ ~/.julia/packages/FFTW/Iu2GG/src/fft.jl:693 [inlined]
  [8] fft(x::CuArray{ComplexF32, 3})
    @ AbstractFFTs ~/.julia/packages/AbstractFFTs/JebmH/src/definitions.jl:50
  [9] macro expansion
    @ ~/.julia/packages/CUDA/k52QH/src/utilities.jl:28 [inlined]
 [10] top-level scope
    @ ~/.julia/packages/CUDA/k52QH/src/pool.jl:572 [inlined]
 [11] top-level scope
    @ ./REPL[18]:0
 [12] top-level scope
    @ ~/.julia/packages/CUDA/k52QH/src/initialization.jl:81

I also observed that the GPU memory allocation is twice as high as with FFTW on the CPU (16 MiB on the GPU vs. about 8 MiB on the CPU for the same array). What is the reason for that?

julia> using FFTW, CUDA

julia> x = randn(ComplexF32, (1024, 1024));

julia> x_c = CuArray(x);

julia> @time fft(x);
  0.297959 seconds (824.23 k allocations: 57.072 MiB, 23.40% gc time)

julia> @time fft(x);
  0.058378 seconds (29.23 k allocations: 9.767 MiB, 12.41% gc time, 22.68% compilation time)

julia> @time fft(x);
  0.041144 seconds (35 allocations: 8.003 MiB)

julia> CUDA.@time fft(x_c);
  1.145209 seconds (4.24 M CPU allocations: 235.730 MiB, 5.92% gc time) (2 GPU allocations: 16.000 MiB, 5.87% gc time of which 0.04% spent allocating)

julia> CUDA.@time fft(x_c);
  0.000613 seconds (2.72 k CPU allocations: 43.109 KiB) (2 GPU allocations: 16.000 MiB, 2.30% gc time of which 61.99% spent allocating)

julia> CUDA.@time fft(x_c);
  0.000698 seconds (2.75 k CPU allocations: 43.609 KiB) (2 GPU allocations: 16.000 MiB, 13.06% gc time of which 95.62% spent allocating)

This is a hard problem. It turns out NVIDIA’s libraries are sensitive to close-to-OOM situations, at which point they start to throw random errors like the CUFFT_INTERNAL_ERROR you’re seeing here. If it were to throw the proper CUFFT_ALLOC_FAILED error, we’d empty the caching pool and try the API call again, but that mechanism doesn’t trigger here. Similarly, we normally try to keep a ‘reserve’ of free memory (see pool.jl in JuliaGPU/CUDA.jl at commit a86d3cf785be1821614be90b57c3e3d0c80939cd), which is apparently ineffective here…

That sounds bad. Is there any trick to avoid that?

I observe that my optimizations (heavily based on fft) just fail in the second or third run.

Could you open an issue? There are a couple of things we can do: special-case that call in CUDA.jl to also retry on a CUFFT_INTERNAL_ERROR, try to allocate our own workspace that’s part of the cached pool, etc.
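
Until something like that exists in CUDA.jl, a user-side stopgap could be to retry the call after cleaning up. The helper below is hypothetical (`retry_after_cleanup` is not part of CUDA.jl) and deliberately generic; the cleanup step shown in the comment assumes `CUDA.reclaim()` is available and is only a sketch, not a verified fix.

```julia
# Hypothetical helper (not part of CUDA.jl): run `f`; if it throws,
# call `cleanup` and try again, for up to `retries` extra attempts.
function retry_after_cleanup(f, cleanup; retries::Integer = 1)
    for attempt in 0:retries
        try
            return f()
        catch
            # Out of attempts: propagate the last error unchanged.
            attempt == retries && rethrow()
            cleanup()
        end
    end
end

# Intended use with CUDA.jl (assumes `using CUDA, FFTW` and a CuArray `x`):
#
#   y = retry_after_cleanup(() -> fft(x),
#                           () -> (GC.gc(true); CUDA.reclaim()))
```

Note that this retries on any exception, not only on CUFFT errors. Creating the plan once with `plan_fft(x)` and reusing it as `p * x` might also avoid the repeated plan creation that fails in the trace above, though whether that sidesteps the error is untested.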

I’ll try to create one later.
Thanks for your help!

Edit: https://github.com/JuliaGPU/CUDA.jl/issues/894