CUFFT.plan_fft! takes a lot of memory that cannot be freed

Hi, I’m playing with CUDA.jl for FFT computations. I use CUFFT.plan_fft! to perform in-place FFTs on large complex arrays. What I found is that the in-place plan itself seems to occupy a large chunk of GPU memory, roughly the same size as the array itself. Moreover, I can’t seem to free this memory even after setting both objects to nothing. As a result, if I run the same code twice, the second run fails with an error:

julia> versioninfo()
       using CUDA, FFTW, BenchmarkTools, LinearAlgebra
       CUDA.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 24 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 
CUDA runtime 12.1, artifact installation
CUDA driver 12.2
NVIDIA driver 535.54.3

CUDA libraries: 
- CUBLAS: 12.1.3
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.5
- CUSPARSE: 12.1.0
- CUPTI: 18.0.0
- NVML: 12.0.0+535.54.3

Julia packages: 
- CUDA: 4.4.0
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0

Toolchain:
- Julia: 1.9.2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

2 devices:
  0: NVIDIA GeForce RTX 3090 (sm_86, 23.676 GiB / 24.000 GiB available)
  1: NVIDIA GeForce RTX 3090 (sm_86, 22.611 GiB / 24.000 GiB available)

julia> N=36000
       Ac = CUDA.rand(ComplexF32,(N,N))
       CUDA.memory_status()
Effective GPU memory usage: 41.88% (9.922 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)

julia> pc=CUDA.CUFFT.plan_fft!(Ac)
       CUDA.memory_status()
Effective GPU memory usage: 82.67% (19.584 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)

julia> @benchmark CUDA.@sync begin
           $pc*$Ac; 
           $Ac./=N;
       end
       CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)

julia> Ac = nothing; GC.gc(true)
       CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)

julia> pc = nothing; GC.gc(true)
       CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 0 bytes (9.656 GiB reserved)

julia> Ac = CUDA.rand(ComplexF32,(N,N))
       CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)

julia> pc=CUDA.CUFFT.plan_fft!(Ac)
       CUDA.memory_status()
ERROR: CUFFTError: driver or internal cuFFT library error (code 5, CUFFT_INTERNAL_ERROR)
...

In the second-to-last command, I can allocate Ac again without consuming additional memory, so I guess the unreleased memory from the first Ac was reused. However, when creating pc again, I run into the error above, which presumably indicates the GPU is out of memory.
So how can I do this correctly, namely, create a true “in-place” FFT and be able to release its memory after use?

We additionally cache handles in a HandleCache (CUFFT.idle_handles). That cache currently isn’t memory-pressure aware, and it probably should be. You can try emptying it, although there is currently no convenient empty! API for it (you could try empty!(CUFFT.idle_handles.idle_handles) or something like that). Probably worth filing an issue about this.
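
For example, something along the lines of the following might work. This is only a sketch that reaches into CUDA.jl internals: the idle_handles field layout and calling cufftDestroy on the cached handles are assumptions on my part and may not match your CUDA.jl version.

using CUDA

# Hypothetical helper: destroy the idle cuFFT plan handles that CUDA.jl
# caches, then empty the cache. Relies on internal fields of HandleCache,
# which may differ between CUDA.jl versions.
function empty_cufft_cache!()
    cache = CUDA.CUFFT.idle_handles
    for (key, handles) in cache.idle_handles, handle in handles
        CUDA.CUFFT.cufftDestroy(handle)  # frees the plan's work area
    end
    empty!(cache.idle_handles)
    return nothing
end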

The plans have finalizers which should release the memory (or mark it as available) when they go out of scope. If you create and use the plans inside a function, the finalizers are more likely to be invoked correctly.
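
Roughly like this; a minimal sketch of that pattern, where the function name and the normalization step are just illustrative:

using CUDA

# Create and use the plan inside one function so the plan object becomes
# unreachable (and thus finalizable) as soon as the function returns.
function fft_and_normalize!(A)
    p = CUDA.CUFFT.plan_fft!(A)  # in-place plan; allocates a cuFFT work area
    p * A                        # execute the FFT in place
    A ./= length(A)              # illustrative normalization
    return A
end

Ac = CUDA.rand(ComplexF32, 4096, 4096)
fft_and_normalize!(Ac)

# Afterwards, drop remaining references, run the GC so finalizers fire,
# and ask CUDA.jl to return freed pool memory to the driver.
Ac = nothing
GC.gc(true)
CUDA.reclaim()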

They are cached; see https://github.com/JuliaGPU/CUDA.jl/blob/d79adbfd090b0e51ccaf4c74710eaa610e0bf998/lib/cufft/wrappers.jl#L138-L160