Hi, I’m playing with CUDA.jl for FFT computations. I use CUFFT.plan_fft! to perform in-place FFTs on large complex arrays. What I found is that the in-place plan itself seems to occupy a large chunk of GPU memory, roughly the same size as the array being transformed. Moreover, I can’t seem to free this memory even after setting both objects to nothing and forcing a garbage collection. As a result, if I run the same code twice, the second run fails with an error:
julia> using CUDA, FFTW, BenchmarkTools, LinearAlgebra

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
Threads: 1 on 24 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS =
julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.2
NVIDIA driver 535.54.3
CUDA libraries:
- CUBLAS: 12.1.3
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.5
- CUSPARSE: 12.1.0
- CUPTI: 18.0.0
- NVML: 12.0.0+535.54.3
Julia packages:
- CUDA: 4.4.0
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0
Toolchain:
- Julia: 1.9.2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
2 devices:
0: NVIDIA GeForce RTX 3090 (sm_86, 23.676 GiB / 24.000 GiB available)
1: NVIDIA GeForce RTX 3090 (sm_86, 22.611 GiB / 24.000 GiB available)
julia> N=36000
Ac = CUDA.rand(ComplexF32,(N,N))
CUDA.memory_status()
Effective GPU memory usage: 41.88% (9.922 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)
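That 9.656 GiB is exactly the footprint of Ac; a quick sanity check:

julia> N^2 * sizeof(ComplexF32) / 2^30   # 36000^2 complex values at 8 bytes each ≈ 9.656 GiB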
julia> pc=CUDA.CUFFT.plan_fft!(Ac)
CUDA.memory_status()
Effective GPU memory usage: 82.67% (19.584 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)
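Note that creating the plan raises the effective usage by 19.584 − 9.922 ≈ 9.66 GiB, about the size of the array again, while the pool usage is unchanged. So cuFFT apparently allocates its work area directly from the driver, bypassing the CUDA.jl memory pool, which would explain why it is invisible to the pool statistics below.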
julia> @benchmark CUDA.@sync begin
$pc*$Ac;
$Ac./=N;
end
CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)
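If I understand the AbstractFFTs semantics of in-place plans correctly, pc * Ac overwrites Ac and returns it rather than allocating a result, which the unchanged pool usage seems to confirm. That is, I expect this to hold:

julia> (pc * Ac) === Ac   # an in-place plan should mutate and return its input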
julia> Ac = nothing; GC.gc(true)
CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)
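Interestingly, the pool usage does not drop here even though Ac is now unreachable; it only goes to zero after the next GC pass below. Perhaps something (the REPL’s ans, or the @benchmark machinery) still holds a reference at this point.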
julia> pc = nothing; GC.gc(true)
CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 0 bytes (9.656 GiB reserved)
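So after dropping both objects the pool is empty on the Julia side, but 9.656 GiB remains reserved and the ~9.9 GiB the plan grabbed is still not returned. I would have expected something like the following to help, though as far as I understand CUDA.reclaim() only touches the pool, not whatever cuFFT allocated:

julia> CUDA.reclaim()        # hand unused, reserved pool memory back to the driver
julia> CUDA.memory_status()  # reserved should shrink; the plan’s allocation presumably would not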
julia> Ac = CUDA.rand(ComplexF32,(N,N))
CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)
julia> pc=CUDA.CUFFT.plan_fft!(Ac)
CUDA.memory_status()
ERROR: CUFFTError: driver or internal cuFFT library error (code 5, CUFFT_INTERNAL_ERROR)
...
In the second-to-last command, I can allocate Ac again without any additional memory being used; I guess the unreleased (reserved) memory from the previous Ac is reused. However, when creating pc again, I run into the error above, which presumably indicates running out of memory.
So how can I do this correctly, namely, create a truly “in-place” FFT plan, and be able to release its memory after use?
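For reference, the pattern I’m after is roughly the following sketch, where destroy_plan! is hypothetical (I don’t know what the real call would be, or whether one exists):

using CUDA

N = 36000
Ac = CUDA.rand(ComplexF32, (N, N))
pc = CUDA.CUFFT.plan_fft!(Ac)    # in-place plan
pc * Ac                          # transform Ac in place, no extra array allocated
Ac ./= N                         # normalize
CUDA.unsafe_free!(Ac)            # eagerly release the array's pool memory
# destroy_plan!(pc)              # hypothetical: release the plan's work area too
CUDA.reclaim()                   # return reserved pool memory to the driver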