Hi, I’m playing with CUDA.jl for FFT computations. I use CUFFT.plan_fft! to perform in-place FFTs on large complex arrays. What I found is that the in-place plan itself seems to occupy a large chunk of GPU memory, roughly the same size as the array being transformed. Moreover, I can’t seem to free this memory even after setting both objects to nothing and forcing a garbage collection. As a result, if I run the same code twice, the second run fails with an error:
julia> using CUDA, FFTW, BenchmarkTools, LinearAlgebra

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
Threads: 1 on 24 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS =
julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.2
NVIDIA driver 535.54.3
CUDA libraries:
- CUBLAS: 12.1.3
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.5
- CUSPARSE: 12.1.0
- CUPTI: 18.0.0
- NVML: 12.0.0+535.54.3
Julia packages:
- CUDA: 4.4.0
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0
Toolchain:
- Julia: 1.9.2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
2 devices:
0: NVIDIA GeForce RTX 3090 (sm_86, 23.676 GiB / 24.000 GiB available)
1: NVIDIA GeForce RTX 3090 (sm_86, 22.611 GiB / 24.000 GiB available)
julia> N=36000
Ac = CUDA.rand(ComplexF32,(N,N))
CUDA.memory_status()
Effective GPU memory usage: 41.88% (9.922 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)
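That 9.656 GiB is exactly the footprint of Ac; a quick sanity check:

julia> N^2 * sizeof(ComplexF32) / 2^30   # 36000^2 complex values at 8 bytes each ≈ 9.656 GiB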
julia> pc=CUDA.CUFFT.plan_fft!(Ac)
CUDA.memory_status()
Effective GPU memory usage: 82.67% (19.584 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)
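Note that creating the plan raises the effective usage by 19.584 − 9.922 ≈ 9.66 GiB, about the size of the array again, while the pool usage is unchanged. So cuFFT apparently allocates its work area directly from the driver, bypassing the CUDA.jl memory pool, which would explain why it is invisible to the pool statistics below.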
julia> @benchmark CUDA.@sync begin
$pc*$Ac;
$Ac./=N;
end
CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)
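If I understand the AbstractFFTs semantics of in-place plans correctly, pc * Ac overwrites Ac and returns it rather than allocating a result, which the unchanged pool usage seems to confirm. That is, I expect this to hold:

julia> (pc * Ac) === Ac   # an in-place plan should mutate and return its input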
julia> Ac = nothing; GC.gc(true)
CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)
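Interestingly, the pool usage does not drop here even though Ac is now unreachable; it only goes to zero after the next GC pass below. Perhaps something (the REPL’s ans, or the @benchmark machinery) still holds a reference at this point.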
julia> pc = nothing; GC.gc(true)
CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 0 bytes (9.656 GiB reserved)
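So after dropping both objects the pool is empty on the Julia side, but 9.656 GiB remains reserved and the ~9.9 GiB the plan grabbed is still not returned. I would have expected something like the following to help, though as far as I understand CUDA.reclaim() only touches the pool, not whatever cuFFT allocated:

julia> CUDA.reclaim()        # hand unused, reserved pool memory back to the driver
julia> CUDA.memory_status()  # reserved should shrink; the plan’s allocation presumably would not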
julia> Ac = CUDA.rand(ComplexF32,(N,N))
CUDA.memory_status()
Effective GPU memory usage: 82.78% (19.612 GiB/23.691 GiB)
Memory pool usage: 9.656 GiB (9.656 GiB reserved)
julia> pc=CUDA.CUFFT.plan_fft!(Ac)
CUDA.memory_status()
ERROR: CUFFTError: driver or internal cuFFT library error (code 5, CUFFT_INTERNAL_ERROR)
...
In the second-to-last command, I can allocate Ac again without any additional memory being used; I guess the unreleased (reserved) memory from the previous Ac is reused. However, when creating pc again, I run into the error above, which presumably indicates running out of memory.
So how can I do this correctly, namely, create a truly “in-place” FFT plan, and be able to release its memory after use?
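For reference, the pattern I’m after is roughly the following sketch, where destroy_plan! is hypothetical (I don’t know what the real call would be, or whether one exists):

using CUDA

N = 36000
Ac = CUDA.rand(ComplexF32, (N, N))
pc = CUDA.CUFFT.plan_fft!(Ac)    # in-place plan
pc * Ac                          # transform Ac in place, no extra array allocated
Ac ./= N                         # normalize
CUDA.unsafe_free!(Ac)            # eagerly release the array's pool memory
# destroy_plan!(pc)              # hypothetical: release the plan's work area too
CUDA.reclaim()                   # return reserved pool memory to the driver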