I want to use CUDA.jl instead of CUDA C/C++ on a Jetson Nano (a single-board computer with a GPU), but I am puzzled by the memory usage of CUFFT.ifft(). I have confirmed in multiple environments, including the Jetson, Ubuntu, and Windows, that the memory usage of the Julia process increases by about 800 MB only when CUFFT.ifft() is executed, whereas the increase after a CUFFT.fft() run is about 180 MB. What is happening?
I have also tried handling plans, but nothing changed. It seems to me that the fact that the plan generated by CUFFT.plan_ifft() is not a CUFFT.cCuFFTPlan{} but an AbstractFFTs.ScaledPlan{} may have something to do with the problem, but I am not sure. The problem occurs in the same way when I generate a plan and apply it to the input array with *, as in the sketch below.
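Concretely, this is what I mean by handling plans (x is the same CuArray as in the reproduction code below; the comments reflect my understanding of AbstractFFTs, where plan_ifft() wraps the raw backward plan in a ScaledPlan that applies the 1/N normalization):

p = CUFFT.plan_ifft(x)
# typeof(p) is AbstractFFTs.ScaledPlan{...}, not CUFFT.cCuFFTPlan{...};
# the raw cCuFFTPlan sits inside it as p.p, and p.scale holds 1/length(x)
ifft_x = p * x
# memory usage of the Julia process grows here just as with CUFFT.ifft(x)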
Is there any way to execute fft() and ifft() with less memory usage?
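One workaround I am considering is to build the unscaled, in-place backward plan myself and apply the 1/N normalization by hand, so that neither a ScaledPlan nor a new output array is involved. This is only a sketch; I am assuming CUFFT.plan_bfft!() is implemented for CuArray like the out-of-place plan_bfft():

pb = CUFFT.plan_bfft!(x)   # unscaled in-place backward plan (a cCuFFTPlan)
pb * x                     # overwrites x with the unscaled inverse FFT
x ./= length(x)            # apply the 1/N scaling manually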
POSTSCRIPT
Here is specific code that reproduces the problem.
using CUDA, CUDA.CUFFT   # CUFFT is a submodule of CUDA.jl; there is no standalone CUFFT package

x = ComplexF32.(CUDA.rand(1024, 1024))
# 1024×1024 CuArray{ComplexF32, 2, CUDA.Mem.DeviceBuffer}

# this is OK (memory usage of the Julia process increases by about 180 MB)
@time fft_x = CUFFT.fft(x)
# first run
# 0.965564 seconds (4.46 M allocations: 229.851 MiB, 8.06% gc time, 95.79% compilation time)
# second run
# 0.005171 seconds (16 allocations: 672 bytes)
# Memory usage of the Julia process up to this point is 496 MB.

# this is the problem
@time ifft_x = CUFFT.ifft(x)
# first run
# 1.399079 seconds (1.26 M allocations: 67.445 MiB, 2.01% gc time, 63.29% compilation time)
# second run
# 0.005467 seconds (17 allocations: 704 bytes)
# Memory usage of the Julia process then increased to 1.2 GB.
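In case it helps with the diagnosis: my understanding is that CUDA.jl keeps freed device memory in a pool, and these utilities can inspect and trim it. Whether the pool has anything to do with the host-side growth above is only my guess:

CUDA.memory_status()   # print how much device memory the pool is holding
GC.gc()                # let Julia collect unreferenced CuArrays first
CUDA.reclaim()         # return cached pool memory to the CUDA driver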