CuArrays and garbage collection when doing FFT convolutions

rveltz · July 16, 2018, 10:52am

Hi,

I am using CuArrays to perform fft convolutions and I am getting the error

ERROR: CUFFTError(code 2, cuFFT failed to allocate GPU or CPU memory).

I have threads about collecting temporaries but I don’t know how to do this here.

I’d be happy to have some hints,

Thank you,

Romain

Here is a MWE (hopefully).

using CuArrays, GPUArrays, CUDAnative
import Base: *

module Convolution
    export convolution

    struct convolution
        kernel::AbstractArray
        kernel_fft::AbstractArray
        tmp::AbstractArray
        fft_flag::Int
        N::Int
        n::Int
        gpu::Bool
        p_forward
        p_backward

        function convolution(kernel::AbstractArray,gpu = false)
            if gpu
                x = new(kernel,fft(fftshift(kernel)),
                    fft(kernel),0,size(kernel)[1],
                    prod(size(kernel)),gpu,
                    0,
                    0)
            else
                error("not here")
            end
            return x
        end
    end
end

function *(cv::Convolution.convolution, x::AbstractArray)
    # return ifftshift(real(irfft(cv.kernel_fft .* rfft(fftshift(x)),cv.N)))
    if cv.gpu
        return real.(   ifftshift(ifft(cv.kernel_fft .* fft(fftshift(x)))))
    else
        return ifftshift(real(ifft(cv.kernel_fft .* fft(fftshift(x)))))
    end
end

TY = Float32

dev = CUDAnative.CuDevice(0)
const gpu = cu

N = 2^10
L = 100
hx = 2L/N |> TY
println("\n\n###############\n Neural Field solution , N = $N, dx= $hx\n"*"#"^20)
X = TY.(-L + hx * collect(0:N-1) )
g = TY(1e-4)*exp.(-(1 * X.^2 .+ 1 * X'.^2)/10) |> gpu
J = Convolution.convolution(g,true)

v = rand(N,N) |> gpu;
v2=zeros(v)
for ii=1:20000
	v2 .= J * v
	(ii,GPUArrays.free_global_memory(dev)) |> println
end

rveltz · July 17, 2018, 6:20am

Should I be pessimistic about this? Or maybe a macro ala @af_gc is possible like in ArrayFire?

rveltz · July 23, 2018, 7:03pm

Does anybody has an advice? my code is not usable with this issue…

hskramer · July 24, 2018, 12:12am

I have two P5000 and it did finish but it required 16gb of gpu ram the only option I can think of is unified memory. I know the ArrayFire library has support for unified memory so I would look into it and see if it’s been implemented in julia’s ArrayFire.
Good Luck
H. Kramer

Ralph_Smith · July 24, 2018, 4:06am

If you use saved plans to do the transforms there is less stress. (It looks as if you were intending to do that at some point.) Also, putting the for loop in a function seems to give the system a better chance to clean up. With these changes I could run your problem on a small GPU, although Julia did grab all of its memory during the loop.

The problem is not strictly related to FFTs - just creating the temporaries for the fftshift calls leads to huge memory usage.

rveltz · July 24, 2018, 5:22am

Thank you for your input.

rveltz · July 24, 2018, 5:22am

How did you come to your last conclusion?

Ralph_Smith · July 24, 2018, 12:01pm

I just replaced your product expression with ifftshift(cv.kernel .* fftshift(x)) and watched the memory status. On second thought, my last conclusion is probably wrong - the garbage collection seems to run only when needed, and this works for the cases without repeated plan generation, but not with your original version.

sdanisch · July 24, 2018, 2:50pm

You should be able to plan the fft && preallocate!

E.g. like I do here:
https://github.com/SimonDanisch/SchroedingersSmoke.jl/blob/master/src/parallel.jl#L79
https://github.com/SimonDanisch/SchroedingersSmoke.jl/blob/master/src/parallel.jl#L170

Should in theory work with CLArrays & CuArrays like that!

rveltz · July 24, 2018, 5:45pm

You think the plan is allocating more than your temporary isf.fc?

sdanisch · July 24, 2018, 10:21pm

I don’t really understand that question…
You plan an inplace fft and preallocate a buffer for it, so you can have allocation free hot loops, like the one in possion_solve - calling poission_solve should pretty much have 0 allocations.

rveltz · July 25, 2018, 7:50pm

Oh this is nice indeed! I will try it, thank you

rveltz · March 6, 2019, 8:39pm

Hi,

Thank you for the suggestion, it worked!

The only allocation is now from fftshift. Is there an inplace version?

Topic		Replies	Views
Using CUDA fft General Usage	1	1148	March 13, 2019
How to avoid memory blow up with FFT without expensive garbage collection? Performance question , fftw , parallel , memory-allocation , fft	3	355	June 10, 2024
Calculate FFT on GPU for every row of a 2D array Performance gpu	2	1021	August 14, 2018
CUFFT.plan_fft! take a lot of memory, cannot be freed GPU memory	3	496	August 3, 2023
Avoiding Memory leaks using CuArrays GPU performance , flux	3	1636	May 24, 2019

CuArrays and garbage collection when doing FFT convolutions

Related topics