CuArrays and garbage collection when doing FFT convolutions

Hi,

I am using CuArrays to perform fft convolutions and I am getting the error

ERROR: CUFFTError(code 2, cuFFT failed to allocate GPU or CPU memory).

I have threads about collecting temporaries but I don’t know how to do this here.

I’d be happy to have some hints,

Thank you,

Romain

Here is a MWE (hopefully).

using CuArrays, GPUArrays, CUDAnative
import Base: *

module Convolution
    export convolution

    struct convolution
        kernel::AbstractArray
        kernel_fft::AbstractArray
        tmp::AbstractArray
        fft_flag::Int
        N::Int
        n::Int
        gpu::Bool
        p_forward
        p_backward

        function convolution(kernel::AbstractArray,gpu = false)
            if gpu
                x = new(kernel,fft(fftshift(kernel)),
                    fft(kernel),0,size(kernel)[1],
                    prod(size(kernel)),gpu,
                    0,
                    0)
            else
                error("not here")
            end
            return x
        end
    end
end

function *(cv::Convolution.convolution, x::AbstractArray)
    # return ifftshift(real(irfft(cv.kernel_fft .* rfft(fftshift(x)),cv.N)))
    if cv.gpu
        return real.(   ifftshift(ifft(cv.kernel_fft .* fft(fftshift(x)))))
    else
        return ifftshift(real(ifft(cv.kernel_fft .* fft(fftshift(x)))))
    end
end

TY = Float32

dev = CUDAnative.CuDevice(0)
const gpu = cu

N = 2^10
L = 100
hx = 2L/N |> TY
println("\n\n###############\n Neural Field solution , N = $N, dx= $hx\n"*"#"^20)
X = TY.(-L + hx * collect(0:N-1) )
g = TY(1e-4)*exp.(-(1 * X.^2 .+ 1 * X'.^2)/10) |> gpu
J = Convolution.convolution(g,true)

v = rand(N,N) |> gpu;
v2=zeros(v)
for ii=1:20000
	v2 .= J * v
	(ii,GPUArrays.free_global_memory(dev)) |> println
end

1 Like

Should I be pessimistic about this? Or maybe a macro ala @af_gc is possible like in ArrayFire?

Does anybody has an advice? my code is not usable with this issue…

I have two P5000 and it did finish but it required 16gb of gpu ram the only option I can think of is unified memory. I know the ArrayFire library has support for unified memory so I would look into it and see if it’s been implemented in julia’s ArrayFire.
Good Luck
H. Kramer

If you use saved plans to do the transforms there is less stress. (It looks as if you were intending to do that at some point.) Also, putting the for loop in a function seems to give the system a better chance to clean up. With these changes I could run your problem on a small GPU, although Julia did grab all of its memory during the loop.

The problem is not strictly related to FFTs - just creating the temporaries for the fftshift calls leads to huge memory usage.

Thank you for your input.

How did you come to your last conclusion?

I just replaced your product expression with ifftshift(cv.kernel .* fftshift(x)) and watched the memory status. On second thought, my last conclusion is probably wrong - the garbage collection seems to run only when needed, and this works for the cases without repeated plan generation, but not with your original version.

You should be able to plan the fft && preallocate!

E.g. like I do here:
https://github.com/SimonDanisch/SchroedingersSmoke.jl/blob/master/src/parallel.jl#L79
https://github.com/SimonDanisch/SchroedingersSmoke.jl/blob/master/src/parallel.jl#L170

Should in theory work with CLArrays & CuArrays like that!

1 Like

You think the plan is allocating more than your temporary isf.fc?

I don’t really understand that question…
You plan an inplace fft and preallocate a buffer for it, so you can have allocation free hot loops, like the one in possion_solve - calling poission_solve should pretty much have 0 allocations.

1 Like

Oh this is nice indeed! I will try it, thank you

Hi,

Thank you for the suggestion, it worked!

The only allocation is now from fftshift. Is there an inplace version?