Hi! GPGPU beginner here using the CUDAnative/CuArrays stack (Thank you @maleadt for this amazing work)
I am hoping to find some help with efficiently transferring data between host and device when the input and output data are much larger than can fit in device memory at once (in my case I have 5 GB of device memory to work with).
I am working on image processing that involves performing thousands of convolutions. Since the rate-limiting step tends to be the FFTs, I want to run these large array stacks through the GPU.
The MWE of what I am using is below.
When I run this through nvprof I can see clearly that the actual FFT is fast and most of my time is spent on data transfer. I can also tell that the transfers are serialized (HtoD and DtoH happen sequentially), but I am not sure how to implement overlapping/asynchronous transfers to and from the host with CUDAnative (from browsing the CUDA literature online I know this is something I should be considering).
A few things strike me as strange:

- From my limited understanding there ought to be extra overhead associated with each transfer between host and device, so many small transfers should be slower than batched ones. However, splitting the data into individual images rather than batched stacks turns out to be more performant in my case. The GPU speedup is also modest enough that it suggests a multithreaded CPU version would be faster (a rough sketch of what I mean by that follows this list).
- Although I see some speed increase as I grow the batch size, I hit a ceiling VERY quickly before I start getting memory allocation failures. Obviously, creating an FFT plan involves a check for available workspace for the FFT computation, but it seems absurd: passing a stack of less than 500 MB throws an allocation error, implying the device needs roughly 10x that amount of memory to run an FFT...?
- In general, a function like the one in the MWE will need to be run thousands of times in my final algorithm. I have noticed creeping memory leaks when running my code despite trying to keep everything encapsulated in its own scope/context. Is there something I can do to mitigate that? (The second sketch after this list is the kind of change I'm considering.)
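For the first point, the multithreaded CPU version I have in mind is nothing fancy: just letting FFTW use several of its own threads when executing the planned transform. A minimal untested sketch (the thread count of 4 is an arbitrary choice, and the chunk is sized as in the MWE):

using FFTW

# Let FFTW use its own internal threads when executing planned transforms.
FFTW.set_num_threads(4)   # arbitrary; would be tuned to the machine

A_chunk = rand(ComplexF32, 383, 765, 100)       # one chunk, sized as in the MWE
invplan_mt = plan_brfft(A_chunk, 765, [1, 2])   # plan now executes multithreaded
OUT_chunk = invplan_mt * A_chunk                # 765x765x100 Float32 result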
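For the last two points, I suspect part of the problem is self-inflicted: every iteration of my GPU loop allocates fresh device arrays (CuArray(view(A, ...)) and the result of invplan * A_GPU), and as far as I understand the CuArrays memory pool only returns that memory when the Julia GC runs, which could explain both the early allocation failures and the apparent leak. Below is the kind of change I'm considering, where the same device buffers are reused every iteration. This is only a sketch; it assumes (unverified) that CuArrays.CUFFT plans support the in-place mul! form of the AbstractFFTs interface and that CuArrays.reclaim() exists to hand pooled memory back to the driver.

using LinearAlgebra            # for mul!
using CuArrays, CuArrays.CUFFT

function reverseFFT_gpu_inplace(A, OUT, chunksize)
    loops = fld(size(A, 3), chunksize)
    A_GPU = CuArray(A[:, :, 1:chunksize])            # reused upload buffer
    OUT_GPU = cuzeros(Float32, 765, 765, chunksize)  # reused result buffer
    invplan = plan_brfft(A_GPU, 765, [1, 2])
    for k in 1:loops
        r = (1 + (k - 1) * chunksize):(k * chunksize)
        copyto!(A_GPU, A[:, :, r])        # upload without building a temporary CuArray
        mul!(OUT_GPU, invplan, A_GPU)     # in-place planned transform (assumed API; may clobber A_GPU)
        OUT[:, :, r] .= collect(OUT_GPU)  # download (still allocates a host copy)
    end
    GC.gc()
    CuArrays.reclaim()                    # assumed API: release pooled device memory
    return
end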
Cheers and thanks in advance for any comments or help.
using FFTW
using CuArrays, CuArrays.CUFFT
using CUDAdrv, CUDAnative
# Data sized similar to real use. Reduce 3rd dimension if limited host RAM
A = rand(ComplexF32, 383, 765, 3000)
OUT = zeros(Float32, 765, 765, 3000)
chunksize = 100
function reverseFFT_gpu(A, OUT, chunksize)
    dev = CuDevice(0)
    ctx = CuContext(dev)
    loops = fld(size(A, 3), chunksize)
    # Device buffers and plan, created once and reused for every chunk
    A_GPU = CuArray(A[:, :, 1:chunksize])
    OUT_GPU = cuzeros(Float32, 765, 765, chunksize)
    invplan = plan_brfft(A_GPU, 765, [1, 2])
    println("GPU -- Chunksize: $chunksize")
    @time begin
        for k in 1:loops
            gpustart = 1 + (k - 1) * chunksize
            gpustop = k * chunksize
            # Upload the next chunk, transform, and copy the result back to host
            A_GPU .= CuArray(view(A, :, :, gpustart:gpustop))
            OUT_GPU .= invplan * A_GPU
            OUT[:, :, gpustart:gpustop] .= collect(OUT_GPU)
        end
        destroy!(ctx)
    end
    return
end
function reverseFFT_cpu(A, OUT, chunksize)
    loops = fld(size(A, 3), chunksize)
    invplan = plan_brfft(A[:, :, 1:chunksize], 765, [1, 2])
    println("CPU -- Chunksize: $chunksize")
    @time begin
        for k in 1:loops
            cpustart = 1 + (k - 1) * chunksize
            cpustop = k * chunksize
            @views OUT[:, :, cpustart:cpustop] .= invplan * A[:, :, cpustart:cpustop]
        end
    end
    return
end
function reverseFFT_unbatched(A, OUT)
    dev = CuDevice(0)
    ctx = CuContext(dev)
    A_GPU = CuArray(A[:, :, 1])
    invplan = plan_brfft(A_GPU, 765, [1, 2])
    println("GPU -- Single Slice")
    @time begin
        for k in 1:size(A, 3)
            # One image slice per transfer and transform
            A_GPU .= CuArray(A[:, :, k])
            OUT[:, :, k] .= collect(invplan * A_GPU)
        end
        destroy!(ctx)
    end
    return
end
# Run tests
reverseFFT_cpu(A, OUT, chunksize)
reverseFFT_gpu(A, OUT, chunksize)
reverseFFT_unbatched(A, OUT)