Hey thanks for the feedback!
Data transfer time >>> compute time no matter what size stack I haul over to the GPU, so I don’t think parallel kernels will make much of a difference.
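(For anyone who wants to check that balance on their own card, here is a rough sketch reusing only the calls from the examples below; the sizes are placeholders, and I’m assuming CuCurrentContext() hands back the context CuArrays sets up on first use.)

using FFTW
using CuArrays, CuArrays.CUFFT
using CUDAdrv, CUDAdrv.Mem

chunk = round.(rand(ComplexF32, 383, 765, 100).*100)
A_GPU = CuArray(chunk)                      # first CuArray call initializes the context
buf   = CuArrays.buffer(A_GPU)
p     = plan_brfft(A_GPU, 765, [1,2])
ctx   = CuCurrentContext()

# Run each @time twice; the first run includes compilation.
@time Mem.upload!(buf, chunk)               # host -> device transfer
@time (p*A_GPU; CUDAdrv.synchronize(ctx))   # transform + wait for the GPU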
CUDAdrv has indeed helped solve some issues. The other issue was not calling destroy_plan(). The first code example below produces exactly zero memory accumulation. Some things to note: calls to GC.gc() never free memory in my hands, and calling finalize() on any CuArray similarly fails; only CUDAdrv’s Mem.free(::Buffer) works consistently. Also (oddly), calling finalize on the context AFTER running destroy! frees memory that otherwise never gets collected.
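Condensed, the cleanup sequence that actually releases device memory for me looks like this (same calls as in the first example below, just isolated):

using CuArrays, CUDAdrv, CUDAdrv.Mem

dev = CuDevice(0)
ctx = CuContext(dev)
CUDAdrv.activate(ctx)

X   = CuArray(rand(Float32, 1024, 1024))
buf = CuArrays.buffer(X)

# GC.gc() and finalize(X) never released this allocation for me;
# only the explicit free on the underlying buffer does:
Mem.free(buf)

# Destroying the context and THEN finalizing it releases whatever is left:
CUDAdrv.destroy!(ctx)
ctx = finalize(ctx)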
And yes, async via streams/events is where I’m headed, or at least trying to head. A couple of hurdles right now that maybe (please?) you could look into:
- Implementation of cudaMallocHost in a straightforward way (e.g. returning an indexable array wrapping the memory). I found, buried in CUDAdrv.Mem.alloc(), that I can set ATTACH_HOST and get what looks like the correct behavior. Have a look at the second code sample. It’s super crude and obviously not stable, but I get memory transfer rates between host and device that are 5X as fast…
- Handles for cufftSetStream() and possibly cufftEstimate*D() would be super duper helpful too (see the rough ccall sketch right after this list for what I mean).
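For what it’s worth, the raw call I’m after for the stream part is just the one below. This is only a sketch against the CUFFT C API: the "libcufft" name and the idea that the plan’s integer handle and raw stream pointer are reachable are my assumptions, not CuArrays’ actual internals.

# C signature: cufftResult cufftSetStream(cufftHandle plan, cudaStream_t stream)
const cufftHandle_t = Cint    # cufftHandle is a plain int

function cufft_set_stream(plan_handle::cufftHandle_t, stream_ptr::Ptr{Cvoid})
    status = ccall((:cufftSetStream, "libcufft"), Cint,
                   (cufftHandle_t, Ptr{Cvoid}), plan_handle, stream_ptr)
    status == 0 || error("cufftSetStream failed with status $status")  # 0 == CUFFT_SUCCESS
    return nothing
end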
Again, thanks for your hard work. I don’t have the expertise to offer extra coding manpower, but perhaps I can do some documentation work and submit it as pull requests as a thank-you. It would help me in the end to have some functions described when searched.
using FFTW
using CuArrays, CuArrays.CUFFT
using CUDAdrv, CUDAnative, CUDAdrv.Mem

A = round.(rand(ComplexF32, 383, 765, 3000).*100)
OUT = zeros(Float32, 765, 765, 3000)
chunksize = 100

@inline function bigrfft(A, OUT, chunksize)
    dev = CuDevice(0)
    ctx = CuContext(dev)
    CUDAdrv.activate(ctx)

    # Device storage for one chunk of input and output; keep handles to the
    # underlying buffers so they can be freed explicitly at the end.
    A_GPU = CuArray(A[:,:,1:chunksize])
    abuffer = CuArrays.buffer(A_GPU)
    OUT_GPU = cuzeros(Float32, size(OUT[:,:,1:chunksize]))
    outbuffer = CuArrays.buffer(OUT_GPU)
    outchunk = zeros(Float32, size(OUT,1), size(OUT,2), chunksize)  # host staging for results
    p = plan_brfft(A_GPU, 765, [1,2])

    for j in 1:cld(size(A,3), chunksize)
        gpustart = 1 + (j-1)*chunksize
        gpustop = j*chunksize
        Mem.upload!(abuffer, A[:,:,gpustart:gpustop])   # host -> device
        copyto!(OUT_GPU, p*A_GPU)                       # run the transform on the device
        CUDAdrv.synchronize(ctx)
        # Download into a plain Array, then copy into OUT; downloading straight into
        # OUT[:,:,gpustart:gpustop] would only fill a temporary copy of that slice.
        Mem.download!(outchunk, outbuffer)
        OUT[:,:,gpustart:gpustop] .= outchunk
    end

    # Explicit cleanup: only Mem.free, destroy_plan and destroy!/finalize on the
    # context reliably release memory here.
    Mem.free(abuffer)
    Mem.free(outbuffer)
    CuArrays.CUFFT.destroy_plan(p)
    CUDAdrv.destroy!(ctx)
    ctx = finalize(ctx)
    return
end

bigrfft(A, OUT, chunksize)
Second example: using pinned memory gives roughly 5X the host-to-GPU bandwidth on my workstation. It’s advisable to evaluate it block by block, because I can’t guarantee stability.
using CuArrays, CUDAdrv, CUDAdrv.Mem, CUDAnative
using Test
A = round.(rand(ComplexF32, 383, 765, 300).*100)
Asize = sizeof(A)
# Allocate host memory that the GPU has pinned
function pintoHost(hostArray::Array{T}) where T
    arraysize = sizeof(hostArray)
    # Mem.alloc with ATTACH_HOST hands back a host-accessible buffer
    arraybuffer = CUDAdrv.Mem.alloc(arraysize, true; flags=Mem.ATTACH_HOST)
    # Wrap the raw pointer as a normal Julia Array (not owned, so the GC won't free it)
    arrayptr = Base.unsafe_convert(Ptr{T}, arraybuffer.ptr)
    pinnedarray = unsafe_wrap(Array, arrayptr, size(hostArray); own = false)
    copyto!(pinnedarray, hostArray)
    return (pinnedarray, arraybuffer)
end
# Verify x is a bona fide copy
(x, xbuffer) = pintoHost(A)
@test x ≈ A
zero_GPU = cuzeros(ComplexF32,size(A))
zero1_GPU = cuzeros(ComplexF32, size(A))
zerobuffer = CuArrays.buffer(zero_GPU)
zero1buffer = CuArrays.buffer(zero1_GPU)
# Pinned memory transfer ≈ 5X faster than regular upload (for me)
@time Mem.transfer!(zerobuffer, xbuffer, sizeof(A))
@time Mem.upload!(zero1buffer, A)
@test collect(zero_GPU) ≈ A
@test collect(zero1_GPU) ≈ A
Mem.free(zerobuffer)
Mem.free(zero1buffer)
zero_GPU = nothing
zero1_GPU = nothing
zerobuffer = finalize(zerobuffer)
zero1buffer = finalize(zero1buffer)
# Test download for similar behavior
B = zeros(ComplexF32, size(A))
A_GPU = CuArray(rand(ComplexF32, size(A)).*100)
agpu_buffer = CuArrays.buffer(A_GPU)
(b, bbuffer) = pintoHost(B)
# Similarly fast
@time Mem.transfer!(bbuffer, agpu_buffer, sizeof(A))
@test collect(A_GPU) ≈ b
Mem.free(agpu_buffer)
A_GPU = finalize(A_GPU)
agpu_buffer = nothing
# NOTE: MUST BE IN THIS ORDER OR SEG FAULT CITY = YOU
x = finalize(x)
Mem.free(xbuffer)
xbuffer = finalize(xbuffer)
b = finalize(b)
Mem.free(bbuffer)
bbuffer = finalize(bbuffer)
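And to show where I’m trying to go with this: a sketch of how the pinned staging buffer would replace the plain Mem.upload! in the chunked loop from the first example (it assumes pintoHost from above; the FFT and download parts are elided).

A = round.(rand(ComplexF32, 383, 765, 300).*100)
chunksize = 100
(stage, stagebuf) = pintoHost(A[:,:,1:chunksize])   # pinned host staging area
A_GPU   = CuArray(A[:,:,1:chunksize])
abuffer = CuArrays.buffer(A_GPU)

for j in 1:cld(size(A,3), chunksize)
    gpustart = 1 + (j-1)*chunksize
    gpustop  = j*chunksize
    copyto!(stage, A[:,:,gpustart:gpustop])           # pageable host -> pinned host
    Mem.transfer!(abuffer, stagebuf, sizeof(stage))   # pinned host -> device (fast path)
    # ...run the FFT on A_GPU and download the result as in the first example...
end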