Efficiency when handling jobs larger than VRAM

Hey thanks for the feedback!

Data transfer time >>> compute time, no matter what size stack I haul over to the GPU, so I don’t think parallel kernels will make much of a difference.

CUDAdrv has indeed helped solve some issues. The other culprit was not calling destroy_plan(). The first code example below now produces no memory accumulation at all. Some things to note: calls to GC.gc() never free memory in my hands, and calling finalize() on any CuArray similarly fails; only CUDAdrv’s Mem.free(::Buffer) works consistently. Also (oddly), calling finalize() on the context AFTER running destroy!() frees memory that otherwise never gets collected.
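
To be concrete, the cleanup sequence that reliably releases everything for me looks like this (condensed from the first example below; the ordering is what matters):

Mem.free(CuArrays.buffer(A_GPU))    # only explicit Mem.free(::Buffer) works for me
CuArrays.CUFFT.destroy_plan(p)      # forgetting this was my other leak
CUDAdrv.destroy!(ctx)
finalize(ctx)                       # oddly, only collects if called AFTER destroy!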

And yes, async via streams/events is where I am headed, or at least trying to. A couple hurdles right now that maybe (please?) you could look into…

  1. An implementation of cudaMallocHost in a straightforward way (e.g. returning an indexable array wrapping the memory). I found, buried in CUDAdrv.Mem.alloc(), that I can set ATTACH_HOST and get what looks like the correct behavior. Have a look at the second code sample. It’s super crude and obviously not stable, but I get memory transfer rates between host and device that are 5X as fast…

  2. Handles for cufftSetStream() and possibly cufftEstimate*D() would be super duper helpful too (see the rough sketch after this list for the stopgap I have in mind).
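
For the cufftSetStream() part, this is only a rough, untested sketch of the kind of stopgap I mean: ccall straight into libcufft with the raw plan handle and a CUDAdrv stream. I’m guessing at the .handle field names on the plan and stream objects and at how the library name resolves, so please read it as an illustration rather than something I’ve run.

using CUDAdrv, CuArrays, CuArrays.CUFFT

stream = CuStream()   # non-default stream from CUDAdrv

# C signature: cufftResult cufftSetStream(cufftHandle plan, cudaStream_t stream)
# Assumes p is an existing CUFFT plan and that p.handle / stream.handle expose the
# raw cufftHandle and CUstream (field names are my guess at the internals).
status = ccall((:cufftSetStream, "libcufft"), Cint,
               (Cint, Ptr{Cvoid}), p.handle, stream.handle)
@assert status == 0   # 0 == CUFFT_SUCCESS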

Again, thanks for your hard work. I don’t have the expertise to offer extra manpower on the coding side, but perhaps I can do some documentation work and submit it as pull requests as a thank-you. It would help me in the end to have some of these functions show up documented when searched.

using FFTW
using CuArrays, CuArrays.CUFFT
using CUDAdrv, CUDAnative, CUDAdrv.Mem

A = round.(rand(ComplexF32, 383, 765, 3000).*100)
OUT = zeros(Float32, 765, 765, 3000)
chunksize = 100

@inline function bigrfft(A,OUT,chunksize)

    dev = CuDevice(0)
    ctx = CuContext(dev)
    CUDAdrv.activate(ctx)

    A_GPU = CuArray(A[:,:,1:chunksize])
    abuffer = CuArrays.buffer(A_GPU)
    OUT_GPU = cuzeros(Float32, size(OUT[:,:,1:chunksize]))
    outbuffer = CuArrays.buffer(OUT_GPU)
    p = plan_brfft(A_GPU, 765, [1,2])

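    # assumes size(A,3) is an exact multiple of chunksize (3000 / 100 here);
    # a trailing partial chunk would need its own handling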
    for j in 1:cld(size(A,3),chunksize)
        gpustart = 1 + (j-1)*chunksize
        gpustop = j*chunksize
        Mem.upload!(abuffer, A[:,:,gpustart:gpustop])
        copyto!(OUT_GPU,p*A_GPU)
        CUDAdrv.synchronize(ctx)
        Mem.download!(OUT[:,:,gpustart:gpustop], outbuffer)
    end

    Mem.free(abuffer)
    Mem.free(outbuffer)
    CuArrays.CUFFT.destroy_plan(p)
    CUDAdrv.destroy!(ctx)
    ctx = finalize(ctx)

    return
end

bigrfft(A,OUT,chunksize)

Second example: using pinned memory gives roughly 5X the host-to-GPU bandwidth on my workstation. It’s advisable to evaluate it block by block, because I can’t guarantee stability.

using CuArrays, CUDAdrv, CUDAdrv.Mem, CUDAnative
using Test

A = round.(rand(ComplexF32, 383, 765, 300).*100)
Asize = sizeof(A)

# Allocate host memory that behaves as pinned for the GPU
function pintoHost(hostArray::Array{T}) where T
    arraysize = sizeof(hostArray)
    arraybuffer = CUDAdrv.Mem.alloc(arraysize,true; flags=Mem.ATTACH_HOST)
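    # (`true` here asks CUDAdrv.Mem.alloc for a managed allocation, as far as I can tell,
    #  and ATTACH_HOST is what makes the wrap-and-copy below behave like pinned host memory)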
    arrayptr = Base.unsafe_convert(Ptr{T}, arraybuffer.ptr)
    pinnedarray = unsafe_wrap(Array, arrayptr, size(hostArray); own = false)
    copyto!(pinnedarray,hostArray)
    return (pinnedarray,arraybuffer)
end
# Verify x is a bona fide copy
(x, xbuffer) = pintoHost(A)
@test x ≈ A

zero_GPU = cuzeros(ComplexF32,size(A))
zero1_GPU = cuzeros(ComplexF32, size(A))
zerobuffer = CuArrays.buffer(zero_GPU)
zero1buffer = CuArrays.buffer(zero1_GPU)

# Pinned memory transfer ≈ 5X faster than regular upload (for me)
@time  Mem.transfer!(zerobuffer, xbuffer, sizeof(A))
@time  Mem.upload!(zero1buffer, A)
@test collect(zero_GPU) ≈ A
@test collect(zero1_GPU) ≈ A

Mem.free(zerobuffer)
Mem.free(zero1buffer)
zero_GPU = nothing
zero1_GPU = nothing
zerobuffer = finalize(zerobuffer)
zero1buffer = finalize(zero1buffer)

# Test download for similar behavior
B = zeros(ComplexF32, size(A))
A_GPU = CuArray(rand(ComplexF32, size(A)).*100)
agpu_buffer = CuArrays.buffer(A_GPU)
(b, bbuffer) = pintoHost(B)

# Similarly fast
@time Mem.transfer!(bbuffer, agpu_buffer, sizeof(A))
@test collect(A_GPU) ≈ b

Mem.free(agpu_buffer)
A_GPU = finalize(A_GPU)
agpu_buffer = nothing

# NOTE: MUST BE IN THIS ORDER OR SEG FAULT CITY = YOU
x = finalize(x)
Mem.free(xbuffer)
xbuffer = finalize(xbuffer)

b = finalize(b)
Mem.free(bbuffer)
bbuffer = finalize(bbuffer)