Efficiency when handling jobs larger than VRAM


Hi! GPGPU beginner here, using the CUDAnative/CuArrays stack (thank you @maleadt for this amazing work).

I am hoping to find some help with efficiently transferring data between host and device when the input and output data are much larger than what fits in device memory at once (in my case I have 5 GB of device memory to work with).

I am working on image processing that involves performing thousands of convolutions. Since the rate limiting step tends to be FFTs, I want to be running these large array stacks through the GPU.

The MWE of what I am using is below.

When I run this through nvprof I can see clearly that the actual FFT is fast and most of my time is spent on data transfer. I can also tell that the data transfer is serialized (HtD and DtH happen sequentially), but I am not sure how to effectively implement parallel transfer of data to and from the host using CUDAnative (from browsing CUDA literature online I know that I should be considering that.)

What’s strange to me are a few things:

  1. From my limited understanding there ought to be extra overhead associated with each transfer of data between host and device, such that many small transfers are slower than batched jobs. However, splitting the data into individual images rather than batched stacks turns out to be more performant in my case. The GPU speedup is also modest enough that it suggests running multithreaded on the CPU would be faster.

  2. Although I can see some speed increases as I increase the batch size, I hit a ceiling VERY quickly before I start getting memory allocation failures. Obviously, when creating an FFT plan, there’s a call to check for availability of workspace for the FFT computation. However, it seems absurd: passing a stack of less than 500 MB throws an allocation error, implying that the device needs 10X that amount of memory to run an FFT…???

  3. In general, a function like that shown in the MWE will need to be run thousands of times in my final algorithm. I have noticed creeping memory leaks when running my code despite trying to keep everything encapsulated in its own scope/context. Is there something I can do to mitigate that?
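To put item 2 in numbers, here is a rough per-chunk estimate of the device memory the MWE's loop touches. It is only a sketch: it uses the array sizes from the MWE, assumes that `CuArray(view(...))` and `invplan*A_GPU` each allocate a fresh temporary on every iteration, and leaves out the cuFFT workspace entirely, since its size is not known here.

```julia
# Per-chunk device memory estimate for the batched brfft in the MWE
# (cuFFT workspace deliberately excluded, its size is unknown here).
nx, ny = 765, 765          # real output size per slice
nxh = nx ÷ 2 + 1           # 383 complex rows stored for a real length of 765
chunksize = 100

in_bytes  = nxh * ny * chunksize * sizeof(ComplexF32)   # size of A_GPU
out_bytes = nx  * ny * chunksize * sizeof(Float32)      # size of OUT_GPU

# Assumption: inside the loop, CuArray(view(A, ...)) creates one more
# input-sized array and invplan * A_GPU one more output-sized array.
total_bytes = 2 * in_bytes + 2 * out_bytes

mib(b) = round(b / 2^20; digits = 1)
println("input chunk:  ", mib(in_bytes), " MiB")
println("output chunk: ", mib(out_bytes), " MiB")
println("resident without cuFFT workspace: ", mib(total_bytes), " MiB")
```

Under these assumptions a chunk of 100 slices already occupies roughly four times the size of the input stack before any FFT workspace is counted, which may account for part of the gap between the stack size and the point where allocation fails.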

Cheers and thanks in advance for any comments or help.

using FFTW
using CuArrays, CuArrays.CUFFT
using CUDAdrv, CUDAnative

# Data sized similar to real use. Reduce 3rd dimension if limited host RAM
A = rand(ComplexF32, 383, 765, 3000)
OUT = zeros(Float32, 765, 765, 3000)
chunksize = 100
function reverseFFT_gpu(A, OUT, chunksize)
    dev = CuDevice(0)
    ctx = CuContext(dev)
    loops = fld(size(A,3),chunksize)
    A_GPU = CuArray(A[:,:,1:chunksize])
    OUT_GPU = cuzeros(Float32, 765,765, chunksize)

    # Plan an unnormalized inverse real FFT over dims 1 and 2 (real output length 765)
    invplan = plan_brfft(A_GPU, 765, [1,2])

    println("GPU -- Chunksize: $chunksize")
    @time begin
    for k in 1:loops
        gpustart = 1 + (k-1)*chunksize
        gpustop = k*chunksize
        # Upload one chunk, transform it, and copy the result back to the host
        A_GPU .= CuArray(view(A,:,:,(gpustart:gpustop)))
        OUT_GPU .= invplan*A_GPU
        OUT[:,:,gpustart:gpustop] .= collect(OUT_GPU)
    end
    end
end

function reverseFFT_cpu(A, OUT, chunksize)

    loops = fld(size(A,3),chunksize)
    invplan = plan_brfft(A[:,:,1:chunksize], 765, [1,2])

    println("CPU -- Chunksize: $chunksize")
    @time begin
    for k in 1:loops
        cpustart = 1 + (k-1)*chunksize
        cpustop = k*chunksize
        @views OUT[:,:,cpustart:cpustop] .= invplan*A[:,:,(cpustart:cpustop)]
    end
    end
end

function reverseFFT_unbatched(A, OUT)
    dev = CuDevice(0)
    ctx = CuContext(dev)
    A_GPU = CuArray(A[:,:,1])
    invplan = plan_brfft(A_GPU, 765, [1,2])

    println("GPU -- Single Slice")
    @time begin
    # One upload, one FFT, and one download per image
    for k in 1:size(A,3)
        A_GPU .= CuArray(A[:,:,k])
        OUT[:,:,k] .= collect(invplan*A_GPU)
    end
    end
end

# Run tests
reverseFFT_cpu(A, OUT, chunksize)
reverseFFT_gpu(A, OUT, chunksize)
reverseFFT_unbatched(A, OUT)


Couple of suggestions, but I don’t have the time to look into your code right now.

On 1: Why not perform batched uploads but spawn multiple kernels/operations, either using an offset or with views? Alternatively, you could try and make your upload asynchronous such that there isn’t synchronization happening. Or put independent operations/kernels on separate streams, but CuArrays.jl doesn’t have support for specifying the stream yet (CUDAnative.jl and CUDAdrv do, so it might work for you).

On 2: CuArrays may very well have some memory allocation bugs going on. Be sure to have your FFT handles go out of scope and do GC afterwards. When regular allocations fail, we do that for you, but not for library allocations: https://github.com/JuliaGPU/CuArrays.jl/issues/130

On 3: There’s some very unpolished support for tracing allocations and dumping those when stuff goes wrong: https://github.com/JuliaGPU/CuArrays.jl/blob/3fa4bf09665fd39148005cd1ef77e2933b0613d2/src/memory.jl#L99, also see https://github.com/JuliaGPU/CuArrays.jl/pull/212


Hey thanks for the feedback!

Data transfer time >>> compute time no matter what size stack I haul over to the GPU. So I don’t think parallel kernels will make much of a difference.
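A toy timing model makes this point concrete. Everything in it is hypothetical: the function names and the per-chunk durations are made up, and the overlapped case assumes the device can run an upload and a download at the same time, which needs pinned host buffers and asynchronous copies on separate streams.

```julia
# Toy timing model for n chunks; all durations are hypothetical, in milliseconds.
serial_time(n, up, fft, down) = n * (up + fft + down)

# If upload, FFT, and download of different chunks overlap, the slowest
# stage sets the steady-state rate once the pipeline is full.
pipelined_time(n, up, fft, down) = (up + fft + down) + (n - 1) * max(up, fft, down)

n = 30                             # 3000 slices in chunks of 100
up, fft, down = 80.0, 5.0, 80.0    # made-up numbers with transfer >> compute

t_serial = serial_time(n, up, fft, down)
t_piped  = pipelined_time(n, up, fft, down)
println("serial:    ", t_serial, " ms")
println("pipelined: ", t_piped, " ms")
println("speedup:   ", round(t_serial / t_piped; digits = 2), "x")
```

In this model, when transfers dominate, overlapping compute with transfers gains almost nothing, while overlapping the two transfer directions approaches a 2x improvement at best. Anything beyond that has to come from faster transfers themselves.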

CUDAdrv has indeed helped solve some issues. The other problem was that I was not calling destroy_plan(). The first code example below produces no memory accumulation. Some things to note: calls to GC.gc() never free memory in my hands. Calling finalize() on any CuArray similarly fails. Only CUDAdrv’s Mem.free(::Buffer) works consistently. Also (oddly), calling finalize on the context AFTER running destroy! frees memory that otherwise never gets collected.

And yes, async via streams/events is where I am headed, or at least trying to. A couple hurdles right now that maybe (please?) you could look into…

  1. Implementation of cudaMallocHost in a straightforward way (e.g. returning an indexable array wrapping the memory). I found buried in CUDAdrv.Mem.alloc() that I can set ATTACH_HOST and get what looks like the correct behavior. Have a look at the second code sample. It’s super crude and obviously not stable, but I get memory transfer rates between host and device that are 5X as fast…

  2. Handles for cufftSetStream() and possibly cufftEstimate*D() would be super duper helpful too.

Again, thanks for your hard work. I don’t have the coding expertise to offer extra manpower, but perhaps I can do some documentation work and submit it as pull requests as a thank you. It would help me in the end to have some functions described when searched.

using FFTW
using CuArrays, CuArrays.CUFFT
using CUDAdrv, CUDAnative, CUDAdrv.Mem

A = round.(rand(ComplexF32, 383, 765, 3000).*100)
OUT = zeros(Float32, 765, 765, 3000)
chunksize = 100

@inline function bigrfft(A,OUT,chunksize)

    dev = CuDevice(0)
    ctx = CuContext(dev)

    # Allocate the device arrays once and keep handles to their raw buffers
    A_GPU = CuArray(A[:,:,1:chunksize])
    abuffer = CuArrays.buffer(A_GPU)
    OUT_GPU = cuzeros(Float32, size(OUT[:,:,1:chunksize]))
    outbuffer = CuArrays.buffer(OUT_GPU)
    p = plan_brfft(A_GPU, 765, [1,2])

    for j in 1:cld(size(A,3),chunksize)
        gpustart = 1 + (j-1)*chunksize
        gpustop = j*chunksize
        Mem.upload!(abuffer, A[:,:,gpustart:gpustop])
        # NOTE: OUT[:,:,gpustart:gpustop] is a copy, so this download does not write into OUT itself
        Mem.download!(OUT[:,:,gpustart:gpustop], outbuffer)
    end

    ctx = finalize(ctx)
end
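One detail about the chunk indexing: the first post uses `fld` to count chunks and this one uses `cld`. Both work for 3000 slices in chunks of 100, but for a slice count that is not a multiple of the chunk size, `fld` silently skips the tail and `cld` makes `gpustop` run past the end of the array. A minimal sketch of one way to get safe ranges, using a hypothetical helper `chunkranges` (not part of CuArrays or CUDAdrv):

```julia
# Split 1:n into consecutive ranges of at most `chunksize` elements.
# `chunkranges` is a hypothetical helper name used only for illustration.
chunkranges(n::Integer, chunksize::Integer) =
    collect(Iterators.partition(1:n, chunksize))

ranges = chunkranges(3050, 100)    # 3050 is a made-up, non-divisible slice count
println(length(ranges), " chunks, last = ", last(ranges))
```

The last range is shorter than the rest, so a plan and buffers created for a full chunk would not match it. Padding the final chunk or creating a second plan for the tail are two possible ways around that.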



Second example. Using pinned memory results in 5X host-gpu bandwidth on my workstation. Advisable to evaluate block by block because I can’t guarantee stability.

using CuArrays, CUDAdrv, CUDAdrv.Mem, CUDAnative
using Test

A = round.(rand(ComplexF32, 383, 765, 300).*100)
Asize = sizeof(A)

# Allocate a host-attached buffer and wrap it as a Julia Array holding a copy of hostArray
function pintoHost(hostArray::Array{T}) where T
    arraysize = sizeof(hostArray)
    arraybuffer = CUDAdrv.Mem.alloc(arraysize,true; flags=Mem.ATTACH_HOST)
    arrayptr = Base.unsafe_convert(Ptr{T}, arraybuffer.ptr)
    pinnedarray = unsafe_wrap(Array, arrayptr, size(hostArray); own = false)
    # Fill the new buffer with the host data; without this the check below cannot pass
    copyto!(pinnedarray, hostArray)
    return (pinnedarray,arraybuffer)
end

# Verify x is a bona fide copy
(x, xbuffer) = pintoHost(A)
@test x ≈ A

zero_GPU = cuzeros(ComplexF32,size(A))
zero1_GPU = cuzeros(ComplexF32, size(A))
zerobuffer = CuArrays.buffer(zero_GPU)
zero1buffer = CuArrays.buffer(zero1_GPU)

# Pinned memory transfer ≈ 5X faster than regular upload (for me)
@time  Mem.transfer!(zerobuffer, xbuffer, sizeof(A))
@time  Mem.upload!(zero1buffer, A)
@test collect(zero_GPU) ≈ A
@test collect(zero1_GPU) ≈ A

zero_GPU = nothing
zero1_GPU = nothing
zerobuffer = finalize(zerobuffer)
zero1buffer = finalize(zero1buffer)

# Test download for similar behavior
B = zeros(ComplexF32, size(A))
A_GPU = CuArray(rand(ComplexF32, size(A)).*100)
agpu_buffer = CuArrays.buffer(A_GPU)
(b, bbuffer) = pintoHost(B)

# Similarly fast
@time Mem.transfer!(bbuffer, agpu_buffer, sizeof(A))
@test collect(A_GPU) ≈ b

A_GPU = finalize(A_GPU)
agpu_buffer = nothing

x = finalize(x)
xbuffer = finalize(xbuffer)

b = finalize(b)
bbuffer = finalize(bbuffer)


We’re pretty short on manpower too, as the CuArrays.jl issue backlog shows… But well-isolated/documented issues are always appreciated, i.e. don’t hesitate to file issues on CUDAdrv for those specific API additions.

I’ll try and get back to some of your issues in this post when I have the time, but don’t hold your breath. The problem is that adding functionality to CUDAdrv isn’t much effort, but tying it into a usable API at the CuArrays level is much harder…