Using MVector in CUDA without memory errors

I have an application where allocating a small, fixed-size vector for temporary storage inside a kernel is very beneficial. From reading this forum, issue 340 on CUDAnative.jl, and a comment on issue 38, I am assuming it is OK to use an MVector.

This simplified example seems to work:

using CUDA
using StaticArrays

function kernel1(a)
    nlevels = 6
    stack = ones(MVector{nlevels, Float32})

    nx = size(a)[1]
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if (ix <= nx)
        simple_func!(a[ix], stack)
    end
    return
end


function simple_func!(x, y)
    for i in 1:6
        y[i] = x * 2f0
    end
    return nothing
end

This runs fine, and there is no memory leak even if I run it repeatedly (e.g. with @benchmark):

N = 2^20
a = CUDA.fill(1f0, N)
@benchmark CUDA.@sync @cuda threads=1024 blocks=1024 kernel1($a)
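
As a sanity check, one way to confirm nothing leaks across repeated launches is CUDA.memory_status() (assuming a reasonably recent CUDA.jl), comparing the reported usage before and after:

CUDA.memory_status()   # report device memory usage before
for _ in 1:100
    CUDA.@sync @cuda threads=1024 blocks=1024 kernel1(a)
end
CUDA.memory_status()   # usage should be essentially unchanged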

However, if I modify my kernel slightly to save the results into a second variable:

function kernel2(a, b)
    nlevels = 6
    tmp = ones(MVector{nlevels, Float32})

    nx = size(a)[1]
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    
    if (ix <= nx)
        simple_func!(a[ix], tmp)
        for i in 1:nlevels
            b[i, ix] = tmp[i]
        end
    end
    return
end

I quickly run into “Out of dynamic GPU memory” errors:

N = 2^20
a = CUDA.fill(1f0, N)
b = CUDA.fill(0f0, (6, N));

CUDA.@sync @cuda threads=1024 blocks=1024 kernel2(a,b)
# ERROR: Out of dynamic GPU memory (trying to allocate 24 bytes)
# ....

Any idea on why I am running out of memory in the second case but not the first?

You can use @device_code_typed and @device_code_llvm to better understand what the compiler does. In your case, in kernel1, the compiler removed almost everything (except for a bounds check), since it could prove that there are no side effects.
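
For reference, a minimal way to do that inspection (using CUDA.jl's reflection macros) is to prefix the launch itself:

# Print the optimized LLVM IR for the kernel this launch compiles;
# @device_code_typed works the same way and shows the typed Julia IR.
@device_code_llvm @cuda threads=1024 blocks=1024 kernel1(a)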

Looking at kernel2, I see a call to

%4 = call fastcc {}* @gpu_gc_pool_alloc({ i64, i32 } %state, i64 24)

which explains why you run out of space: every thread dynamically allocates 24 bytes, exactly the six Float32s of the MVector, from the GC pool.

Now, MArray is finicky: since Julia uses bounds checking by default, and bounds-check error messages take a reference to the array itself, Julia can't elide the allocation for the MArray. Annotating the access with @inbounds avoids this:

function kernel2(a, b)
    nlevels = 6
    tmp = ones(MVector{nlevels, Float32})

    nx = size(a)[1]
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    if (ix <= nx)
        simple_func!(a[ix], tmp)
        for i in 1:nlevels
            b[i, ix] = @inbounds tmp[i]
        end
    end
    return
end

So LLVM can’t unroll the loop, thus can’t prove that the boundscheck branch is dead, and can’t ellide the MArray allocation.

Thanks for the quick reply! Indeed, that was the problem. I'll need to keep in mind that bounds checking does not play well with MArray.

We should fix that in StaticArrays.jl; I just never found the time to do so.
There are two problems:

  1. I think the error is actually generated by checkbounds in Base Julia.
  2. Any other function that takes an MArray as an argument and is not inlined will cause the same allocation (a possible workaround is sketched below).
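
To illustrate the second point, a minimal workaround (a sketch, assuming the helper is small enough to inline) is to force inlining of any function that takes the MArray, so that the array never escapes the kernel:

# Forcing the helper to inline keeps the MVector from escaping the
# kernel, so the compiler can still elide its allocation; @inbounds
# avoids the bounds-check problem described above.
@inline function simple_func!(x, y)
    for i in eachindex(y)
        @inbounds y[i] = x * 2f0
    end
    return nothing
end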