I have an application where allocating a small, fixed-size vector for temporary storage inside a kernel is very beneficial. From reading this forum, CUDAnative.jl issue #340, and a comment on issue #38, I assume it is OK to use an MVector.
This simplified example seems to work:
using CUDA
using StaticArrays
function kernel1(a)
    nlevels = 6
    # small fixed-size scratch buffer, private to each thread
    stack = ones(MVector{nlevels, Float32})
    nx = size(a, 1)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if ix <= nx
        simple_func!(a[ix], stack)
    end
    return
end
function simple_func!(x, y)
    for i in 1:6
        y[i] = x * 2f0
    end
    return nothing
end
This runs fine, and there is no memory leak even if I run it repeatedly (e.g. with @benchmark):
N = 2^20
a = CUDA.fill(1f0, N)
@benchmark CUDA.@sync @cuda threads=1024 blocks=1024 kernel1($a)
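As a sanity check that nothing leaks across launches, the reported usage can also be watched directly; a minimal sketch (CUDA.memory_status() prints the current GPU memory usage):

CUDA.memory_status()
for _ in 1:100
    CUDA.@sync @cuda threads=1024 blocks=1024 kernel1(a)
end
# If nothing leaks, the usage reported here should match the first call.
CUDA.memory_status()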
However, if I modify the kernel slightly to write the results into a second array:
function kernel2(a, b)
    nlevels = 6
    # same per-thread scratch buffer as in kernel1
    tmp = ones(MVector{nlevels, Float32})
    nx = size(a, 1)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if ix <= nx
        simple_func!(a[ix], tmp)
        # copy the scratch buffer out to global memory
        for i in 1:nlevels
            b[i, ix] = tmp[i]
        end
    end
    return
end
I quickly run into “Out of dynamic GPU memory” errors:
N = 2^20
a = CUDA.fill(1f0, N)
b = CUDA.fill(0f0, (6, N));
CUDA.@sync @cuda threads=1024 blocks=1024 kernel2(a, b)
# ERROR: Out of dynamic GPU memory (trying to allocate 24 bytes)
# ....
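In case it helps, the device code for the two kernels can be compared with CUDA.jl's reflection macros; a minimal sketch (launch=false compiles the kernel without running it):

# Dump the LLVM IR for each kernel; a dynamic per-thread allocation
# should show up as a call into the GPU's malloc runtime.
CUDA.@device_code_llvm @cuda launch=false kernel1(a)
CUDA.@device_code_llvm @cuda launch=false kernel2(a, b)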
Any idea why I am running out of memory in the second case but not the first?