I have an application where allocating a small, fixed-size vector for temporary storage inside a kernel is very beneficial. From reading this forum, CUDAnative.jl issue #340, and a comment on issue #38, I assume it is OK to use an MVector.
This simplified example seems to work:
using CUDA
using StaticArrays
function kernel1(a)
    nlevels = 6
    # small fixed-size scratch buffer, private to each thread
    stack = ones(MVector{nlevels, Float32})
    nx = size(a, 1)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if ix <= nx
        simple_func!(a[ix], stack)
    end
    return
end
function simple_func!(x, y)
    for i in 1:6
        y[i] = x * 2f0
    end
    return nothing
end
This runs fine, and there is no memory leak even if I run it repeatedly (e.g. with @benchmark):
N = 2^20
a = CUDA.fill(1f0, N)
@benchmark CUDA.@sync @cuda threads=1024 blocks=1024 kernel1($a)
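As a sanity check that nothing leaks across launches, the reported usage can also be watched directly; a minimal sketch (CUDA.memory_status() prints the current GPU memory usage):

CUDA.memory_status()
for _ in 1:100
    CUDA.@sync @cuda threads=1024 blocks=1024 kernel1(a)
end
# If nothing leaks, the usage reported here should match the first call.
CUDA.memory_status()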
However, if I modify the kernel slightly to write the results into a second array:
function kernel2(a, b)
    nlevels = 6
    # same per-thread scratch buffer as in kernel1
    tmp = ones(MVector{nlevels, Float32})
    nx = size(a, 1)
    ix = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if ix <= nx
        simple_func!(a[ix], tmp)
        # copy the scratch buffer out to global memory
        for i in 1:nlevels
            b[i, ix] = tmp[i]
        end
    end
    return
end
I quickly run into “Out of dynamic GPU memory” errors:
N = 2^20
a = CUDA.fill(1f0, N)
b = CUDA.fill(0f0, (6, N));
CUDA.@sync @cuda threads=1024 blocks=1024 kernel2(a, b)
# ERROR: Out of dynamic GPU memory (trying to allocate 24 bytes)
# ....
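In case it helps, the device code for the two kernels can be compared with CUDA.jl's reflection macros; a minimal sketch (launch=false compiles the kernel without running it):

# Dump the LLVM IR for each kernel; a dynamic per-thread allocation
# should show up as a call into the GPU's malloc runtime.
CUDA.@device_code_llvm @cuda launch=false kernel1(a)
CUDA.@device_code_llvm @cuda launch=false kernel2(a, b)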
Any idea why I am running out of memory in the second case but not the first?