I have CUDA code in a package that I maintain. I create one CuArray from CUDAdrv.jl (to make space for multiple CURAND generators), and multiple CuArrays from CuArrays.jl, both to call kernels I wrote myself and to use CuArrays.CUBLAS.gemm, which itself allocates memory for its return value.
I have tried this code on a machine with a GTX 1080 (8 GB of RAM) and it breaks after 10 or so calls. On a machine with a Titan Xp (12 GB), the code runs fine.
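To make the difference between the two machines easier to see, I log free device memory between calls. This is a minimal sketch assuming CUDAdrv.jl's `Mem.info()` (a wrapper around `cuMemGetInfo` that returns `(free, total)` in bytes; the exact name may differ across CUDAdrv versions):

```julia
using CUDAdrv

# Print free vs. total device memory, in MiB, with a tag
# identifying where in the program the measurement was taken.
function log_gpu_mem(tag::AbstractString)
    free, total = CUDAdrv.Mem.info()
    println("[$tag] free: $(free ÷ 2^20) MiB / total: $(total ÷ 2^20) MiB")
end

# Usage, wrapped around each GPU-heavy step:
# log_gpu_mem("before callGPU")
# callGPU(100_000, 1)
# log_gpu_mem("after callGPU")
```

If the "free" number keeps shrinking across iterations even after `GC.gc()`, that points at memory not being returned to the driver rather than a single oversized allocation.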
MWE:
(v1.0) pkg> develop https://github.com/una-dinosauria/Rayuela.jl.git
Due to https://github.com/JuliaLang/Pkg.jl/issues/465, you also have to pull the latest changes manually:
cd ~/.julia/dev/Rayuela
git pull
Now, the following script is the MWE:
using Rayuela
function callGPU(n::Int, nsplits::Int)
  m, h, d = 8, 256, 128
  X = rand(Float32, d, n)
  B = convert(Matrix{Int16}, rand(1:256, m, n))
  C = Vector{Matrix{Float32}}(undef, m)
  for i = 1:m; C[i] = rand(Float32, d, h); end
  ilsiters = [4]
  icmiters = 4
  npert = 4
  randord = true
  V = true
  B = Rayuela.encode_icm_cuda(X, B, C, ilsiters, icmiters, npert, randord, nsplits, V)
  B
end
function main(breakit)
  for i = 1:100
    for j = 1:2
      callGPU(100_000, 1)
      GC.gc()
    end
    nsplits = breakit ? 2 : 10
    callGPU(1_000_000, nsplits)
    GC.gc()
  end
end
breakit = true
main(breakit)
I hope this is minimal enough; I can remove the Rayuela dependency if that is too much.