Freeing memory in the GPU with CUDAdrv / CUDAnative / CuArrays

I am writing some code that calls CUDA kernels via CUDAdrv, allocates some CuArrays and uses a generic matrix addition (which I think is done via CUDAnative).

The problem I have is that after I call this code a couple of times, my GPU runs out of memory, as it seems that calling gc() does not free memory in the GPU.

What is the correct way to free memory in the GPU?

GPU memory is managed through the GC, although indirectly: when CuArray instances go out of scope and are collected by the Julia GC, the GPU memory's refcount is lowered, and the memory is freed once it drops to 0. So make sure your arrays are out of scope before calling gc(), and make sure no other objects share the memory (eg. through a view). You can enable debug messages that print during finalization using JULIA_DEBUG=CUDAdrv on 0.7, and TRACE=1 with --compile-cache=no on 0.6.
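To illustrate the scoping rule (a sketch, not from the original post; the CuArray constructor signature may differ between CUDAdrv versions):

```julia
using CUDAdrv

function work()
    d_a = CuArray{Float32}(1024)  # GPU allocation
    # ... launch kernels that use d_a ...
    return nothing                # d_a becomes unreachable here
end

work()
GC.gc()  # d_a is out of scope, so its GPU buffer can now be finalized and freed
```

Had `work()` returned `d_a`, or had a view of it been kept alive elsewhere, the buffer would survive collection.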

Alternatively, you can force early collection by calling finalize on an array. IIRC this is a pretty slow call though, and we should probably add a different early-freeing mechanism. It also won’t do anything if the buffer’s refcount hasn’t dropped to 0.
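For example (a sketch, assuming `d_a` is the only reference to its buffer):

```julia
d_a = CuArray{Float32}(1024)
# ... use d_a ...
finalize(d_a)  # eagerly lowers the refcount; the GPU buffer is freed if it drops to 0
```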

EDIT: of course, I have assumed you’re talking about CUDAdrv’s CuArray. If you’re talking about CuArrays.jl, there’s an additional level of memory pooling. It should try to free memory by calling gc() once it encounters an out-of-memory error during allocation; other than that, the same rules as above apply (objects should be out of scope, refcounting).

Hi again @maleadt,

Sorry, do you have any suggestions on how to debug this?

I have tried to encapsulate all my allocations in a single function, and then called GC.gc(), but apparently I am still missing some objects, as I keep running out of memory.

Is there a way to eg. list all the objects allocated on the GPU? Or at least the ones currently in scope?

Any help is much appreciated,

MWE? Or at least some details; from your original post it wasn’t clear whether you are using CUDAdrv.CuArray or CuArrays.jl.

Assuming the latter, we could add some infrastructure to print the live buffers in the pool (see memory.jl), but that would require some engineering. Maybe it would be easier to show some reproducing code and let us have a look.

I have CUDA code in a package that I maintain. I create one CuArray from CUDAdrv.jl (to make space for multiple CURAND generators), and multiple CuArrays from CuArrays.jl to call kernels I wrote myself, and to use CuArrays.CUBLAS.gemm, which itself allocates memory for the return.

I have tried this code on a machine with a GTX 1080 (8 GB of RAM), and it breaks after 10 or so calls. On a machine with a Titan XP (12 GB), the code runs well.


(v1.0) pkg> develop

Then run

cd ~/.julia/dev/Rayuela
git pull

to get the latest changes. Now, the following script is the MWE:

using Rayuela

function callGPU(n::Int, nsplits::Int)

  m, h, d = 8, 256, 128

  X = rand(Float32, d, n)
  B = convert(Matrix{Int16}, rand(1:256, m, n))
  C = Vector{Matrix{Float32}}(undef, m)
  for i = 1:m; C[i] = rand(Float32, d, h); end

  ilsiters = [4]
  icmiters = 4
  npert = 4
  randord = true

  V = true

  B = Rayuela.encode_icm_cuda(X, B, C, ilsiters, icmiters, npert, randord, nsplits, V)
end

function main(breakit)

  for i = 1:100
    for j = 1:2
      callGPU(100_000, 1)
    end
    nsplits = breakit ? 2 : 10
    callGPU(1_000_000, nsplits)
  end
end

breakit = true
main(breakit)

Hope this is minimal enough, although I can remove the Rayuela dependency if that is too much.

Note that add works just as well.


Great, I’ll have a look. Probably not before next week due to deadlines.


Sorry for the delay. Was going to have a look, but the code runs into a CUBLAS error:

julia> using Revise

julia> Revise.includet("src/10946.jl")
[ Info: Recompiling stale cache file /home/tbesard/Julia/depot/compiled/v1.0/Rayuela/4wdef.ji for Rayuela [84bd14ec-51ef-568a-9c69-e494d1752004]

julia> main()
Creating 100000 random states... done in 0.08 seconds
 ILS iteration 1/4 done.  0.00% new codes are equal. 100.00% new codes are better.
 ILS iteration 2/4 done. 80.12% new codes are equal.  8.71% new codes are better.
 ILS iteration 3/4 done. 84.26% new codes are equal.  4.95% new codes are better.
 ILS iteration 4/4 done. 87.63% new codes are equal.  2.54% new codes are better.
 Encoding done in 2.15 seconds
Creating 100000 random states... done in 0.02 seconds
ERROR: CUBLASError(code 14, an internal operation failed)
 [1] macro expansion at /home/tbesard/Julia/CuArrays/src/blas/error.jl:45 [inlined]
 [2] gemm!(::Char, ::Char, ::Float32, ::CuArrays.CuArray{Float32,2}, ::CuArrays.CuArray{Float32,2}, ::Float32, ::CuArrays.CuArray{Float32,2}) at /home/tbesard/Julia/CuArrays/src/blas/wrappers.jl:888
 [3] gemm at /home/tbesard/Julia/CuArrays/src/blas/wrappers.jl:903 [inlined]
 [4] encode_icm_cuda_single(::Array{Float32,2}, ::Array{Int16,2}, ::Array{Array{Float32,2},1}, ::Array{Int64,1}, ::Int64, ::Int64, ::Bool, ::Bool) at /home/tbesard/Julia/Rayuela/src/LSQ_GPU.jl:71
 [5] encode_icm_cuda(::Array{Float32,2}, ::Array{Int16,2}, ::Array{Array{Float32,2},1}, ::Array{Int64,1}, ::Int64, ::Int64, ::Bool, ::Int64, ::Bool) at /home/tbesard/Julia/Rayuela/src/LSQ_GPU.jl:231
 [6] main(::Bool) at /home/tbesard/Julia/CuArrays/devel/10946/src/10946.jl:18
 [7] main() at /home/tbesard/Julia/CuArrays/devel/10946/src/10946.jl:23
 [8] top-level scope at none:0

Any ideas?

Thanks for looking into this. Currently in crunch time due to CVPR, but will get back ASAP.