Freeing memory in the GPU with CUDAdrv / CUDAnative / CuArrays

una-dinosauria · May 17, 2018, 1:16am

I am writing some code that calls CUDA kernels via CUDAdrv, allocates some CuArrays and uses a generic matrix addition (which I think is done via CUDAnative).

The problem I have is that after I call this code a couple of times, my GPU runs out of memory, as it seems that calling gc() does not free memory in the GPU.

What is the correct way to free memory in the GPU?

maleadt · May 17, 2018, 5:51am

GPU memory is managed through the GC, although indirectly: when CuArray instances go out of scope and are collected by the Julia GC, the refcount of the underlying GPU buffer is lowered, and the buffer is freed once that count drops to 0. So make sure your arrays are out of scope before calling gc(), and make sure no other objects share the memory (e.g. through a view). You can enable debug messages that print during finalization using JULIA_DEBUG=CUDAdrv on 0.7, and TRACE=1 with --compile-cache=no on 0.6.

Alternatively, you can force early collection by calling finalize on an array. IIRC this is a pretty slow call though, and we should probably add a different early-freeing mechanic. It also won’t do anything if the buffer’s refcount hasn’t dropped to 0.
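To make the refcounting behaviour concrete, here is a minimal CPU-only sketch of the mechanism described above. MockBuffer, MockArray and release! are hypothetical stand-ins invented for illustration; they are not CUDAdrv's actual types or implementation. The sketch assumes Julia 0.7 or later, where the call is finalizer(f, x) (on 0.6 the argument order was reversed).

```julia
# Hypothetical stand-in for a refcounted GPU buffer (not a CUDAdrv type).
mutable struct MockBuffer
    bytes::Int
    refcount::Int
    freed::Bool
end

# Hypothetical stand-in for an array object that owns a share of a buffer.
mutable struct MockArray
    buf::MockBuffer
    function MockArray(buf::MockBuffer)
        buf.refcount += 1            # every owner bumps the refcount
        arr = new(buf)
        finalizer(release!, arr)     # runs on GC, or early via finalize(arr)
        return arr
    end
end

# Finalizer: lower the refcount and "free" the buffer only when it hits 0.
function release!(arr::MockArray)
    buf = arr.buf
    buf.refcount -= 1
    if buf.refcount == 0
        buf.freed = true
    end
    return nothing
end

buf = MockBuffer(1024, 0, false)
a = MockArray(buf)
v = MockArray(buf)   # a second owner of the same memory, like a view

finalize(a)          # early finalization of one owner
@assert !buf.freed   # still shared with v, so nothing is freed yet

finalize(v)          # last owner released
@assert buf.freed    # now the buffer is considered freed
```

In this sketch, finalizing one owner while another object still shares the buffer frees nothing, which is why a lingering view can keep GPU memory alive.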

EDIT: of course, I have assumed you’re talking about CUDAdrv’s CuArray. If you’re talking about CuArrays.jl, there’s an additional level of memory pooling. It should try and free by calling gc() once it encounters an out-of-memory error during allocation, and other than that the same rules from above (objects should be out of scope, refcounting) apply.
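The "call gc() on out-of-memory, then try again" behaviour can be sketched as follows. This is only an illustration of the pattern under stated assumptions, not CuArrays.jl's actual pool code: MockOOM, alloc_with_retry and flaky_alloc are hypothetical names, and the allocator is simulated on the CPU.

```julia
# Hypothetical exception standing in for a GPU out-of-memory error.
struct MockOOM <: Exception end

# Try to allocate; on an out-of-memory error, run the GC so that finalizers
# of unreachable arrays can release their buffers, then retry exactly once.
function alloc_with_retry(alloc::Function, bytes::Int)
    try
        return alloc(bytes)
    catch err
        err isa MockOOM || rethrow()
        GC.gc()
        return alloc(bytes)
    end
end

# Simulated allocator: fails on the first attempt, succeeds afterwards.
attempts = Ref(0)
function flaky_alloc(bytes::Int)
    attempts[] += 1
    attempts[] == 1 && throw(MockOOM())
    return Vector{UInt8}(undef, bytes)
end

buf = alloc_with_retry(flaky_alloc, 16)
@assert length(buf) == 16
@assert attempts[] == 2
```

The retry can only help if the arrays holding the memory are actually unreachable at that point; objects that are still referenced survive the collection, and the second attempt fails as well.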

una-dinosauria · October 3, 2018, 1:37am

Hi again @maleadt,

Sorry, do you have any suggestions on how to debug this?

I have tried encapsulating all my allocations in a single function and then calling GC.gc(), but apparently I am still missing some objects, as I keep running out of memory.

Is there a way to, e.g., list all the objects allocated on the GPU? Or at least the ones currently in scope?

Any help is much appreciated,

maleadt · October 4, 2018, 6:21am

MWE? Or at least some details: from your original post it wasn’t clear whether you are using CUDAdrv.CuArray or CuArrays.jl.

Assuming the latter, we could add some infrastructure to print the live buffers in the pool (see memory.jl), but that would require some engineering. Maybe it would be easier to show some reproducing code and let us have a look.

una-dinosauria · October 6, 2018, 5:10pm

I have CUDA code in a package that I maintain. I create one CuArray from CUDAdrv.jl (to make space for multiple CURAND generators), and multiple CuArrays from CuArrays.jl to call kernels I wrote myself, and to use CuArrays.CUBLAS.gemm, which itself allocates memory for the return.

I have tried this code on a machine with a GTX 1080 (8 GB of RAM), and it breaks after 10 or so calls. On a machine with a Titan XP (12 GB), the code runs well.

MWE:

(v1.0) pkg> develop https://github.com/una-dinosauria/Rayuela.jl.git

Due to https://github.com/JuliaLang/Pkg.jl/issues/465, you then need to pull manually to get the latest changes:

cd ~/.julia/dev/Rayuela
git pull

Now, the following script is the MWE:

using Rayuela

function callGPU(n::Int, nsplits::Int)

  m, h, d = 8, 256, 128

  X = rand(Float32, d, n)
  B = convert(Matrix{Int16}, rand(1:256, m, n))
  C = Vector{Matrix{Float32}}(undef, m)
  for i=1:m; C[i] = rand(Float32, d, h); end

  ilsiters = [4]
  icmiters = 4
  npert = 4
  randord = true

  V = true

  B = Rayuela.encode_icm_cuda(X, B, C, ilsiters, icmiters, npert, randord, nsplits, V)
  B
end

function main(breakit)

  for i = 1:100
    for j = 1:2
      callGPU(100_000, 1)
      GC.gc()
    end
    nsplits = breakit ? 2 : 10
    callGPU(1_000_000, nsplits)
    GC.gc()
  end

end

breakit = true
main(breakit)

Hope this is minimal enough, although I can remove the Rayuela dependency if that is too much.

kristoffer.carlsson · October 6, 2018, 5:22pm

Note that add https://github.com/una-dinosauria/Rayuela.jl.git works just as well.

maleadt · October 8, 2018, 8:30am

Great, I’ll have a look. Probably not before next week due to deadlines.

maleadt · November 13, 2018, 10:54am

Sorry for the delay. Was going to have a look, but the code runs into a CUBLAS error:

julia> using Revise

julia> Revise.includet("src/10946.jl")
[ Info: Recompiling stale cache file /home/tbesard/Julia/depot/compiled/v1.0/Rayuela/4wdef.ji for Rayuela [84bd14ec-51ef-568a-9c69-e494d1752004]

julia> main()
Creating 100000 random states... done in 0.08 seconds
 ILS iteration 1/4 done.  0.00% new codes are equal. 100.00% new codes are better.
 ILS iteration 2/4 done. 80.12% new codes are equal.  8.71% new codes are better.
 ILS iteration 3/4 done. 84.26% new codes are equal.  4.95% new codes are better.
 ILS iteration 4/4 done. 87.63% new codes are equal.  2.54% new codes are better.
 Encoding done in 2.15 seconds
Creating 100000 random states... done in 0.02 seconds
ERROR: CUBLASError(code 14, an internal operation failed)
Stacktrace:
 [1] macro expansion at /home/tbesard/Julia/CuArrays/src/blas/error.jl:45 [inlined]
 [2] gemm!(::Char, ::Char, ::Float32, ::CuArrays.CuArray{Float32,2}, ::CuArrays.CuArray{Float32,2}, ::Float32, ::CuArrays.CuArray{Float32,2}) at /home/tbesard/Julia/CuArrays/src/blas/wrappers.jl:888
 [3] gemm at /home/tbesard/Julia/CuArrays/src/blas/wrappers.jl:903 [inlined]
 [4] encode_icm_cuda_single(::Array{Float32,2}, ::Array{Int16,2}, ::Array{Array{Float32,2},1}, ::Array{Int64,1}, ::Int64, ::Int64, ::Bool, ::Bool) at /home/tbesard/Julia/Rayuela/src/LSQ_GPU.jl:71
 [5] encode_icm_cuda(::Array{Float32,2}, ::Array{Int16,2}, ::Array{Array{Float32,2},1}, ::Array{Int64,1}, ::Int64, ::Int64, ::Bool, ::Int64, ::Bool) at /home/tbesard/Julia/Rayuela/src/LSQ_GPU.jl:231
 [6] main(::Bool) at /home/tbesard/Julia/CuArrays/devel/10946/src/10946.jl:18
 [7] main() at /home/tbesard/Julia/CuArrays/devel/10946/src/10946.jl:23
 [8] top-level scope at none:0

Any ideas?

una-dinosauria · November 13, 2018, 7:28pm

Thanks for looking into this. Currently in crunch time due to CVPR, but will get back ASAP.
