Significant CUDA.jl memory allocations outside of main pool?

I have been benchmarking various CUDA.jl kernels with large CuArrays, and I keep hitting out-of-GPU-memory errors after running the same function repeatedly. CUDA.jl seems to hold onto a significant amount of memory outside of the nominal memory pool, and that memory is not freed by an ordinary GC.
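To make the pattern concrete, a stripped-down stand-in for what I am running looks roughly like this (hypothetical sizes and kernel, not my actual code):

using CUDA

# Hypothetical stand-in for the real benchmark: allocate large CuArrays,
# launch a fused broadcast kernel, and reduce to a host scalar so no
# device arrays escape the function.
function run_once(n)
    a = CUDA.rand(Float32, n)
    b = CUDA.rand(Float32, n)
    c = a .* b .+ 1f0          # broadcast kernel on the GPU
    return sum(c)              # a, b, c become garbage once this returns
end

for _ in 1:100                 # repeated calls are what eventually hit OOM
    run_once(100_000_000)      # roughly 400 MiB per array
end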

Example:

julia> CUDA.memory_status()
Effective GPU memory usage: 42.78% (6.376 GiB/14.903 GiB)
CUDA allocator usage: 2.016 GiB
Memory pool usage: 2.016 GiB (2.016 GiB allocated, 0 bytes cached)

julia> GC.gc(true)

julia> CUDA.memory_status()
Effective GPU memory usage: 39.43% (5.876 GiB/14.903 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)

Checking nvidia-smi, all of this memory is indeed allocated by the Julia process. I first suspected a memory leak, but this memory does appear to be freed occasionally, although I haven’t yet figured out how to reproduce that reliably. In the meantime I am getting out-of-memory errors in the REPL, even though all of my allocations happen within a single function that I am repeatedly editing and re-running, which presumably should not grow the overall memory footprint.

Is this expected behavior? What is this non-pool memory used for? Is there any way to force it to be freed/reclaimed? I know that device_reset!() is broken on current CUDA distributions, but is there another way to reset and free all device memory?
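Concretely: is there anything beyond a full GC plus CUDA.reclaim() (sketch below) that could force this memory back to the driver?

julia> GC.gc(true)       # full collection so unreferenced CuArrays are finalized
julia> CUDA.reclaim()    # return cached pool memory to the driver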

UPDATE: Setting the JULIA_CUDA_MEMORY_POOL environment variable to none appears to free almost all memory on GC. So perhaps I am running into some sort of fragmentation within the binned pool allocator?

with JULIA_CUDA_MEMORY_POOL=none:
julia> CUDA.memory_status()
Effective GPU memory usage: 68.17% (10.159 GiB/14.903 GiB)
CUDA allocator usage: 8.063 GiB
Memory pool usage: 8.063 GiB (8.063 GiB allocated, 0 bytes cached)

julia> GC.gc()

julia> CUDA.memory_status()
Effective GPU memory usage: 0.64% (97.125 MiB/14.903 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)
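In case anyone wants to reproduce this: as far as I understand, the variable has to be set before CUDA.jl initializes, either in the shell before launching Julia or at the very top of the session:

ENV["JULIA_CUDA_MEMORY_POOL"] = "none"   # must be set before `using CUDA` for it to take effect
using CUDA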

Yes, the binned pool has certain overheads. If you can, please upgrade to CUDA 11.2 and CUDA.jl 3.0; the new memory pool there is based on CUDA’s stream-ordered allocator, which performs better (both in terms of pooling memory and by enabling asynchronous memory operations).
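If it helps with the upgrade, you can double-check which driver and toolkit CUDA.jl ends up using with:

julia> CUDA.versioninfo()   # prints driver/toolkit versions and the detected devices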


I am now running CUDA.jl 3.12.0 and CUDA 11.7, but unfortunately I am still wrestling with this issue, so I figured I would resurrect this old question rather than start a new thread. After running my CUDA.jl code multiple times (either from the REPL or within a loop inside a function), GPU memory seems to accumulate without being freed. This is something we have always fought with but were able to work around; however, it has now become a blocker, since I am trying to perform some batch processing that regularly aborts with OOM errors and must be manually restarted. I have tried various combinations of reclaim(), GC.gc(), and device_reset!() to no avail:

julia> GC.gc(true)
julia> CUDA.reclaim()
julia> CUDA.device_reset!()
julia> CUDA.memory_status()
Effective GPU memory usage: 72.42% (34.429 GiB/47.544 GiB)
Memory pool usage: 33.151 GiB (34.156 GiB reserved)  

Unlike the scenario described in my original post above, the memory is not simply being reserved by the pool; it appears to still be allocated. However, all of my CUDA code is encapsulated behind several functions that have all returned, and as best as I can tell there are no CUDA.jl objects or data in scope (a simplified sketch of the batch structure follows the varinfo() output below).

julia> varinfo()
  name                           size summary                                                  
  ––––––––––––––––––––––– ––––––––––– –––––––––––––––––––––––––––––––––––––––––––––––––––––––––
  Base                                Module                                                   
  Core                                Module                                                   
  InteractiveUtils        255.544 KiB Module                                                   
  Main                                Module                                                   
  ans                         0 bytes Nothing                                                  
  batchProcess                0 bytes batchProcess (generic function with 1 method)            
  postProcess                 0 bytes postProcess (generic function with 1 method)             
  sortFiles                   0 bytes sortFiles (generic function with 1 method)    
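For context, the batch driver has roughly this shape (a heavily simplified, hypothetical sketch: batchProcess is the real entry point shown by varinfo() above, but the helper names, sizes, and kernels here are made up):

using CUDA

# Hypothetical, simplified shape of the batch driver. Each iteration allocates
# large CuArrays, processes them, and lets everything go out of scope, so I
# would expect GC plus CUDA.reclaim() to be able to release all of it.
function batchProcess(files)
    for f in files
        data   = CUDA.rand(Float32, 10^8)    # stand-in for uploading one batch
        result = sum(abs2, data)              # stand-in for the real kernels
        CUDA.unsafe_free!(data)               # explicit cleanup between batches...
        GC.gc(true)                           # ...which has not helped so far
        CUDA.reclaim()
    end
    return nothing
end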

If it matters, my code is multi-threaded and multi-stream, although again all of these threads have terminated. I did see this, which might be relevant; however, my threads that use CUDA.jl return nothing.
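The structure of the threaded part looks roughly like this (again a hypothetical sketch, not the real postProcess; write_result is a made-up placeholder, and as I understand it CUDA.jl gives each task its own stream):

using CUDA

# Hypothetical sketch of the multi-threaded / multi-stream structure.
# Each task works on its own CuArrays and returns nothing, so no device
# data should be kept alive by the task results.
function postProcess(chunks)
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            d   = CuArray(chunk)          # upload this chunk to the GPU
            out = Array(d .* 2f0)         # compute, then copy back to the host
            write_result(out)             # hypothetical: persist to disk
            nothing                       # the task returns nothing
        end
    end
    foreach(wait, tasks)                  # all tasks have terminated at this point
    return nothing
end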