Significant CUDA.jl memory allocations outside of main pool?

I have been benchmarking various CUDA.jl kernels with large CuArrays, and I keep hitting out-of-GPU-memory errors after running the same function repeatedly. CUDA.jl seems to hold onto a significant amount of memory outside of the nominal memory pool, and that memory is not freed by a standard GC.
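Roughly, the pattern looks like this; a simplified, hypothetical stand-in for my actual kernels and array sizes:

using CUDA

# Hypothetical stand-in kernel: scales one large array into another.
function scale_kernel!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = 2f0 * x[i]
    end
    return nothing
end

function bench()
    x = CUDA.rand(Float32, 256_000_000)   # roughly 1 GiB per array
    y = similar(x)
    threads = 256
    blocks = cld(length(x), threads)
    @cuda threads=threads blocks=blocks scale_kernel!(y, x)
    synchronize()
    return nothing
end

bench()   # redefining and re-running this repeatedly eventually hits OOM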

Example:

julia> CUDA.memory_status()
Effective GPU memory usage: 42.78% (6.376 GiB/14.903 GiB)
CUDA allocator usage: 2.016 GiB
Memory pool usage: 2.016 GiB (2.016 GiB allocated, 0 bytes cached)

julia> GC.gc(true)

julia> CUDA.memory_status()
Effective GPU memory usage: 39.43% (5.876 GiB/14.903 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)

Checking nvidia-smi confirms that all of this memory is indeed allocated by the Julia process. I first suspected a memory leak, but the memory does occasionally get freed, although I haven’t yet figured out how to reproduce that reliably. In the meantime I am getting out-of-memory errors in the REPL, even though all of my allocations happen inside a single function that I am repeatedly redefining and running, which presumably should not grow the overall memory footprint.

Is this expected behavior? What is this non-pool memory used for? Is there any way to force it to be freed/reclaimed? I know that device_reset!() is broken on current CUDA distributions, but is there another way to reset and free all device memory?
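So far the only thing I know to try is a full GC followed by CUDA.reclaim(), which I assume is the call for returning cached pool memory to the driver, along these lines:

GC.gc(true)          # collect unreachable CuArrays on the Julia side
CUDA.reclaim()       # ask CUDA.jl to hand cached/pooled memory back to the driver
CUDA.memory_status() # check what is still held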

UPDATE: Setting the JULIA_CUDA_MEMORY_POOL environment variable to none appears to free almost all memory on GC. So perhaps I am running into some sort of fragmentation within the binned pool allocator?
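(For anyone trying the same thing: I set the variable before starting Julia, since as far as I can tell the pool is selected when CUDA.jl initializes.)

# Must be set before `using CUDA`; the pool appears to be chosen at initialization.
ENV["JULIA_CUDA_MEMORY_POOL"] = "none"
using CUDA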

With JULIA_CUDA_MEMORY_POOL=none:
julia> CUDA.memory_status()
Effective GPU memory usage: 68.17% (10.159 GiB/14.903 GiB)
CUDA allocator usage: 8.063 GiB
Memory pool usage: 8.063 GiB (8.063 GiB allocated, 0 bytes cached)

julia> GC.gc()

julia> CUDA.memory_status()
Effective GPU memory usage: 0.64% (97.125 MiB/14.903 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)

Yes, the binned pool has certain overheads. If you can, please upgrade to CUDA 11.2 and CUDA.jl 3.0: the new memory pool there is based on CUDA’s stream-ordered allocator, which performs better (both in how it pools memory and by enabling asynchronous memory operations).
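After upgrading, something like the following should confirm what you end up with; if I recall correctly the stream-ordered pool is selected by default when the driver supports it, and can be requested explicitly via the same environment variable:

using CUDA

CUDA.versioninfo()       # should report CUDA 11.2+ and CUDA.jl 3.0+
CUDA.memory_status()     # pool usage as reported by the new allocator

# If needed, the stream-ordered pool can be requested explicitly
# (pool name assumed to be "cuda"; set before `using CUDA`):
#   ENV["JULIA_CUDA_MEMORY_POOL"] = "cuda"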