CUDA memory isn't freed and cannot be backtracked

Hey Julianners,

I checked the following topics but couldn’t resolve my issue.
https://discourse.julialang.org/t/memory-is-not-freed-with-cuda-and-two-repls/60706
https://cuda.juliagpu.org/stable/usage/memory/
I think I am having the same issue as this one, but I cannot free the cache:
https://discourse.julialang.org/t/significant-cuda-jl-memory-allocations-outside-of-main-pool/59991

I have an 8-hour run from last night, and I cannot get the results back and run one more call to finalize them, because I ran out of memory before that step:

julia> CUDA.memory_status()
Effective GPU memory usage: 99.57% (5.758 GiB/5.783 GiB)
Memory pool usage: 5.951 MiB (5.188 GiB reserved)

Is there a way to reclaim the memory, or to check which “variables” consume this much memory? Some kind of basic backtracking option?

I have tried:

CUDA.reclaim()
JULIA_CUDA_MEMORY_POOL=none # Which is obviously an error, as "none" doesn't exist. But I tried nothing and :none too.
GC.gc(true)

I would be glad to be able to track what is taking up 99% of the memory (or the cache, I guess, in this case), since I release everything with unsafe_free! after usage.
If we had pointers to these addresses, I would gladly call CUDA.unsafe_free!(...) on them too. Is there a chance we could get some memory-management tracking via a list or dict?

This laptop only has a GTX 3060, which I guess is a limiting factor, and that is why I need these manual release methods, if they are possible.

This is an environment variable, so if you want to set it you should do so before starting Julia, i.e. via export JULIA_CUDA_MEMORY_POOL=none (if you’re on Linux).


Thank you for noting that! Yes, I'm on Linux.

Besides being able to turn that off, is there any chance to see the “variables” that take up the space? Isn't that possible somehow? It would be really useful in many situations. Also, if a program gets stuck due to a memory issue, we could then resolve it.

I don't have a solution, but I can confirm that this issue indeed exists. I experienced similar issues with CUDA + RemoteChannels: somehow all the commands to free memory worked on one PC, but did not on the other.

I changed so much in my code that I don't know what I did to fix it.


This is a perfectly normal report: you're only using ~5 MiB of GPU memory, while the underlying pool (which allocations are made from) is currently sized at around 5 GiB, thus consuming most of the physical memory on your device. This does not mean that the memory is unavailable: new allocations are served from that pool, so you can still allocate those 5 GiB minus the few MiB in use. So this isn't indicative of an OOM or a memory leak.
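
To illustrate (a minimal sketch with made-up sizes, not your exact session): new allocations are served from the reserved pool, and you can explicitly shrink the pool by garbage-collecting and then reclaiming:

using CUDA

CUDA.memory_status()             # small pool usage, several GiB reserved
a = CUDA.rand(Float32, 1 << 28)  # ~1 GiB, served from the reserved pool
CUDA.unsafe_free!(a)             # return the buffer to the pool right away
GC.gc(true)                      # collect any remaining unreachable CuArrays
CUDA.reclaim()                   # hand cached pool memory back to the driver
CUDA.memory_status()             # the reserved figure should now have dropped

Reclaiming is rarely needed, though: the pool exists precisely so that buffers can be recycled cheaply by later allocations.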


Thank you for the answer.
I have already cancelled the session, so sadly the run is already lost.

Besides being able to turn that off, is there any chance to see the “variables” that take up the space? Isn't that possible somehow? It would be really useful in many situations. Also, if a program gets stuck due to a memory issue, we could then resolve it.

Is there a chance we could have a feature that shows which variables we can unsafe_free!?

That currently isn’t possible. Keeping track of all allocated objects is expensive, and we’d only be able to report the backtrace to the original allocation site. If that’s useful to you, open an issue, or rather have a look at adding some bookkeeping to https://github.com/JuliaGPU/CUDA.jl/blob/5b34542d704f9b13cefb7b732d2e8bf9cbf9638a/src/pool.jl#L292-L330 – you’d get pretty far by just using a global Dict and putting objects in it together with the current backtrace(). If you only care about CuArray allocations, it might be better to add some bookkeeping to the ArrayStorage – https://github.com/JuliaGPU/CUDA.jl/blob/5b34542d704f9b13cefb7b732d2e8bf9cbf9638a/src/array.jl#L9-L17 – because that knows about arrays that share the same buffer (e.g. with views, reshapes, reinterprets, etc.).
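
For instance, a rough sketch of that Dict-based bookkeeping (TRACKED, track! and report_tracked are made-up names for illustration; real support would hook into pool.jl or the ArrayStorage rather than wrapping allocations by hand):

using CUDA

const TRACKED = WeakKeyDict{Any,Any}()   # weak keys, so tracking doesn't keep arrays alive

function track!(A::CuArray)
    TRACKED[A] = backtrace()             # remember where this array was allocated
    return A
end

function report_tracked()
    for (A, bt) in TRACKED
        println("CuArray of ", sizeof(A), " bytes allocated at:")
        Base.show_backtrace(stdout, bt)
        println()
    end
end

# usage: wrap the allocations you want to watch
A = track!(CUDA.zeros(Float32, 1024, 1024))
report_tracked()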


Indeed. I could modify the array management function to track these.

I have to understand how these functions relate to each other. Thank you for your answer!

Still, I would find it interesting to have a zero-overhead, flag-switchable report on the allocations, even if it is just a report of the backtrace to the original allocation site.
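
For what it's worth, a sketch of how such a report could be made switchable (building on the Dict idea above; TRACK_ALLOCATIONS, ALLOC_SITES and maybe_track! are made-up names), so that the bookkeeping only costs a flag check while disabled:

using CUDA

const TRACK_ALLOCATIONS = Ref(false)        # flip to true to start recording
const ALLOC_SITES = WeakKeyDict{Any,Any}()  # array => backtrace at allocation

function maybe_track!(A::CuArray)
    TRACK_ALLOCATIONS[] && (ALLOC_SITES[A] = backtrace())
    return A
end

# usage:
TRACK_ALLOCATIONS[] = true
A = maybe_track!(CUDA.ones(Float32, 256, 256))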

Then next time others can also report it here, so we can get a better picture of why this happens. :slight_smile:

Also, I think we should only take action on this if the issue surfaces again.
For now, I think we should just take note of it.