CUDA memory isn't freed and cannot be backtracked

Marcell_Havlik · June 24, 2022, 7:18am

Hey Julianners,

I checked the following topics but couldn’t resolve my issue.
https://discourse.julialang.org/t/memory-is-not-freed-with-cuda-and-two-repls/60706
https://cuda.juliagpu.org/stable/usage/memory/
I think I am having the same issue as this one, but I cannot free the cache:
https://discourse.julialang.org/t/significant-cuda-jl-memory-allocations-outside-of-main-pool/59991

I have an 8-hour run from last night, and I cannot get the results back or run another call to finalize them, because I ran out of memory before that step:

julia> CUDA.memory_status()
Effective GPU memory usage: 99.57% (5.758 GiB/5.783 GiB)
Memory pool usage: 5.951 MiB (5.188 GiB reserved)

Is there a way to reclaim memory or check which “variables” consume this much memory? Some kind of basic backtracking option?

I have tried:

CUDA.reclaim()
JULIA_CUDA_MEMORY_POOL=none # Which is obviously an error, as "none" doesn't exist. I also tried nothing and :none.
GC.gc(true)

I would be glad to be able to track what takes up 99% of the memory (or, I guess, the cache in this case), as I release everything with unsafe_free! after use.
If we had pointers to these addresses, I would gladly call CUDA.unsafe_free!(...) on them too. Is there a chance we could get some memory-management tracking via a list or dict?

This laptop has only an RTX 3060, which I guess is the limiting factor, and that is why I need these manual release methods, if they are possible.

carstenbauer · June 24, 2022, 7:33am

This is an environment variable, so if you want to set it you should do so before starting Julia, i.e. via export JULIA_CUDA_MEMORY_POOL=none (if you’re on Linux).
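If you would rather stay inside a Julia session, one option is to launch the workload as a child process that has the variable set from startup. The following is only a sketch: `run_without_pool` is a made-up helper name, and the child command here just echoes the variable back so that it runs without a GPU.

```julia
# Run a command in a child process with the CUDA.jl memory pool disabled.
# `withenv` sets the variable only for the duration of the `do` block; the
# child inherits it, so it is already in place when the child starts.
# `run_without_pool` is a hypothetical helper, not part of CUDA.jl.
function run_without_pool(cmd::Cmd)
    withenv("JULIA_CUDA_MEMORY_POOL" => "none") do
        read(cmd, String)
    end
end

# Demonstration that needs no GPU: the child prints the variable it sees.
child = `$(Base.julia_cmd()) --startup-file=no -e 'print(get(ENV, "JULIA_CUDA_MEMORY_POOL", "unset"))'`
println(run_without_pool(child))
```

In real use the child command would run your own script instead of the echo one-liner.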

Marcell_Havlik · June 24, 2022, 11:08am

Thank you for noting! Linux yes.

Besides being able to turn that off, is there any chance to see the “variables” that take up the space? Isn’t that possible somehow? It would be really useful in many situations. Also, if the program gets stuck due to a memory issue, we could resolve it.

Noel_Araujo · June 25, 2022, 12:11pm

I don’t have a solution, but I can confirm that this issue indeed exists. I experienced similar issues with CUDA + RemoteChannels: somehow all the commands to free memory worked on one PC, but not on the other.

I changed so much in my code that I don’t know what I did to fix it.

maleadt · June 26, 2022, 11:27am

This is a perfectly normal report: you’re only using 5MB of GPU memory, while the underlying pool (which allocations are made in) is currently sized around 5GB, thus consuming most of the physical memory on your device. This does not mean that the memory is unavailable: you can still allocate 5GB-5MB. So this isn’t indicative of an OOM, or a memory leak.
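To make the arithmetic concrete, here is a small sketch that uses only the figures from the CUDA.memory_status() output quoted earlier in the thread. It does not query the GPU, and the variable names are illustrative.

```julia
# Figures copied from the CUDA.memory_status() output in the first post.
MiB = 1024.0^2
GiB = 1024.0^3

pool_used     = 5.951 * MiB   # memory held by live allocations
pool_reserved = 5.188 * GiB   # memory the pool has claimed from the driver
device_total  = 5.783 * GiB   # physical memory reported for the device

# Reserved-but-unused pool memory can still be handed out to new allocations.
pool_free = pool_reserved - pool_used

println("free inside the pool: ", round(pool_free / GiB; digits = 3), " GiB")
println("share of the device held by live allocations: ",
        round(100 * pool_used / device_total; digits = 2), " %")
```

The "99.57%" line in the report therefore reflects how much the pool has reserved, not how much the program is holding.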

Marcell_Havlik · June 26, 2022, 12:52pm

Thank you for the answer.
I already cancelled the session, so sadly the run is already lost.

Besides being able to turn that off, is there any chance to see the “variables” that take up the space? Isn’t that possible somehow? It would be really useful in many situations. Also, if the program gets stuck due to a memory issue, we could resolve it.

Is there a chance we could have a feature that shows which variables we can unsafe_free!?

maleadt · June 27, 2022, 7:38am

That currently isn’t possible. Keeping track of all allocated objects is expensive, and we’d only be able to report the backtrace to the original allocation site. If that’s useful to you, open an issue, or rather have a look at adding some bookkeeping to https://github.com/JuliaGPU/CUDA.jl/blob/5b34542d704f9b13cefb7b732d2e8bf9cbf9638a/src/pool.jl#L292-L330= – you’d get pretty far by just using a global Dict and putting objects in it together with the current backtrace(). If you only care about CuArray allocations, it might be better to add some bookkeeping to the ArrayStorage – https://github.com/JuliaGPU/CUDA.jl/blob/5b34542d704f9b13cefb7b732d2e8bf9cbf9638a/src/array.jl#L9-L17= – because that knows about arrays that share the same buffer (e.g. with views, reshapes, reinterprets, etc).
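The "global Dict plus backtrace()" idea could look roughly like the sketch below. It is plain Julia and does not touch CUDA.jl internals: `AllocRecord`, `LIVE_ALLOCS`, `track_alloc!`, `track_free!` and `report_live` are all hypothetical names, and the keys are made-up addresses standing in for whatever identifies a buffer in pool.jl.

```julia
# Hypothetical bookkeeping layer; none of these names exist in CUDA.jl.
struct AllocRecord
    bytes::Int
    trace::Vector   # raw backtrace(), symbolized only when reporting
end

const LIVE_ALLOCS = Dict{UInt,AllocRecord}()
const LIVE_ALLOCS_LOCK = ReentrantLock()

# Call right after an allocation succeeds. `key` stands in for whatever
# uniquely identifies the buffer (for example its address).
function track_alloc!(key::UInt, bytes::Integer)
    record = AllocRecord(bytes, backtrace())
    lock(LIVE_ALLOCS_LOCK) do
        LIVE_ALLOCS[key] = record
    end
    return key
end

# Call when the buffer is freed.
function track_free!(key::UInt)
    lock(LIVE_ALLOCS_LOCK) do
        delete!(LIVE_ALLOCS, key)
    end
    return nothing
end

# Print live allocations, largest first, with the first few frames of each
# allocation site. Returns the number of live records.
function report_live(io::IO = stdout; frames::Int = 3)
    records = lock(LIVE_ALLOCS_LOCK) do
        sort!(collect(values(LIVE_ALLOCS)); by = r -> r.bytes, rev = true)
    end
    for r in records
        println(io, r.bytes, " bytes allocated at:")
        for frame in Iterators.take(stacktrace(r.trace), frames)
            println(io, "    ", frame)
        end
    end
    return length(records)
end

# Stand-in usage without a GPU: keys are made-up addresses.
track_alloc!(UInt(0x1000), 4096)
track_alloc!(UInt(0x2000), 1 << 20)
track_free!(UInt(0x1000))
report_live()
```

Capturing a backtrace on every allocation is not free, which matches the cost concern above, so a real version would likely sit behind a debug switch.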

Marcell_Havlik · June 27, 2022, 9:05am

Indeed. I could modify the array management functions to track these.

I first have to understand how these functions refer to each other. Thank you for your answer!

Sixzero · August 29, 2022, 11:01am

Still, I would find it interesting to have a zero-overhead, flag-switchable report on allocations, even if it only reports the backtrace of the original allocation site.

That way, next time others can report it here too, and we can get a better picture of why this happens.

Sixzero · August 29, 2022, 11:03am

Also, I think we should act on this only if the issue surfaces again.
For now, let’s just take note of it.
