I have an 8-hour run from last night, and I cannot retrieve the results or run one more call to finalize them, because I ran out of memory before that step:
Is there a way to reclaim memory, or to check which “variables” consume this much memory? Some kind of basic backtracking option?
I have tried:
CUDA.reclaim()
JULIA_CUDA_MEMORY_POOL=none # Which is obviously an error, as "none" doesn't exist. I also tried nothing and :none.
GC.gc(true)
I would be glad to be able to track what takes up 99% of the memory (or, I guess, the cache in this case), since I release everything with unsafe_free! after use.
If we had pointers to these addresses, I would gladly call CUDA.unsafe_free!(...) on them too. Is there a chance we could get some memory-management tracking via a list or dict?
This laptop only has an RTX 3060, which I guess is the limiting factor, and that is why I need these manual release methods, if they are possible.
This is an environment variable, so if you want to set it you should do so before starting Julia, i.e. via export JULIA_CUDA_MEMORY_POOL=none (if you’re on Linux).
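For completeness, here is a minimal sketch of setting the variable from inside Julia instead of the shell. It assumes that CUDA.jl only reads JULIA_CUDA_MEMORY_POOL when it is loaded or first used, so the assignment has to come before `using CUDA`; if that assumption does not hold for your CUDA.jl version, exporting the variable before starting Julia, as described above, is the safe route.

```julia
# Assumption: CUDA.jl reads this variable when it is loaded/initialized,
# so it must be set before `using CUDA` in a fresh session.
ENV["JULIA_CUDA_MEMORY_POOL"] = "none"

using CUDA

# Allocate and free a small array, then print the memory report to see
# whether freed memory still stays reserved by a pool.
a = CUDA.zeros(Float32, 1024)
CUDA.unsafe_free!(a)
CUDA.memory_status()
```

Disabling the pool may make allocations slower, so this is probably more useful for diagnosing a problem than as a permanent setting.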
Besides being able to turn that off, is there any chance to see the “variables” that take up the space? Isn’t that possible somehow? It would be really useful in many situations. Also, if the program gets stuck due to a memory issue, we could resolve it.
I don’t have a solution, but I can confirm that this issue does exist. I experienced similar issues with CUDA + RemoteChannels; somehow all the commands to free memory worked on one PC but not on the other.
I changed so much in my code that I don’t know what I did to fix it.
This is a perfectly normal report: you’re only using 5MB of GPU memory, while the underlying pool (which allocations are made in) is currently sized around 5GB, thus consuming most of the physical memory on your device. This does not mean that the memory is unavailable: you can still allocate 5GB-5MB. So this isn’t indicative of an OOM or a memory leak.
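The difference between memory that is in use and memory that is merely held by the pool can be observed with a short experiment. This is only a sketch, assuming a working GPU and the default memory pool; the exact numbers printed will differ per device and CUDA.jl version.

```julia
using CUDA

# Allocate about 4 MiB (1024 * 1024 Float32 values) on the GPU.
a = CUDA.rand(Float32, 1024, 1024)
CUDA.memory_status()   # report while the array is alive

# Return the array's memory to the pool. The pool itself may keep the
# memory reserved, so the device can still look "full" from the outside.
CUDA.unsafe_free!(a)
CUDA.memory_status()

# Collect unreachable arrays, then ask the pool to hand cached memory
# back to the driver.
GC.gc(true)
CUDA.reclaim()
CUDA.memory_status()
```

Comparing the three reports should show whether memory is held by live arrays or only cached by the pool.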
Thank you for the answer.
I already cancelled the session, so sadly the run is already lost.
Is there a chance we could have a feature that shows which variables we can unsafe_free!?
Still, I would find it interesting to have a switchable, zero-overhead flag that reports on the allocations, even if it is just a backtrace of the original allocation site.
That way, next time others could report it here too, so we get a better picture of why this happens.
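Until something like that exists, a user-side registry can approximate it. The sketch below is not part of CUDA.jl: Tracked, REGISTRY, track!, untrack! and live_allocations are hypothetical names made up for illustration. It records a label, the size in bytes and the stack trace of the call site for every array passed through track!, and holds the arrays through WeakRef so that the registry itself does not keep them alive. It works on any AbstractArray, so CuArrays could be registered the same way.

```julia
# Hypothetical user-side allocation registry; none of this is CUDA.jl API.
struct Tracked
    label::String
    ref::WeakRef                                  # does not keep the array alive
    nbytes::Int
    site::Vector{Base.StackTraces.StackFrame}     # where track! was called
end

const REGISTRY = Tracked[]

# Register an array under a label and return it unchanged.
function track!(x::AbstractArray, label::AbstractString)
    push!(REGISTRY, Tracked(String(label), WeakRef(x), sizeof(x), stacktrace()))
    return x
end

# Remove an array from the registry, e.g. right after freeing it by hand.
function untrack!(x::AbstractArray)
    filter!(t -> t.ref.value !== x, REGISTRY)
    return nothing
end

# Drop entries whose array was garbage collected; largest survivors first.
function live_allocations()
    filter!(t -> t.ref.value !== nothing, REGISTRY)
    return sort(REGISTRY; by = t -> t.nbytes, rev = true)
end

# Example with plain CPU arrays.
weights = track!(zeros(Float32, 256), "weights")
scratch = track!(zeros(Float32, 16), "scratch")

for t in live_allocations()
    println(rpad(t.label, 12), t.nbytes, " bytes, allocated at ", first(t.site))
end
```

With CuArrays one would call untrack!(a) next to each CUDA.unsafe_free!(a), since a manually freed array object can still be reachable and would otherwise stay listed. The stack trace is captured on every call, so this is far from zero overhead and is meant for debugging sessions only.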