I can’t directly answer your question as I don’t know much about Lux, but here are my plain CUDA findings.
TL;DR: in a simple test without Lux, I do not encounter any memory leaks. Once you take into account that multithreading naturally requests more VRAM at once, the memory consumption looks fine.
- In a single-threaded test without `unsafe_free!`:
```julia
using CUDA
using Base.Threads

function test()
    s = 0.0f0
    for i = 1:10
        data_cpu = rand(Float32, 250_000_000)  # 1 GB; with data .^ 2 we need 2 GB per iteration
        data = data_cpu |> cu
        s += sum(data .^ 2)  # accumulate into s to make sure the computation is not compiled away
        CUDA.pool_status()
    end
    return s
end

test();
```
I get:

```
Effective GPU memory usage: 36.98% (2.958 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (1.875 GiB reserved)
Effective GPU memory usage: 60.42% (4.833 GiB/7.999 GiB)
Memory pool usage: 3.725 GiB (3.750 GiB reserved)
Effective GPU memory usage: 88.44% (7.075 GiB/7.999 GiB)
Memory pool usage: 5.588 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.47% (7.077 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.67% (7.093 GiB/7.999 GiB)
Memory pool usage: 3.725 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.44% (7.075 GiB/7.999 GiB)
Memory pool usage: 5.588 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.47% (7.077 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.44% (7.075 GiB/7.999 GiB)
Memory pool usage: 3.725 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.44% (7.075 GiB/7.999 GiB)
Memory pool usage: 5.588 GiB (5.594 GiB reserved)
Effective GPU memory usage: 96.25% (7.699 GiB/7.999 GiB)
Memory pool usage: 3.725 GiB (6.531 GiB reserved)
```
That is, memory is not freed from the pool until we run out of VRAM, at which point it does indeed get freed. This is what you would expect from a caching memory pool.
- Adding `CUDA.unsafe_free!(data)` just before the `CUDA.pool_status()` call gives me the output below.
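Concretely, each loop iteration then becomes:

```julia
data_cpu = rand(Float32, 250_000_000)
data = data_cpu |> cu
s += sum(data .^ 2)
CUDA.unsafe_free!(data)  # hand the 1 GB buffer back to the pool immediately
CUDA.pool_status()
```

With this change the output is: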
```
Effective GPU memory usage: 36.98% (2.958 GiB/7.999 GiB)
Memory pool usage: 953.674 MiB (1.875 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
```
So now the memory usage stays constant (not near zero, but that is presumably because I do not free the `data .^ 2` intermediate; see the sketch below). In my simple (non-Lux) test I therefore see no issues when using a single thread.
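A possible refinement (my assumption, not something I tested above): materialise the broadcast result so it can be freed as well, which should bring the pool usage between iterations close to zero:

```julia
data2 = data .^ 2          # materialise the intermediate so we can free it
s += sum(data2)
CUDA.unsafe_free!(data)
CUDA.unsafe_free!(data2)   # otherwise ~1 GB per iteration lingers in the pool
```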
- With multithreading, obtained by just changing the loop to `@threads for i = 1:10` (with `nthreads() == 8`), I do run into trouble, even with the `unsafe_free!` in place; see the sketch and output below.
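For reference, this is the multithreaded variant; the body is unchanged apart from the `@threads` macro (`test_mt` is just my name for it):

```julia
using CUDA
using Base.Threads

function test_mt()
    s = 0.0f0
    @threads for i = 1:10
        data_cpu = rand(Float32, 250_000_000)
        data = data_cpu |> cu
        s += sum(data .^ 2)  # racy, see the note after the output
        CUDA.unsafe_free!(data)
        CUDA.pool_status()
    end
    return s
end

test_mt();
```

The output then looks like this: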
```
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 10.245 GiB (14.906 GiB reserved)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 9.313 GiB (14.906 GiB reserved)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 8.382 GiB (14.906 GiB reserved)
Memory pool usage: 11.176 GiB (14.906 GiB reserved)
Memory pool usage: 9.313 GiB (14.906 GiB reserved)
Memory pool usage: 7.451 GiB (14.906 GiB reserved)
Memory pool usage: 8.382 GiB (14.906 GiB reserved)
Memory pool usage: 12.107 GiB (14.906 GiB reserved)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 10.245 GiB (14.906 GiB reserved)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 9.313 GiB (14.906 GiB reserved)
```
(Note that the print order gets garbled by the threading. Also, the `s +=` line gives incorrect results because the accumulation is not atomic, but that does not matter here.)
But this also seems fine, considering that we need 16 GB of VRAM at a time (each of the 8 threads needs 2 GB), which obviously does not fit into my 8 GB of video memory.
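If one wanted to keep the CPU-side work multithreaded while bounding VRAM, a semaphore capping how many threads hold GPU buffers at once could work. This is a hypothetical sketch, not part of the test above (`test_capped` and `max_live` are names I made up):

```julia
using CUDA
using Base.Threads

function test_capped(; max_live = 3)    # at most max_live * 2 GB of VRAM in flight
    sem = Base.Semaphore(max_live)
    s = Atomic{Float32}(0.0f0)          # atomic accumulator avoids the data race
    @threads for i = 1:10
        data_cpu = rand(Float32, 250_000_000)  # the CPU work stays fully parallel
        Base.acquire(sem)
        try
            data = data_cpu |> cu
            atomic_add!(s, sum(data .^ 2))
            CUDA.unsafe_free!(data)
        finally
            Base.release(sem)
        end
    end
    return s[]
end
```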
- When increasing the number of iterations to `@threads for i = 1:100`, the typical output near the end is
```
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 9.313 GiB (15.844 GiB reserved)
```
which shows that we do not in fact have a memory leak: over the whole run we request 200 GB of VRAM in total, all of which gets freed in time. The problem remains that we need 16 GB at any one moment.
- (Adding `GC.gc()` and `CUDA.reclaim()`, the memory pool usage occasionally drops, but then rapidly fills up again.)
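(For completeness, these were the two calls, placed at the end of each iteration:)

```julia
GC.gc()         # collect unreachable CuArrays on the Julia side
CUDA.reclaim()  # hand cached, unused pool memory back to the driver
```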
In conclusion, at least on my system, in this simple test without Lux I cannot observe any unexpected memory behaviour. So I would suspect the issue indeed lies within Lux itself, or your system simply runs too many threads for all the concurrent GPU allocations to fit.
Also, I’m not an expert by any means, so perhaps this is a silly question, but does it even make sense to combine multithreading and the GPU in this manner? I know GPU streams exist, so you can run multiple kernels concurrently on a GPU, but would it in general not be easier to use multithreading to continuously fill some CPU-side data queue, which then gets handled sequentially by the GPU? A rough sketch of what I mean follows.
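This is a rough, untested sketch (function name and channel capacity are my choice; the array size and iteration count are taken from the toy example above). The threads only produce CPU arrays, and a single loop moves them to the GPU one at a time, so at most one ~2 GB working set is live in VRAM:

```julia
using CUDA
using Base.Threads

function queued_test(n = 10)
    # Producers fill the queue in parallel; the channel closes once they are done.
    queue = Channel{Vector{Float32}}(2) do ch
        @sync for i = 1:n
            Threads.@spawn put!(ch, rand(Float32, 250_000_000))
        end
    end
    s = 0.0f0
    for data_cpu in queue        # the GPU side stays strictly sequential
        data = data_cpu |> cu
        s += sum(data .^ 2)
        CUDA.unsafe_free!(data)
    end
    return s
end
```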