Multithreading & GPU memory management

I want to run inference with a Lux model using multiple threads. In each thread I use X = X |> dev to move the data to the GPU and perform the inference there. However, as the number of threads increases, GPU memory is consumed quickly and an out-of-GPU-memory crash occurs.

I tried putting the following line

CUDA.unsafe_free!(X)

after the inference, but without luck.

What is the best practice to perform GPU inference in parallel?

Does CUDA or LuxCUDA provide a function to check available memory, so that I can write a loop that runs the inference only when enough memory is available? Or can I use try...catch...end in a while loop?

The pseudocode is

using LuxCUDA
using Lux

best = load_best_model_from_file()
model = best.model
ps, st = best.ps, Lux.testmode(best.st)
dev = gpu_device()
ps_dev, st_dev = ps |> dev, st |> dev  # move parameters and states to the GPU once
for date in dates  # for multiple days
    records = load_data_by_date(date)
    Threads.@threads for record in records
        X, y = prepare_lux_data(record)
        X_dev = X |> dev
        yp_dev, _ = model(X_dev, ps_dev, st_dev)
        yp = yp_dev |> cpu_device()  # bring the prediction back to the CPU
        # do some processing and save yp
        CUDA.unsafe_free!(X_dev)
        CUDA.unsafe_free!(yp_dev)
    end
    # GPU memory is not freed here and accumulates into the next day, which leads to an out-of-memory crash.
end
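
For the memory-check idea above, something like the following (untested) sketch is what I have in mind; infer_when_free and min_free_bytes are just placeholder names, and the threshold would need tuning:

using Lux, LuxCUDA
using CUDA

# Untested sketch of a memory-gated inference call.
function infer_when_free(model, ps_dev, st_dev, X, dev; min_free_bytes = 2 * 2^30)
    # CUDA.available_memory() reports free memory at the driver level; memory
    # cached by CUDA.jl's pool only shows up again after GC.gc() + CUDA.reclaim().
    while CUDA.available_memory() < min_free_bytes
        GC.gc()
        CUDA.reclaim()
        sleep(0.05)
    end
    X_dev = X |> dev
    yp_dev, _ = model(X_dev, ps_dev, st_dev)
    return yp_dev |> cpu_device()
end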

Sorry to bump, but does no one have any idea about this topic?

Is that because Lux has such a small community?

Not sure if this is a red herring, but have you looked at Nsight?

Also maybe you can check this video:


Does memory keep growing if there is no threading?

It seems so.
If I put CUDA.pool_status() before and after the CUDA.unsafe_free! lines:

[ Info: before unsafe_free!
Effective GPU memory usage: 27.64% (6.537 GiB/23.650 GiB)
Memory pool usage: 4.135 GiB (5.250 GiB reserved)
Memory limit: soft = 23.650 GiB
[ Info: after unsafe_free!
Effective GPU memory usage: 27.64% (6.537 GiB/23.650 GiB)
Memory pool usage: 4.135 GiB (5.250 GiB reserved)
Memory limit: soft = 23.650 GiB

It seems no GPU memory has been freed. I also tried putting

X_dev, yp_dev = nothing, nothing
CUDA.reclaim()

but got the same result.

What should I do to ensure the GPU memory is freed?

I can’t directly answer your question as I don’t know much about Lux, but here are my plain CUDA findings.

TLDR: In a simple test not using Lux, I do not encounter any memory leaks. Taking into account that in multithreading you are naturally asking for more VRAM, the memory consumption looks fine.

  • In a single-threaded test without unsafe_free!
using CUDA
using .Threads

function test()
    s = 0.0f0
    for i = 1:10
        data_cpu = rand(Float32, 250_000_000)  # 1 GB; with data .^ 2 we need 2 GB in one iteration
        data = data_cpu |> cu
        s += sum(data .^ 2)  # s is here to use data, i.e. to make sure it does not get compiled away
        CUDA.pool_status()
    end
    return s
end

test();

I get

Output

Effective GPU memory usage: 36.98% (2.958 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (1.875 GiB reserved)
Effective GPU memory usage: 60.42% (4.833 GiB/7.999 GiB)
Memory pool usage: 3.725 GiB (3.750 GiB reserved)
Effective GPU memory usage: 88.44% (7.075 GiB/7.999 GiB)
Memory pool usage: 5.588 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.47% (7.077 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.67% (7.093 GiB/7.999 GiB)
Memory pool usage: 3.725 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.44% (7.075 GiB/7.999 GiB)
Memory pool usage: 5.588 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.47% (7.077 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.44% (7.075 GiB/7.999 GiB)
Memory pool usage: 3.725 GiB (5.594 GiB reserved)
Effective GPU memory usage: 88.44% (7.075 GiB/7.999 GiB)
Memory pool usage: 5.588 GiB (5.594 GiB reserved)
Effective GPU memory usage: 96.25% (7.699 GiB/7.999 GiB)
Memory pool usage: 3.725 GiB (6.531 GiB reserved)

i.e. memory does not get freed in the memory pool until we run out, at which point it does indeed get freed. This is what you would expect.

  • Adding CUDA.unsafe_free!(data) before CUDA.pool_status() gives me
Output

Effective GPU memory usage: 36.98% (2.958 GiB/7.999 GiB)
Memory pool usage: 953.674 MiB (1.875 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)
Effective GPU memory usage: 48.70% (3.896 GiB/7.999 GiB)
Memory pool usage: 1.863 GiB (2.812 GiB reserved)

So now memory usage remains constant (not almost 0, but this might be due to me not freeing data .^ 2). Therefore, in my simple (non-Lux) test I don’t get any issues when using single-threading.

  • With multithreading, by just using @threads for i = 1:10, with nthreads() == 8, I do run into trouble, also with the unsafe_free! (a full sketch of the threaded test follows after this list):
Output

Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 10.245 GiB (14.906 GiB reserved)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 9.313 GiB (14.906 GiB reserved)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 8.382 GiB (14.906 GiB reserved)
Memory pool usage: 11.176 GiB (14.906 GiB reserved)
Memory pool usage: 9.313 GiB (14.906 GiB reserved)
Memory pool usage: 7.451 GiB (14.906 GiB reserved)
Memory pool usage: 8.382 GiB (14.906 GiB reserved)
Memory pool usage: 12.107 GiB (14.906 GiB reserved)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 10.245 GiB (14.906 GiB reserved)
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 9.313 GiB (14.906 GiB reserved)

(Note that the print order gets garbled due to the threading. Also, the s += line will lead to incorrect results, as we are not working atomically, but this is not important here.)

But this also seems fine, considering we need 16 GB of VRAM at a time (every one of the 8 threads needs 2 GB of VRAM), which obviously does not fit into my 8 GB of video memory.

  • When increasing the number of iterations to @threads for i = 1:100, the typical output near the end is
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 9.313 GiB (15.844 GiB reserved)

which shows we do not in fact have a memory leak. Indeed, at the end we will have requested 200 GB of VRAM, which has been freed in time. The problem remains that we need 16 GB at any one time.

  • (Adding GC.gc() and CUDA.reclaim(), the memory pool usage occasionally drops, but then fills up rapidly again.)
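
For reference, the threaded variant of the test (with the unsafe_free!) is roughly the following sketch; as noted above, the s accumulation is racy and only serves to keep data from being compiled away:

using CUDA
using .Threads

function test_threaded()
    s = 0.0f0
    @threads for i = 1:10
        data_cpu = rand(Float32, 250_000_000)  # ~1 GB per iteration, ~2 GB with data .^ 2
        data = data_cpu |> cu
        s += sum(data .^ 2)  # racy, but only here so data does not get compiled away
        CUDA.unsafe_free!(data)
        CUDA.pool_status()
    end
    return s
end

test_threaded();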

In conclusion, at least on my system and in this simple test not using Lux, I cannot observe any unexpected memory behaviour. So I would suspect the issue indeed lies within Lux itself, or your system just has too many threads for the concurrent GPU allocations.


Also, I’m not an expert by any means, so perhaps this is a silly question, but does it even make sense to combine multithreading and the GPU in this manner? I know GPU streams exist, so you can run multiple kernels concurrently on a GPU, but would it not generally be easier to use multithreading to continuously fill some CPU data queue, which then gets handled sequentially by the GPU?
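
For concreteness, here is a rough, untested sketch of that queue idea using a Channel; run_inference, todo, ready and queue_size are made-up names, while prepare_lux_data, model, ps_dev and st_dev are taken from your pseudocode:

using Lux, LuxCUDA

# CPU threads prepare data in parallel and push it into a bounded Channel;
# a single task drains the Channel and runs the GPU inference sequentially,
# so only one batch sits on the GPU and at most queue_size prepared batches wait on the host.
function run_inference(records, model, ps_dev, st_dev; dev = gpu_device(), queue_size = 8)
    todo  = Channel{Any}(length(records))   # records still to be prepared
    ready = Channel{Any}(queue_size)        # prepared CPU data, bounded to limit host memory
    foreach(r -> put!(todo, r), records)
    close(todo)

    # CPU producers: one worker task per thread (error handling omitted).
    workers = map(1:Threads.nthreads()) do _
        Threads.@spawn for record in todo
            X, _ = prepare_lux_data(record)
            put!(ready, (record, X))        # blocks while `ready` is full
        end
    end
    Threads.@spawn begin
        foreach(wait, workers)
        close(ready)                        # lets the consumer loop below terminate
    end

    # GPU consumer: sequential inference.
    for (record, X) in ready
        X_dev = X |> dev
        yp_dev, _ = model(X_dev, ps_dev, st_dev)
        yp = yp_dev |> cpu_device()
        # ... process and save yp for this record ...
    end
end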

I also don’t know the best practice here. I only tried a most naive way to do the job.

Maybe the better way is to combine all my data using multithreading and perform GPU inference in batches. But the data to be inferred is so large that it cannot fit into memory, or it would take an extremely long time to collect it all. The bad news is that I also don’t know how to use data streaming or similar techniques.

I know that in PyTorch you can use their DataLoader class to easily load data lazily (i.e. without collecting it all) in parallel on the CPU, after which you can feed it sequentially to the GPU. Worst case, you could use PythonCall to call PyTorch’s DataLoader.

But presumably Lux and Flux have a similar mechanism. Specifically, I suspect MLUtils.DataLoader will be useful.

Use MLUtils.DataLoader as @eldee suggested. For a full tutorial on how to do this with Lux, see Lux.jl/examples/ConvMixer/main.jl at main · LuxDL/Lux.jl · GitHub
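
For this thread's use case, a rough, untested sketch of the DataLoader approach could look like the following; RecordDataset is a made-up lazy container, and prepare_lux_data, model, ps_dev and st_dev are the names from the pseudocode above:

using Lux, LuxCUDA, MLUtils

# Made-up lazy data container: one observation = one record, loaded from disk
# only when the DataLoader asks for it.
struct RecordDataset
    records::Vector{String}   # e.g. file paths or record ids
end

MLUtils.numobs(d::RecordDataset) = length(d.records)
function MLUtils.getobs(d::RecordDataset, i::Integer)
    X, _ = prepare_lux_data(d.records[i])  # CPU-side loading/preprocessing
    return (d.records[i], X)               # keep the record id alongside the data
end

dev = gpu_device()
loader = DataLoader(RecordDataset(records);
                    batchsize = -1,   # yield one prepared record at a time
                    parallel  = true) # load on the available CPU threads

for (record, X) in loader             # loading runs on background tasks
    X_dev = X |> dev                  # GPU inference stays sequential
    yp_dev, _ = model(X_dev, ps_dev, st_dev)
    yp = yp_dev |> cpu_device()
    # ... process and save yp for this record ...
end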

I currently use the following try-catch while loop. It seems to work, but it is about 50% slower than the naive version in the cases where that one does not crash:

while true
    try
        X_dev = config_pred.gpu ? (x |> dev) : x
        yp_dev, _ = model(X_dev, ps_dev, st_dev)
        yp_ = config_pred.gpu ? yp_dev |> cpu_device() : yp_dev
        yp[i:j] .= vec(yp_)
        X_dev, yp_dev = nothing, nothing
        break
    catch  # note: a bare catch also swallows errors unrelated to GPU memory
        !config_pred.ignore_warn && @warn "$code: GPU memory unavailable. Retry!"
        GC.gc()                            # free unreachable arrays first ...
        config_pred.gpu && CUDA.reclaim()  # ... then return cached pool memory to the driver
        sleep(0.05)                        # brief back-off before retrying
    end
end

The 50% slowdown is caused by all threads occasionally hitting the GPU memory limit at the same time, which triggers several reclaims inside the catch block.

But currently the 50% slowdown is acceptable (still about 3x faster than the single-threaded approach).