CUDA.jl with @threads causing memory leak?

Hello. First of all, I am using Julia v1.12.6, CUDA v5.11.2, and FLoops v0.2.2. Consider the following code and comments.

using CUDA
using Base.Threads

function func1()
    A = CuArray{Float64}(undef, 1500, 1500, 1000)
    for i in 1:size(A,3)
        A_ = view(A, :, :, i)
    end
end

function func2()
    A = CuArray{Float64}(undef, 1500, 1500, 1000)
    @threads for i in 1:size(A,3)
        A_ = view(A, :, :, i)
    end
    return
end

# GPU almost empty
println("\nA")
CUDA.memory_status()

# do task without threading
func1()

# memory still allocated, but that is okay
println("\nB")
CUDA.memory_status()

# manually clean up
GC.gc()
CUDA.reclaim()

# GPU empty
println("\nC")
CUDA.memory_status()

# do task with threading
func2()

# memory still allocated, but that is okay
println("\nD")
CUDA.memory_status()

# manually clean up
GC.gc()
CUDA.reclaim()

# NOT deallocated! Bug?
println("\nE")
CUDA.memory_status()

When I run this code, I get the following output:

A
Effective GPU memory usage: 3.39% (821.250 MiB/23.643 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)

B
Effective GPU memory usage: 74.37% (17.583 GiB/23.643 GiB)
Memory pool usage: 16.764 GiB (16.781 GiB reserved)

C
Effective GPU memory usage: 3.39% (821.250 MiB/23.643 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)

D
Effective GPU memory usage: 74.39% (17.587 GiB/23.643 GiB)
Memory pool usage: 16.764 GiB (16.781 GiB reserved)

E
Effective GPU memory usage: 74.39% (17.587 GiB/23.643 GiB)
Memory pool usage: 16.764 GiB (16.781 GiB reserved)

Note that memory is not freed between D and E. What is going on?

Explicitly free A:

function func2()
    A = CuArray{Float64}(undef, 100, 100, 100)
    @threads for i in 1:size(A,3)
        A_ = view(A, :, :, i)
    end
    CUDA.unsafe_free!(A)
    return
end

AFAIK, it has something to do with how Julia threads and the garbage collector interact.

Does the multithreaded func2 (without CUDA.unsafe_free! or CUDA.reclaim) actually run out-of-memory if you call it in a loop?
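A quick way to check would be a stress loop like the sketch below (hypothetical; it assumes the original func2, without CUDA.unsafe_free! or CUDA.reclaim, is already defined, and it requires a CUDA-capable GPU). If stale references really pin the memory, an early iteration should throw an OutOfGPUMemoryError; if the allocator frees under pressure, all iterations should succeed:

```julia
# Assumes func2 from the original post is defined above.
for iter in 1:5
    func2()               # each call allocates ~16.8 GiB
    CUDA.memory_status()  # watch whether usage keeps growing or the pool recycles
end
```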

I guess you have seen this comment.

It would also be interesting to see what happens if you call CUDA.reclaim on all threads.

Both of the following “fix” the issue.

# This is the best current workaround.
function func2()
    A = CuArray{Float64}(undef, 1500, 1500, 1000)
    @threads for i in 1:size(A,3)
        A_ = view(A, :, :, i)
        A_ = nothing
    end

    CUDA.unsafe_free!(A)

    return
end

# This also works, though I am not sure whether,
# in general, every thread will be utilized.
function func2()
    A = CuArray{Float64}(undef, 1500, 1500, 1000)
    @threads for i in 1:size(A,3)
        A_ = view(A, :, :, i)
        A_ = nothing
    end

    @threads for _ in 1:nthreads()
         println(threadid())
         GC.gc()
         CUDA.reclaim()
    end
    return
end

However, using JULIA_CUDA_MEMORY_POOL=none does not work. The output is as follows. In particular, note that the effective GPU memory usage is still above 17 GiB.

A
Effective GPU memory usage: 1.66% (401.688 MiB/23.643 GiB)
No memory pool is in use.
B
Effective GPU memory usage: 72.57% (17.158 GiB/23.643 GiB)
No memory pool is in use.
C
Effective GPU memory usage: 1.66% (401.688 MiB/23.643 GiB)
No memory pool is in use.
D
Effective GPU memory usage: 72.57% (17.158 GiB/23.643 GiB)
No memory pool is in use.
E
Effective GPU memory usage: 72.57% (17.158 GiB/23.643 GiB)
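For reference, the pool can only be disabled at startup, since CUDA.jl reads the variable when it initializes; a hypothetical invocation (assuming the test script is saved as script.jl):

```shell
# JULIA_CUDA_MEMORY_POOL must be set before CUDA.jl loads;
# setting it from within a running session has no effect.
JULIA_CUDA_MEMORY_POOL=none julia --threads=auto script.jl
```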

This looks like a bug to me. Do you agree, and if so, where do you suggest I report it?