Multi-threaded calls to CUDA matrix multiplication

I am trying to call CUDA matrix multiplication from a multi-threaded for loop, but I am getting an error (Julia is started with julia -t8). My CPU has 8 cores (16 threads), and my GPU is an NVIDIA GeForce GTX 1650 with 1024 cores and 4 GB of memory.

I ``copy" the matrix a_cpu and then modify as that is required in the main project, of which this is a simplified version.
I set the BLAS thread count to 1, so that each thread can access a different for-loop iteration and each BLAS operation is restricted to its calling thread. This speeds up the task significantly when I do the calculation entirely in the cpu.
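
For reference, the CPU-only version looks roughly like this (a simplified sketch; the per-iteration output buffers are just to keep the example self-contained, not the actual project code):

using LinearAlgebra

nn = 4*50;                                  #matrix size
a_cpu = rand(ComplexF64, nn, nn);
b_cpu = rand(ComplexF64, nn, nn);
nm = 1000;                                  #counter
results = [similar(a_cpu) for _ in 1:nm];   #one output buffer per iteration

nbt = BLAS.get_num_threads();
BLAS.set_num_threads(1)                     #one BLAS thread per Julia thread
Threads.@threads for ab = 1:nm
    aa_cpu = copy(a_cpu);                   #per-iteration copy, as in the main project
    aa_cpu[2:3,4:5] = [1 2; 3 4];           #modify the copy
    mul!(results[ab], aa_cpu, b_cpu);       #CPU matrix multiplication on the calling thread
end
BLAS.set_num_threads(nbt)

The GPU version, which produces the error, is below: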

using LinearAlgebra
using CUDA

nn = 4*50; #matrix size
a_cpu = rand(ComplexF64, nn, nn);
b_cpu = rand(ComplexF64, nn, nn);
c_cpu = similar(a_cpu);

nm = 1000; #counter

nbt = BLAS.get_num_threads();
BLAS.set_num_threads(1)
for bc = nm:-1:1
    println(bc)
    Threads.@threads for ab = 1:bc
        aa_cpu = copy(a_cpu);              #per-iteration copy of the input
        aa_cpu[2:3,4:5] = [1 2; 3 4];      #modify the copy
        aa_gpu = cu(aa_cpu);               #upload to the GPU
        b_gpu = cu(b_cpu);
        c_gpu = aa_gpu*b_gpu;              #GPU matrix multiplication
        c_cpu .= Array(c_gpu);             #download the result
        CUDA.unsafe_free!(aa_gpu);         #free GPU buffers eagerly
        CUDA.unsafe_free!(b_gpu);
        CUDA.unsafe_free!(c_gpu);
    end
end
BLAS.set_num_threads(nbt)

The error is very long, but the last part reads,

ERROR: LoadError: TaskFailedException

    nested task error: Out of GPU memory
    Effective GPU memory usage: 99.93% (3.812 GiB/3.815 GiB)
    Memory pool usage: 5.493 MiB (96.000 MiB reserved)

Surprisingly, the single (non-nested) for-loop works fine, even with a much larger number of iterations.

using LinearAlgebra
using CUDA

nn = 4*50;
a_cpu = rand(ComplexF64, nn, nn);
b_cpu = rand(ComplexF64, nn, nn);
c_cpu = similar(a_cpu);

nm1 = 500000;

nbt = BLAS.get_num_threads();
BLAS.set_num_threads(1)
Threads.@threads for ab = 1:nm1
    aa_cpu = copy(a_cpu);
    aa_cpu[2:3,4:5] = [1 2; 3 4];
    aa_gpu = cu(aa_cpu);
    b_gpu = cu(b_cpu);  
    c_gpu = aa_gpu*b_gpu;
    c_cpu .= Array(c_gpu) 
    CUDA.unsafe_free!(aa_gpu);
    CUDA.unsafe_free!(b_gpu);
    CUDA.unsafe_free!(c_gpu);
end
BLAS.set_num_threads(nbt)

Are you sure you don’t have other processes using GPU memory? Check nvidia-smi.
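
You can also check from inside Julia; a quick sketch (the exact output wording depends on your CUDA.jl version):

using CUDA

CUDA.memory_status()   # prints effective GPU memory usage and memory pool usage
# run(`nvidia-smi`)    # or shell out to nvidia-smi to list other processes using the GPU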


Thanks. I disabled hardware acceleration in my browsers and now it works fine. Side note: the performance improves noticeably with 2 threads, but beyond that there is no further improvement on my system.

Is it possible to use multiple GPU streams to multiply more than one pair of matrices in parallel?

Yes, that’s what happens when using multiple tasks, so you should be getting that already. But why do you expect this to scale arbitrarily? The number of kernels that can execute in parallel is subject to hardware limitations; it’s better to scale up the problem size if possible.
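
To make the task/stream point concrete, here is a minimal sketch (not your code; the sizes and names are placeholders). CUDA.jl gives each Julia task its own stream, so GEMMs launched from different tasks may overlap on the device when resources allow:

using CUDA, LinearAlgebra

n = 2048
as = [CUDA.rand(Float32, n, n) for _ in 1:4]
bs = [CUDA.rand(Float32, n, n) for _ in 1:4]
cs = [CUDA.zeros(Float32, n, n) for _ in 1:4]

# Each spawned task runs on its own CUDA stream.
@sync for i in 1:4
    Threads.@spawn begin
        mul!(cs[i], as[i], bs[i])   # queued on this task's stream (CUBLAS gemm)
        CUDA.synchronize()          # wait for this task's stream to finish
    end
end

Whether these actually run concurrently depends on the GPU: a single large GEMM usually saturates the device on its own, which is why scaling up the problem size tends to help more than adding tasks.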
