I am trying to call CUDA matrix multiplication from inside a multi-threaded for loop, but I am getting an error (I start Julia with julia -t8). My CPU has 8 cores (16 threads), and my GPU is an NVIDIA GeForce GTX 1650 with 1024 cores and 4 GB of memory.
I ``copy" the matrix a_cpu and then modify as that is required in the main project, of which this is a simplified version.
I set the BLAS thread count to 1 so that each Julia thread works on a different for-loop iteration and each BLAS call stays on its calling thread. This speeds the task up significantly when I do the calculation entirely on the CPU.
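For reference, here is a minimal sketch of the CPU-only version I am comparing against (simplified; the iteration count of 16 is just for illustration, not what the main project uses):

using LinearAlgebra

nn = 4*50;
a_cpu = rand(ComplexF64, nn, nn);
b_cpu = rand(ComplexF64, nn, nn);

nbt = BLAS.get_num_threads();
BLAS.set_num_threads(1)              # one serial BLAS call per Julia thread
Threads.@threads for ab = 1:16       # illustrative iteration count
    aa_cpu = copy(a_cpu);            # copy, then modify, as in the main project
    aa_cpu[2:3, 4:5] = [1 2; 3 4];
    c_cpu = aa_cpu * b_cpu;          # serial BLAS gemm inside this thread
end
BLAS.set_num_threads(nbt)

The GPU version below follows the same pattern, but moves the multiplication onto the device.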
using LinearAlgebra
using CUDA

nn = 4*50;  # matrix size
a_cpu = rand(ComplexF64, nn, nn);
b_cpu = rand(ComplexF64, nn, nn);
c_cpu = zeros(ComplexF64, nn, nn);  # output buffer
nm = 1000;  # counter

nbt = BLAS.get_num_threads();
BLAS.set_num_threads(1)
for bc = nm:-1:1
    println(bc)
    Threads.@threads for ab = 1:bc
        aa_cpu = copy(a_cpu);           # copy, then modify, as in the main project
        aa_cpu[2:3, 4:5] = [1 2; 3 4];
        aa_gpu = cu(aa_cpu);            # upload to the GPU
        b_gpu = cu(b_cpu);
        c_gpu = aa_gpu * b_gpu;         # GPU matrix multiplication
        c_cpu .= Array(c_gpu)           # download the result
        CUDA.unsafe_free!(aa_gpu);      # free the GPU buffers immediately
        CUDA.unsafe_free!(b_gpu);
        CUDA.unsafe_free!(c_gpu);
    end
end
BLAS.set_num_threads(nbt)
The error is very long, but the last part reads:
ERROR: LoadError: TaskFailedException
nested task error: Out of GPU memory
Effective GPU memory usage: 99.93% (3.812 GiB/3.815 GiB)
Memory pool usage: 5.493 MiB (96.000 MiB reserved)