I am trying to call CUDA matrix multiplication from inside a multi-threaded for loop, but I am getting an error (I start Julia with julia -t8). My CPU has 8 cores (16 threads), and my GPU is an NVIDIA GeForce GTX 1650 with 1024 cores and 4 GB of memory.
I ``copy" the matrix a_cpu and then modify as that is required in the main project, of which this is a simplified version.
I set the BLAS thread count to 1 so that each Julia thread works on a different for-loop iteration and each BLAS call stays on its calling thread. This speeds the task up significantly when I do the calculation entirely on the CPU.
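For reference, here is a minimal sketch of the CPU-only version I am comparing against (simplified; the iteration count of 16 is just for illustration, not what the main project uses):

using LinearAlgebra

nn = 4*50;
a_cpu = rand(ComplexF64, nn, nn);
b_cpu = rand(ComplexF64, nn, nn);

nbt = BLAS.get_num_threads();
BLAS.set_num_threads(1)              # one serial BLAS call per Julia thread
Threads.@threads for ab = 1:16       # illustrative iteration count
    aa_cpu = copy(a_cpu);            # copy, then modify, as in the main project
    aa_cpu[2:3, 4:5] = [1 2; 3 4];
    c_cpu = aa_cpu * b_cpu;          # serial BLAS gemm inside this thread
end
BLAS.set_num_threads(nbt)

The GPU version below follows the same pattern, but moves the multiplication onto the device.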
using LinearAlgebra
using CUDA

nn = 4*50;  # matrix size
a_cpu = rand(ComplexF64, nn, nn);
b_cpu = rand(ComplexF64, nn, nn);
c_cpu = zeros(ComplexF64, nn, nn);  # output buffer
nm = 1000;  # counter

nbt = BLAS.get_num_threads();
BLAS.set_num_threads(1)
for bc = nm:-1:1
    println(bc)
    Threads.@threads for ab = 1:bc
        aa_cpu = copy(a_cpu);           # copy, then modify, as in the main project
        aa_cpu[2:3, 4:5] = [1 2; 3 4];
        aa_gpu = cu(aa_cpu);            # upload to the GPU
        b_gpu = cu(b_cpu);
        c_gpu = aa_gpu * b_gpu;         # GPU matrix multiplication
        c_cpu .= Array(c_gpu)           # download the result
        CUDA.unsafe_free!(aa_gpu);      # free the GPU buffers immediately
        CUDA.unsafe_free!(b_gpu);
        CUDA.unsafe_free!(c_gpu);
    end
end
BLAS.set_num_threads(nbt)
The error is very long, but the last part reads:
ERROR: LoadError: TaskFailedException
nested task error: Out of GPU memory
Effective GPU memory usage: 99.93% (3.812 GiB/3.815 GiB)
Memory pool usage: 5.493 MiB (96.000 MiB reserved)