I am trying to call CUDA matrix multiplication from a multi-threaded for loop, but I am getting an error (Julia is started with `julia -t8`). My CPU has 8 cores (16 threads), and my GPU is an NVIDIA GeForce GTX 1650 with 1024 cores and 4 GB of memory.
I `copy` the matrix `a_cpu` and then modify the copy, as that is required in the main project, of which this is a simplified version.
I set the BLAS thread count to 1 so that each Julia thread can work on a different for-loop iteration and each BLAS operation stays on its calling thread. This speeds up the task significantly when I do the calculation entirely on the CPU.
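For comparison, the CPU-only variant of the inner loop (the case that speeds up nicely for me) looks roughly like this; the loop bound of 8 here is just an illustrative value, not the one from the full code:

```julia
using LinearAlgebra

nn = 4*50
a_cpu = rand(ComplexF64, nn, nn)
b_cpu = rand(ComplexF64, nn, nn)

nbt = BLAS.get_num_threads()
BLAS.set_num_threads(1)        # one BLAS thread per Julia thread, no oversubscription
Threads.@threads for ab = 1:8
    aa_cpu = copy(a_cpu)               # each iteration works on its own copy
    aa_cpu[2:3, 4:5] = [1 2; 3 4]
    c_cpu = aa_cpu * b_cpu             # single-threaded BLAS gemm on this thread
end
BLAS.set_num_threads(nbt)      # restore the original setting
```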
```julia
using LinearAlgebra
using CUDA

nn = 4*50;   # matrix size
a_cpu = rand(ComplexF64, nn, nn);
b_cpu = rand(ComplexF64, nn, nn);
c_cpu = zeros(ComplexF64, nn, nn);
nm = 1000;   # counter

nbt = BLAS.get_num_threads();
BLAS.set_num_threads(1)
for bc = nm:-1:1
    println(bc)
    Threads.@threads for ab = 1:bc
        aa_cpu = copy(a_cpu);
        aa_cpu[2:3, 4:5] = [1 2; 3 4];
        aa_gpu = cu(aa_cpu);
        b_gpu = cu(b_cpu);
        c_gpu = aa_gpu * b_gpu;
        c_cpu .= Array(c_gpu)
        CUDA.unsafe_free!(aa_gpu);
        CUDA.unsafe_free!(b_gpu);
        CUDA.unsafe_free!(c_gpu);
    end
end
BLAS.set_num_threads(nbt)
```
The error is very long, but the last part reads,
```
ERROR: LoadError: TaskFailedException

    nested task error: Out of GPU memory
    Effective GPU memory usage: 99.93% (3.812 GiB/3.815 GiB)
    Memory pool usage: 5.493 MiB (96.000 MiB reserved)
```
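For scale, here is a quick back-of-the-envelope check (my own arithmetic, not part of the error output): each matrix is tiny compared to the 4 GB of device memory, which is why the near-100% usage surprises me.

```julia
nn = 4*50                                  # 200
bytes = nn * nn * sizeof(ComplexF64)       # 200*200*16 bytes per matrix
println(bytes / 2^20)                      # roughly 0.61 MiB per matrix
```

So even with three device arrays per iteration across 8 threads, the live working set should be only a few MiB at any instant.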