Let’s say we have a situation where we run a for loop “n” times. Every iteration contains a mixture of CUDA kernels and CPU functions. However, the iterations are independent of each other.
    for index = 1:n
        random_cpu_func(index)
        @cuda threads = 1 random_gpu_kernel()
        while some_condition
            @cuda threads = 1 random_gpu_kernel(index)
            random_cpu_func(index)
            x = A/b # usage of CUBLAS
            @cuda threads = 1 random_gpu_kernel(index)
        end
        random_cpu_func()
    end
I want to parallelize this for loop. I could not use dynamic parallelism for this job, because one iteration is a pretty complex job for a single GPU thread. The “n” parameter is not a huge number; it is at most 100. Also, when I try to use dynamic parallelism, I get an error related to solving the linear system: Julia says that it cannot allocate memory. I think solving a linear system is not allowed inside a kernel. So I wanted to use multiple CPU threads to accomplish this final step of my program.
    Threads.@threads for index = 1:n
        random_cpu_func(index)
        @cuda threads = 1 random_gpu_kernel()
        while some_condition
            @cuda threads = 1 random_gpu_kernel(index)
            random_cpu_func(index)
            x = A/b # usage of CUBLAS
            @cuda threads = 1 random_gpu_kernel(index)
        end
        random_cpu_func()
    end
If I use Threads.@threads like this, the iterations do not execute concurrently. How can I make sure they execute at the same time? Does CUDA.jl automatically use a different stream for each iteration? One iteration takes approximately 0.9 seconds. So if they executed concurrently, the total time should be about 0.9 seconds (or slightly more) instead of n*0.9 seconds.
Should I divide the problem into batches? For example, each batch would execute 12 iterations, because my maximum number of threads is 12. If I use “n=13”, I also get an error.
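To make the batching idea concrete, here is a minimal sketch in plain Julia with no GPU calls at all. The names `process_iteration` and `run_in_batches` are hypothetical, made up for illustration; `process_iteration` is only a stand-in for one independent iteration of the loop above. I do not know whether this is the recommended pattern with CUDA.jl, or whether it avoids the error I see.

```julia
# Hypothetical stand-in for one independent iteration of the loop
# (in the real program this would mix CPU functions and GPU kernels).
process_iteration(index) = index^2

# Split 1:n into batches and run the iterations of each batch as
# separate tasks, waiting for a batch to finish before starting the next.
function run_in_batches(n; batch_size = Threads.nthreads())
    results = Vector{Int}(undef, n)
    for batch in Iterators.partition(1:n, batch_size)
        @sync for index in batch
            Threads.@spawn begin
                results[index] = process_iteration(index)
            end
        end
    end
    return results
end

# n = 13 with batches of 12 gives two batches: 1:12 and 13:13.
results = run_in_batches(13; batch_size = 12)
```

Each task writes to its own slot of `results`, so the tasks do not need a lock. Julia has to be started with several threads (for example `julia -t 12`) for the tasks to actually run in parallel.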
Is there an example of threading? How can I do this? I checked the CUDA.jl tests and documentation but could not get threading to work. If someone can help, I will be very glad. Thank you.
(Note: There is a related topic, but it has no answers and is for an old version of CUDA.jl. The topic: Using stream per cpu thread pattern)