CUDA.jl - Multiple Threads to Initiate Same CUDA Algorithm

Hello all,

Let’s say we have a situation where we run a for loop n times. Every iteration contains a mixture of CUDA kernels and CPU functions, but each iteration is independent of the others.

for index = 1:n
	random_cpu_func(index)
	@cuda threads = 1 random_gpu_kernel() 
	while some_condition
		@cuda threads = 1 random_gpu_kernel(index) 
		random_cpu_func(index)
		x = A \ b # solve a linear system (usage of CUBLAS)
		@cuda threads = 1 random_gpu_kernel(index) 
	end
	random_cpu_func()
end

I want to parallelize this for loop. I could not use dynamic parallelism for this job: one iteration is far too complex a job for a single GPU thread, and n is not a huge number anyway (100 at most). Also, when I try dynamic parallelism, solving the linear system fails with an error saying Julia cannot allocate memory; I suspect solving a linear system is simply not allowed inside a kernel. So I wanted to use multiple CPU threads to accomplish this final step of my program.

Threads.@threads for index = 1:n
	random_cpu_func(index)
	@cuda threads = 1 random_gpu_kernel() 
	while some_condition
		@cuda threads = 1 random_gpu_kernel(index) 
		random_cpu_func(index)
		x = A \ b # solve a linear system (usage of CUBLAS)
		@cuda threads = 1 random_gpu_kernel(index) 
	end
	random_cpu_func()
end

If I try to use Threads.@threads instead, the iterations do not execute concurrently. How can I make sure they execute at the same time? Does CUDA.jl automatically assign a different stream to each iteration? One iteration takes approximately 0.9 seconds, so if they executed concurrently I should see a total time of about 0.9 seconds (or slightly more) instead of n*0.9 seconds.
Should I divide the problem into batches? For example, each batch could execute 12 iterations, because my maximum number of CPU threads is 12. (If I try n = 13, I also get an error.)
Is there an example of this kind of threading? How can I do it? I checked the CUDA.jl tests and documentation but could not get threading to work. If someone can help, I will be so glad. Thank you.

(Note: there is a related topic, but it has no answers and targets an old version of CUDA.jl: Using stream per cpu thread pattern.)

Someone will hopefully take a closer look at your use case, but until then you should probably check out Tasks and threads · CUDA.jl if you haven’t already.
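For reference, the pattern from that documentation page looks roughly like the sketch below. It reuses the placeholder names from your post (`random_gpu_kernel`, `n` are assumed to be defined); CUDA.jl gives each Julia task its own stream, so kernels launched from different tasks can overlap on the GPU:

```julia
using CUDA

# Sketch only: `n` and `random_gpu_kernel` are the placeholders from
# the original post. Each spawned task gets its own CUDA stream.
@sync begin
    for index in 1:n
        Threads.@spawn begin
            @cuda threads=1 random_gpu_kernel(index)
            # Synchronize this task's stream; this also yields to the
            # Julia scheduler so other tasks can make progress.
            CUDA.synchronize()
        end
    end
end
```

Note that for the CPU parts to run in parallel as well, Julia has to be started with multiple threads (e.g. `julia --threads=12`).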

Thank you for your response, but if I write something similar to that (@spawn, @sync, begin/end), Julia behaves strangely and, again, does not execute the iterations concurrently. Sometimes it appears to have finished the loop, yet it keeps executing lines even though the program should be complete. It is so strange. I could not solve this problem, which is why I created this topic.

Yeah, no. Stream concurrency is intended to parallelize coarse-grained operations and keep the GPU busy executing multiple kernels, but many CPU operations (like launching kernels) are still going to take locks and won’t execute in parallel. In general, using this mechanism to parallelize across many launches using a single thread each is very likely going to use your GPU very inefficiently. But more specifically, if you only use Julia tasks there’s not even multiple threads involved, and concurrent execution only happens when synchronizing the GPU (which you aren’t doing here). So at the very least you’ll need multiple tasks, but even then don’t expect this kind of scaling behavior.
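To make that concrete, a hedged sketch of “multiple tasks, multiple threads, with an explicit synchronization point” applied to the loop in question could look like this (`random_cpu_func`, `random_gpu_kernel`, `some_condition`, and `n` are the placeholders from the original post):

```julia
using CUDA

# Sketch only: placeholders from the original post are assumed to exist.
# Start Julia with e.g. `julia --threads=12` so tasks can be scheduled
# on different OS threads.
@sync for index in 1:n
    Threads.@spawn begin
        random_cpu_func(index)  # CPU work can run in parallel across threads
        @cuda threads=1 random_gpu_kernel(index)
        # Without a blocking point a task never yields, so launches end up
        # serialized; synchronizing the task's stream yields control and
        # lets other tasks launch and run their own work.
        CUDA.synchronize()
    end
end
```

Even then, launching many single-thread kernels is an inefficient way to use a GPU, so this pattern alone should not be expected to give an n-fold speedup.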

That’s not very helpful (“weird”, “lies”, “pretends”), and not something I can offer a useful reply to. Take a look at this JuliaCon talk, CUDA.jl 3.0 | Tim Besard | JuliaCon2021 - YouTube, where these concepts are demonstrated.