CUDA.jl - Multiple Threads to Initiate Same CUDA Algorithm

moukann · April 20, 2022, 9:00am

Hello all,

Let’s say we have a situation where we run a for loop “n” times. Each iteration contains a mixture of CUDA kernels and CPU functions. However, the iterations are independent of each other.

for index = 1:n
	random_cpu_func(index)
	@cuda threads = 1 random_gpu_kernel() 
	while some_condition
		@cuda threads = 1 random_gpu_kernel(index) 
		random_cpu_func(index)
		x = A \ b # usage of CUBLAS
		@cuda threads = 1 random_gpu_kernel(index) 
	end
	random_cpu_func()
end

I want to parallelize this for loop. I could not use dynamic parallelism for this job: one iteration is too complex a job for a single GPU thread. The “n” parameter is not a huge number; the maximum is 100. Also, if I try to use dynamic parallelism, I get an error related to solving a linear system: Julia says that it cannot allocate memory. I think solving a linear system is forbidden inside a kernel. So I wanted to use multiple CPU threads to accomplish this final step of my program.

Threads.@threads for index = 1:n
	random_cpu_func(index)
	@cuda threads = 1 random_gpu_kernel() 
	while some_condition
		@cuda threads = 1 random_gpu_kernel(index) 
		random_cpu_func(index)
		x = A \ b # usage of CUBLAS
		@cuda threads = 1 random_gpu_kernel(index) 
	end
	random_cpu_func()
end

If I use Threads.@threads instead, the iterations do not execute concurrently. How can I make sure they execute at the same time? Does CUDA.jl automatically use different streams for different iterations? One iteration takes approximately 0.9 seconds, so if they executed concurrently, the total time should be 0.9 seconds (or slightly more) instead of n*0.9 seconds.
Should I divide the problem into batches? For example, each batch would execute 12 iterations, because my maximum number of threads is 12. If I use “n=13”, I also get an error.
Is there an example of threading? How can I do this? I checked the CUDA.jl tests and documentation but could not get threading to work. If someone can help, I would be very glad. Thank you.

(Note: There is a related topic, but it has no answers and concerns an old version of CUDA.jl: “Using stream per cpu thread pattern”.)

carstenbauer · April 20, 2022, 9:38am

Someone will hopefully take a closer look at your use case, but until then you should probably check out the “Tasks and threads” page of the CUDA.jl documentation, if you haven’t already.

moukann · April 20, 2022, 9:49am

Thank you for your response, but if I write something similar to this (@spawn, @sync, begin and end), Julia acts weird and, again, does not execute them concurrently. Sometimes it lies to me: it pretends it has finished the loop, but it continues to execute lines even though the program is complete. It is so strange. I could not solve this problem, so I created this topic.

maleadt · April 26, 2022, 7:58am

Yeah, no. Stream concurrency is intended to parallelize coarse-grained operations and keep the GPU busy executing multiple kernels, but many CPU operations (like launching kernels) are still going to take locks and won’t execute in parallel. In general, using this mechanism to parallelize across many launches using a single thread each is very likely going to use your GPU very inefficiently. But more specifically, if you only use Julia tasks there’s not even multiple threads involved, and concurrent execution only happens when synchronizing the GPU (which you aren’t doing here). So at the very least you’ll need multiple tasks, but even then don’t expect this kind of scaling behavior.
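As a rough illustration of the task-based pattern described above, here is a minimal sketch. It assumes Julia was started with several threads (e.g. `julia -t 4`) and that a CUDA-capable GPU is available. The names `scale_kernel!`, `run_all`, and `len` are hypothetical stand-ins, not part of the original code. Each spawned task uses its own task-local stream in CUDA.jl, and the explicit `CUDA.synchronize()` is the point where a task waits on its stream and other tasks get a chance to run. The kernel is also launched with many threads rather than one, since single-thread launches leave most of the GPU idle.

```julia
using CUDA

# Hypothetical stand-in for random_gpu_kernel: scales a vector in place.
function scale_kernel!(y, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] *= a
    end
    return nothing
end

# Runs n independent iterations, one Julia task per iteration.
function run_all(n; len = 1024)
    results = Vector{Float32}(undef, n)
    @sync for index in 1:n
        Threads.@spawn begin
            # Allocations and launches go to this task's own stream.
            y = CUDA.ones(Float32, len)
            @cuda threads = 256 blocks = cld(len, 256) scale_kernel!(y, Float32(index))
            # Wait for this task's stream only; other tasks may proceed meanwhile.
            CUDA.synchronize()
            # Each task writes to a distinct slot, so no locking is needed here.
            results[index] = sum(y) / len
        end
    end
    return results
end

results = run_all(4)
```

This is a sketch of the mechanism only; as noted above, kernel launches still take locks on the CPU side, so the total time should not be expected to drop to that of a single iteration.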

That’s not very helpful (weird, lies, pretends), and not something I can offer any useful reply to. Take a look at this JuliaCon talk, https://www.youtube.com/watch?v=fw0R5G8pB0U, where these concepts are demonstrated.
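On the earlier question about batching when “n” exceeds the number of CPU threads: the following CPU-only sketch suggests that manual batches of 12 should not be needed, because `Threads.@threads` distributes all iterations over whatever threads are available. The variable `owner` is a hypothetical helper used only to record which thread ran each iteration. It does not reproduce or explain the error reported for “n=13”, whose cause is not stated in the thread.

```julia
using Base.Threads

n = 13                      # deliberately larger than a typical thread count of 12
owner = zeros(Int, n)       # owner[index] records the thread that ran iteration index

@threads for index in 1:n
    # Each iteration writes to its own slot, so there is no data race.
    owner[index] = threadid()
end

# Every iteration has run exactly once, regardless of how many threads exist.
println("iterations completed: ", count(>(0), owner), " of ", n)
```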
