GPU is slower than CPU for findall on a CuArray

I’m writing code to generate Von Mises random numbers using this method, which is free of accept/reject steps.
It is based on this paper.

The issue appears to be with findall. Anyone can reproduce the behaviour with this simple script.

To generate 500k Von Mises-distributed random numbers, line 11 runs the whole algorithm, and it takes several seconds (~40 s). However, if I break that same function into two separate functions (lines 13 and 14), function _A runs very fast (0.06 s) and function _B takes about 40 s.

In particular, in function _B, line 198, which is `index1 = CUDA.findall(r .<= p[1,:])`, is what takes the bulk of the time.

I also noticed that if I run line 14 without running line 13, it takes about 0.02 s. This makes me think that when the CuArray p is generated on line 7, it is somehow cached, whereas when it is generated by function _A on line 13, it is not, and that’s what creates this huge lag.
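For concreteness, here is a minimal sketch of the pattern (the shapes and data are placeholders; the real script builds p from the Von Mises parameters):

```julia
using CUDA

# Placeholder data standing in for the script's arrays: p is a 2×N
# CuArray of probabilities and r a length-N CuArray of uniforms.
N = 500_000
p = CUDA.rand(2, N)
r = CUDA.rand(N)

# The line that appears to take the bulk of the time:
@time index1 = CUDA.findall(r .<= p[1, :])
```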

Can anyone point me in the right direction?

Thanks!

GPU operations are asynchronous wrt. the CPU, so some may appear to take little time while others “soak up” the time spent in previous GPU operations. You should either time operations individually by surrounding them with CUDA.@sync (forcing synchronization), or simply run your application under a GPU-aware profiler (e.g. CUDA.@profile), which will correctly report the device time spent. To add information about your CPU-side operations (e.g. the call to findall) to the trace, consider using NVTX.jl’s @range.
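For example, a minimal sketch with placeholder data (not the original script):

```julia
using CUDA, NVTX

x = CUDA.rand(500_000)

# Broadcast kernels launch asynchronously: @time alone may report
# almost nothing, with the cost surfacing in a later synchronizing call.
@time x .^ 2

# Synchronize inside the timer so device execution time is included:
@time CUDA.@sync x .^ 2

# CUDA.@time does the synchronization for you:
CUDA.@time x .^ 2

# A GPU-aware profile reports device time per operation:
CUDA.@profile findall(x .<= 0.5f0)

# NVTX ranges make CPU-side regions visible in an external trace
# (e.g. under NSight Systems):
NVTX.@range "findall" findall(x .<= 0.5f0)
```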


Using CUDA.@sync and CUDA.@time I was able to correctly pinpoint the step in my algorithm that was consuming all the runtime. It wasn’t findall. After fixing it, the code generates Von Mises random numbers with random parameters on the order of 100 ns.
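For reference, the timing idiom looks like this (a sketch with placeholder stages, not the actual Von Mises code):

```julia
using CUDA

x = CUDA.rand(500_000)

# CUDA.@time synchronizes, so each stage's cost is attributed to the
# stage that actually incurs it:
CUDA.@time y = x .* 2f0
CUDA.@time index1 = findall(y .> 1f0)
```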

Thanks @maleadt!
