GPU is slower than CPU for findall on a CuArray

I’m writing code to generate Von Mises random numbers using this method, which is free of accept/reject steps.
It is based on this paper.

The issue appears to be with findall. Anyone can reproduce the behaviour with this simple script.

To generate 500k Von Mises-distributed random numbers, line 11 runs the whole algorithm, and it takes several seconds (~40 s). However, if I break that same function into two separate functions (lines 13 and 14), function _A runs very fast (0.06 s) and function _B takes about 40 s.

In particular, in function _B, line 198, which is `index1 = CUDA.findall(r .<= p[1,:])`, is what takes the bulk of the time.

I also noticed that if I run line 14 without running line 13, it takes about 0.02 s. This makes me think that when the CuArray p is generated on line 7, it is somehow cached, whereas when it is generated by function _A on line 13, it is not, and that’s what creates this huge lag.
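For concreteness, here is a minimal sketch of the pattern (the shapes and data are placeholders; the real script builds p from the Von Mises parameters):

```julia
using CUDA

# Placeholder data standing in for the script's arrays: p is a 2×N
# CuArray of probabilities and r a length-N CuArray of uniforms.
N = 500_000
p = CUDA.rand(2, N)
r = CUDA.rand(N)

# The line that appears to take the bulk of the time:
@time index1 = CUDA.findall(r .<= p[1, :])
```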

Can anyone point me in the right direction?

Thanks!

GPU operations are asynchronous wrt. the CPU, so some may appear to take little time while others “soak up” the time spent in previous GPU operations. You should either time operations individually by surrounding them with CUDA.@sync (forcing synchronization), or simply run your application under a GPU-aware profiler (e.g. CUDA.@profile), which will correctly report the device time spent. To add information about your CPU-side operations (e.g. the call to findall) to the trace, consider using NVTX.jl’s @range.
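For example, a minimal sketch with placeholder data (not the original script):

```julia
using CUDA, NVTX

x = CUDA.rand(500_000)

# Broadcast kernels launch asynchronously: @time alone may report
# almost nothing, with the cost surfacing in a later synchronizing call.
@time x .^ 2

# Synchronize inside the timer so device execution time is included:
@time CUDA.@sync x .^ 2

# CUDA.@time does the synchronization for you:
CUDA.@time x .^ 2

# A GPU-aware profile reports device time per operation:
CUDA.@profile findall(x .<= 0.5f0)

# NVTX ranges make CPU-side regions visible in an external trace
# (e.g. under NSight Systems):
NVTX.@range "findall" findall(x .<= 0.5f0)
```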


Using CUDA.@sync and CUDA.@time I was able to correctly pinpoint the step in my algorithm that was consuming all the runtime. It wasn’t findall. After fixing it, the code generates Von Mises random numbers with random parameters on the order of 100 ns.
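For reference, the timing idiom looks like this (a sketch with placeholder stages, not the actual Von Mises code):

```julia
using CUDA

x = CUDA.rand(500_000)

# CUDA.@time synchronizes, so each stage's cost is attributed to the
# stage that actually incurs it:
CUDA.@time y = x .* 2f0
CUDA.@time index1 = findall(y .> 1f0)
```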

Thanks @maleadt!
