I’m writing code to generate von Mises random numbers based on this method, which avoids accept/reject sampling.
It is based on this paper.
The issue appears to be with findall. Anyone can reproduce the behaviour with this simple script.
Line 11 runs the whole algorithm to generate 500k von Mises distributed random numbers, and it takes a long time (~40 s). However, if I split that same function into two separate functions (lines 13 and 14), function _A runs very fast (0.06 s) and function _B takes about 40 s.
In particular, in function _B, line 198, which is index1 = CUDA.findall(r .<= p[1,:]), is what takes the bulk of the time.
I also noticed that if I run line 14 without first running line 13, it takes about 0.02 s. This makes me think that when the CuArray p is generated in line 7 it is somehow cached, whereas when it is generated by function _A in line 13 it is not cached, and that is what creates this huge lag.
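One thing that may be worth ruling out before blaming findall or caching: CUDA.jl queues GPU work asynchronously, so a timer around _A can stop before its kernels have finished, and the wait can then be charged to the first later call that has to synchronize. Below is a minimal sketch of timing the two stages with explicit synchronization. It assumes CUDA.jl is installed and a working GPU is available; stage_a, stage_b and timed are hypothetical stand-ins, not the functions from my script.

```julia
using CUDA

# Hypothetical stand-in for _A: queues some GPU work and returns a CuArray.
stage_a(n) = sqrt.(CUDA.rand(Float32, 2, n))

# Hypothetical stand-in for _B: mirrors the findall line quoted above.
stage_b(p, r) = findall(r .<= p[1, :])

# Run f(args...) and wait for all queued GPU work before stopping the clock,
# so the elapsed time covers execution and not only the kernel launches.
function timed(f, args...)
    t0 = time_ns()
    out = f(args...)
    CUDA.synchronize()
    return out, (time_ns() - t0) / 1e9
end

n = 500_000
r = CUDA.rand(Float32, n)

# Warm-up on a tiny input so compilation time is not counted below.
stage_b(stage_a(16), CUDA.rand(Float32, 16))
CUDA.synchronize()

p, t_a = timed(stage_a, n)
idx, t_b = timed(stage_b, p, r)

println("stage A: ", t_a, " s   stage B: ", t_b, " s")
```

If the time reported for the first stage grows to roughly the full runtime once synchronization is included, the cost is in _A rather than in findall.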
Can anyone point me in the right direction?
Thanks!