I’m writing code to generate von Mises random numbers based on this method, which avoids accept/reject sampling.
It is based on this paper.
The issue appears to be with findall. Anyone can reproduce the behaviour by using this simple script.
To generate 500k von Mises distributed random numbers, line 11 runs the whole algorithm, and it takes about 40s. However, if I break that same function into two separate functions (lines 13 and 14), function _A runs very fast (0.06s) and function _B takes about 40s.
In particular, line 198 in function _B, which is index1 = CUDA.findall(r .<= p[1,:]), is what takes most of the time.
I also noticed that if I run line 14 without running line 13 first, it takes about 0.02s. This makes me think that when the CuArray p is generated in line 7, it is somehow cached, whereas when it is generated by function _A in line 13, it is not cached, and that is what creates this huge lag.
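One way to test the caching guess is to time each step with an explicit synchronization. This is a minimal sketch, not the actual script: the array sizes, the random stand-in for p, and the helper mask_of are hypothetical. It assumes the measured times may be skewed because GPU work in CUDA.jl is launched asynchronously, so a step that only queues kernels can look fast, while a later step that needs a result on the host (such as findall) ends up waiting for everything queued before it.

```julia
using CUDA

n = 500_000
r = CUDA.rand(Float32, n)
p = CUDA.rand(Float32, 2, n)   # hypothetical stand-in for the matrix produced by _A

# Hypothetical helper: builds the Bool mask that is passed to findall.
mask_of(r, p) = r .<= p[1, :]

# Warm-up call so that kernel compilation is not included in the timings.
CUDA.@sync findall(mask_of(r, p))

# Without synchronization the timer may stop before the GPU has finished.
t_async = @elapsed mask_of(r, p)

# CUDA.@sync blocks until the GPU work launched by the expression is done.
t_sync = @elapsed CUDA.@sync mask_of(r, p)

# Time the mask plus findall together, again waiting for the GPU.
t_find = @elapsed CUDA.@sync findall(mask_of(r, p))

println((t_async = t_async, t_sync = t_sync, t_find = t_find))
```

If wrapping the call to _A in CUDA.@sync moves the ~40s from _B to _A, the cost is probably in _A rather than in findall or in any caching of p.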
Can anyone point me in the right direction?
Thanks!