Looking at the 1:(Nstep-1) loop, isn’t it an optimization to break out of loop at first passage time? Why go on when both outputs rt and resp are fixed?
Avoiding branches is good, but up to a limit (in the extreme, a forever loop avoids all branches )
That’s correct. In working with these in fortran in the past, with a little adaptation and prediction, it is usually considerably faster to simulate more than needed and then determine when first passage happens. Though you are of course correct that it is a tradeoff between more ops and branching.
Another optimization, might be to generate the random numbers on the fly. Accessing memory is very expensive, especially in the spread-out manner it is done within the Nstep loop. So, even calculating randn(3) could be faster. Furthermore, the modulus calculation to access the memory is slow, as it is a division operation - a known time-sink for CPUs. If it remains, it should be replaced with a bit-wise AND operation on a mask, and have the pre-calculated array be a power of 2 size.
Finally, since this is a simulation of stochastic process, the random step can be approximately normal (enough to have the right mean and variance). Summing a bunch of uniforms, might be enough and faster to generate (the reason is sum can be done on CPU and normal calculation uses a table - which eventually inhabits cache, but is still slow).
If you are still at this, measurement is the best way to decide between these alternatives.
Thanks. Some were on my radar, others not. Great to get the input. Still figuring out the cost / benefit of different things here. I worked it all out in fortran, but while Julia is much friendlier, the pitfalls aren’t all the same. The modulus observation is one that I didn’t think of.
In general Julia’s performance cost/benefits should be pretty similar to Fortran’s At the end of the day it all compiles down to CPU instructions. I think the difference here is that you are severely underestimating how fast rand (because we have very fast rand) so rand becomes cheaper than a lot of the alternatives.
Julia currently uses Xoshiro256++ for it’s random number generators and takes (on my 3.6 ghz zen2 machine) 3.3ns to generate a single Float64/UInt, but if you are requesting multiple random numbers at a time will be vectorized and only take 1.2ns per Float64/UInt. In other words, our rng has a latency of around 12 cycles and a throughput of 3 cycles per 64 bits of data. Also, the generator only has 256 bits of state, so we can give every task it’s own rng state which means that you can ask for random numbers from all of your threads without having data races.