Performance help for first passage time simulation

Looking at the 1:(Nstep-1) loop, isn’t it an optimization to break out of the loop at the first passage time? Why go on when both outputs, rt and resp, are already fixed?
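For concreteness, here is a minimal sketch of the early-exit idea being suggested. Nstep, rt, and resp are from the original code; the names x, bound, dt, and sigma are illustrative assumptions, not from the thread:

```julia
# Hypothetical sketch: stop the loop at first passage, since rt and resp
# are fixed from that point on. bound, dt, sigma are assumed parameters.
function first_passage_early_exit(bound, Nstep, dt, sigma)
    x = 0.0
    for i in 1:(Nstep - 1)
        x += sigma * sqrt(dt) * randn()   # one diffusion step
        if abs(x) >= bound                # first passage reached
            return i * dt, sign(x)        # rt, resp
        end
    end
    return Nstep * dt, 0.0                # no passage within Nstep steps
end
```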

Avoiding branches is good, but only up to a point (in the extreme, a forever loop avoids all branches :wink:)

That’s correct. Working with these in Fortran in the past, I found that with a little adaptation and prediction it is usually considerably faster to simulate more steps than needed and then determine afterwards when first passage happened. But you are of course right that it is a tradeoff between doing more ops and branching.
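A minimal sketch of that "simulate everything, find the crossing afterwards" approach: the inner loop is branch-free, and the first-passage index is recovered with a single scan at the end. As above, bound, dt, and sigma are assumed names:

```julia
# Hypothetical sketch: branch-free accumulation, then one pass to find
# the first crossing. Trades extra work for a tighter inner loop.
function first_passage_batch(bound, Nstep, dt, sigma)
    x = 0.0
    path = Vector{Float64}(undef, Nstep)
    @inbounds for i in 1:Nstep
        x += sigma * sqrt(dt) * randn()   # no branch in the hot loop
        path[i] = x
    end
    i = findfirst(v -> abs(v) >= bound, path)  # locate the crossing afterwards
    return i === nothing ? (Nstep * dt, 0.0) : (i * dt, sign(path[i]))
end
```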

Another optimization might be to generate the random numbers on the fly. Accessing memory is very expensive, especially in the spread-out manner it is done within the Nstep loop, so even calling randn(3) each iteration could be faster. Furthermore, the modulus calculation used to index into the pre-computed array is slow, since it is a division operation - a known time-sink for CPUs. If it stays, it should be replaced with a bitwise AND against a mask, with the pre-calculated array sized to a power of 2.
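A minimal sketch of the mask trick, assuming a pre-generated buffer `buf` (a name made up here) whose length is a power of two:

```julia
buf  = randn(2^16)            # power-of-two length is what makes this legal
mask = length(buf) - 1        # all-ones bit pattern, e.g. 0xffff

# slow: mod hides an integer division
idx_mod(i) = mod(i, length(buf)) + 1

# fast: a single AND instruction (valid only for power-of-two lengths)
idx_and(i) = (i & mask) + 1

# both give the same 1-based index for non-negative i
@assert all(idx_mod(i) == idx_and(i) for i in 0:10^5)
```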
Finally, since this is a simulation of a stochastic process, the random step only needs to be approximately normal (enough to have the right mean and variance). Summing a handful of uniforms might be good enough and faster to generate (the reason is that the sum is plain CPU arithmetic, while normal generation uses a lookup table - which eventually sits in cache, but is still slower).
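A minimal sketch of the sum-of-uniforms idea: the sum of 12 independent Uniform(0,1) draws has mean 6 and variance 12 × 1/12 = 1, so subtracting 6 gives an approximately standard normal variate by the central limit theorem:

```julia
using Statistics

# Approximate N(0,1) without a lookup table: plain adds of uniforms.
approx_randn() = sum(rand() for _ in 1:12) - 6.0

xs = [approx_randn() for _ in 1:10^6]
mean(xs), var(xs)   # should come out close to 0 and 1
```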

If you are still at this, measurement is the best way to decide between these alternatives.
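For example, BenchmarkTools makes this kind of comparison easy; a sketch comparing the two normal generators discussed above:

```julia
using BenchmarkTools

@btime randn()                          # table-based normal generator
@btime sum(rand() for _ in 1:12) - 6.0  # sum-of-uniforms approximation
```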


Thanks. Some of these were on my radar, others not - great to get the input. I’m still figuring out the cost/benefit of different things here. I worked it all out in Fortran, but while Julia is much friendlier, the pitfalls aren’t all the same. The modulus observation is one I hadn’t thought of.


In general, Julia’s performance costs/benefits should be pretty similar to Fortran’s; at the end of the day it all compiles down to CPU instructions. I think the difference here is that you are severely underestimating how fast rand is (because we have a very fast rand), so rand becomes cheaper than a lot of the alternatives.

Julia currently uses Xoshiro256++ for its random number generator. On my 3.6 GHz Zen 2 machine it takes 3.3 ns to generate a single Float64/UInt64, but requests for multiple random numbers at a time are vectorized and take only 1.2 ns per Float64/UInt64. In other words, our RNG has a latency of around 12 cycles and a throughput of 3 cycles per 64 bits of data. Also, the generator has only 256 bits of state, so we can give every task its own RNG state, which means you can ask for random numbers from all of your threads without data races.
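A minimal sketch of those two points - scalar versus bulk generation, and the per-task RNG state that makes threaded rand safe (buf and acc are names made up for this example):

```julia
using Random

rand()                           # scalar draw: pays the generator's latency
buf = Vector{Float64}(undef, 1024)
rand!(buf)                       # bulk draw: vectorized, better throughput

# Per-task RNG state: each task has its own Xoshiro stream, so concurrent
# rand() calls from multiple threads do not race and need no lock.
acc = zeros(Threads.nthreads())
Threads.@threads for t in 1:Threads.nthreads()
    acc[t] = sum(rand() for _ in 1:10^5)
end
```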