Performance help for first passage time simulation

I’m relatively new to Julia but it seems fantastic! I am still learning optimization tricks and am hoping someone can help.

More details below, but my basic problem is a SIMD-type workload: run 10k relatively fast, independent copies of the same stochastic simulation and store their results. When I run these 10k simulations and don't store anything, the timing is ~100 microseconds. But when I try to store the output of each (two scalars per simulation), it grinds to a halt at ~100 milliseconds, a 1,000x slowdown just for storing the output (in pre-allocated, typed arrays).

##### Scientific problem of interest
I am specifically working on first passage time (FPT) problems. I need to simulate a system of 3 SDEs to determine which one reaches a threshold first, and I need to do it 10,000 times to generate an FPT distribution. Given a set of three 1D SDEs (dx_i = v_i*dt + sig*dW_i), I need to simulate the first time at which any of them achieves x_i > threshold and record that time and the corresponding index (which i). That is, for each SDE simulation, I need to record two scalars (t, i). I then need to do this 10k times and record 10k pairs (t, i). I do NOT need to store the simulation state history. It is a simple problem, but I need it to be as fast as possible because this will be wrapped in Bayesian inference. I have therefore written code (below) that generates random-enough numbers for stochastic simulation and avoids branching (no if statements that would break SIMD).
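
Concretely, the Euler-Maruyama update used below for each accumulator is

    x_i(t + dt) = x_i(t) + v_i*dt + sig*sqrt(dt)*xi,   xi ~ N(0,1)

and the recorded pair is (t, i) for the first step at which any x_i exceeds the threshold a.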

##### Coding / performance problem
I have code that can compute 10k FPTs for a given system of SDEs in about 100 microseconds, as long as I don't try to record (t, i). If I try to record (t, i) for each of the 10k simulations, it slows down to 100 milliseconds, a 1,000x slowdown. To be clear, this slowdown is not a result of calculating (t, i) 10k times; it results only from recording it in the pre-allocated vectors rt and resp. I've inserted a screenshot of the slow benchmark below (Discourse won't let me insert two images). The only difference between the slow and fast versions is that in the fast version, the two lines (127, 128) that store the results of each individual simulation in the pre-allocated structure (rt[sim_ind] = rt_keep) are commented out. So in the slow version, 10k (t, i) pairs are computed and stored, and in the fast version, 10k (t, i) pairs are computed but not stored. This leads to a 1,000x slowdown. Note that both versions have the same number of allocations (from the benchmark).

I have tried extracting the loop that performs each single simulation and wrapping it in a "single_sim" function. I've then tried different things with this (a naive loop, a loop with @inline @simd, map, and @. loop fusion). None of them help. Also, this is not directly an allocation issue: the same allocations occur in both the slow (with storage) and fast (without storage) versions, since the storage vectors are pre-allocated and typed.

Any thoughts you have would be a great help.

##### Screenshots

(screenshot of the slow benchmark omitted)

##### Code

using BenchmarkTools
using Random
using LoopVectorization
using ProfileVega
using Profile


# Set simulation parameters: params = (v1, v2, v3, a, sig)
params = (0.8 , 1.0 , 1.2 , 1.5 , 1.0);

Nsim = 10000;   dt = 0.005; Nstep = 1000; # Initialize some simulation specifics

rv = randn(100000000); # Pre-generate a large vector (10^8) of standard normal random numbers.


function Simulate3(params,Nsim,Nstep,dt,rand_vec)

    # This function simulates first passage times for a system of three independent stochastic accumulators (x1,x2,x3)
    # defined by simple SDEs. Each accumulator is simulated with standard Euler-Maruyama. When the first accumulator
    # reaches the threshold, its index (1,2,3) and the time are recorded. In order to prevent branching, all simulations
    # are run to a fixed end time independent of the time of first passage.

    # Since random number generation is expensive, a vector of 10^8 randn numbers is pregenerated and passed to the
    # function. In order to generate pseudo-random sequences from this pre-generated vector (rand_vec), each simulation
    # and each accumulator specify a random "start_index" and "stride", and the indices drawn from rand_vec are
    # determined by     ind = start_ind + step_num * stride     where modular arithmetic is used to keep
    # these indices within the bounds of rand_vec.

    # This code only needs to store the passage times and indices, not the full state history of the SDEs.


    rand_size = length(rand_vec) # Length of the stored random vector. Needed for safe indexing under @inbounds.

    # Initialize output storage vectors. Will need to fill these in on each simulation iteration. These are returned
    rt = Vector{Float64}(undef,Nsim)
    resp = zeros(Int64,Nsim)

    # # Initialize storage vectors. For this, will use push! to extend. 
    # rt = Vector{Float64}(undef,0)
    # resp = Vector{Int64}(undef,0)

    v1 , v2 , v3 , a , sig = params; # Extract parameters

    sig_dt = sig * sqrt(dt) # Set the sigma*sqrt(dt) for the diffusion part of the sde

    # Seed the starting index for each SDE to pull from rand_vec.
    start_ind1 = rand(1:rand_size,Nsim);    start_ind2 = rand(1:rand_size,Nsim);    start_ind3 = rand(1:rand_size,Nsim)

    # Seed the stride (one per simulation) with which to move through rand_vec. Strides are made odd to reduce the chance of index cycling.
    stride1 = rand(1:100,Nsim).*2 .- 1;  stride2 = rand(1:100,Nsim).*2 .- 1;  stride3 = rand(1:100,Nsim).*2 .- 1

    ##### LOOP OVER SIMULATIONS
    @inbounds @simd for sim_ind = 1:Nsim    # Loop over simulation number

        xx1 = 0.0 # Initial condition for SDE simulations
        xx2 = 0.0
        xx3 = 0.0

        bool = 0            # Initialize tracker for whether a boundary has been crossed
        boolt = 0           # Book keeping tracker.
        rt_keep = 100000.0    # Set dummy value in case not assigned. Float64.
        resp_keep = 100000    # Set dummy value in case not assigned. Int64

        stride_ind_params = (start_ind1[sim_ind] , start_ind2[sim_ind] , start_ind3[sim_ind] ,
                            stride1[sim_ind] , stride2[sim_ind] , stride3[sim_ind])

        ##### LOOP OVER SIMULATION TIME STEP
        for i = 1:(Nstep-1) # Loop over simulation time step

            si1 , si2 , si3 , st1 , st2 , st3 = stride_ind_params

            t = dt*i    # Extract current time

            # The following code extracts random numbers from a pre-generated array of randn numbers.
            # These numbers are extracted by indexing from a random start_index in rand_vec and then
            # striding through the rand_vec with a randomly chosen stride.

            ind1 = mod(si1 + i * st1,rand_size)+1
            rval1 = rand_vec[ind1]

            ind2 = mod(si2 + i * st2,rand_size)+1
            rval2 = rand_vec[ind2]

            ind3 = mod(si3 + i * st3,rand_size)+1
            rval3 = rand_vec[ind3]

            # Simple Euler-Maruyama update.
            xx1 = xx1 + v1*dt + sig_dt*rval1
            xx2 = xx2 + v2*dt + sig_dt*rval2
            xx3 = xx3 + v3*dt + sig_dt*rval3

            ##### All code BELOW is intended to prevent branching. #####

            # These lines determine whether a boundary has been crossed by any of the 3 SDE's.
            # If so, the code extracts which SDE did (xx1,xx2,xx3) and at what time.
            # It then sets boolt = 1 if this has occurred and boolt = 0 if not.
            (rtt1,respt1,boolt1) = ifelse(xx1>a && bool==0,(t,1,1),(100000.0,100000,0))
            (rtt2,respt2,boolt2) = ifelse(xx2>a && bool==0,(t,2,1),(100000.0,100000,0))
            (rtt3,respt3,boolt3) = ifelse(xx3>a && bool==0,(t,3,1),(100000.0,100000,0))

            rtt = min(rtt1,rtt2,rtt3)
            respt = min(respt1,respt2,respt3)
            boolt = max(boolt1,boolt2,boolt3)

            # boolt only tells us whether one of the SDE's is "currently" above threshold.
            # Here we make sure bool stays at 1 if passage has occurred at any prior time.
            # We further store the original crossing time (in rt_keep) and the original sde index (in resp_keep)
            bool = max(bool,boolt)
            rt_keep = min(rtt,rt_keep)
            resp_keep = min(respt,resp_keep)

            ##### All code ABOVE is intended to prevent branching. #####

        end

        ########## THIS IS WHAT GRINDS THE CODE TO A HALT ##########
        ########## When uncommented, speed is ~100 milliseconds. When commented, speed is ~100 microseconds ##########
        # Store simulation output in pre-allocated vectors.
        rt[sim_ind] = rt_keep        # Store output
        resp[sim_ind] = resp_keep    # Store output
    end

    
    return resp , rt
end



@benchmark Simulate3(params,Nsim,Nstep,dt,rv)

Does the same phenomenon happen when you put dollar signs in to interpolate the global variables during the benchmark? See Manual · BenchmarkTools.jl
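
That is, something like:

@benchmark Simulate3($params, $Nsim, $Nstep, $dt, $rv)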

This has no effect (I put the $ in just for good measure). Four of the "Simulate3" inputs are immutable and the fifth is a Vector. Inserting the $ does remove one allocation, but does not affect timing.

This is expected. The arguments are passed in the same way in both the "slow" and "fast" versions. The only difference is that one computes but does not store the results, while the other computes AND stores. The storage leads to a 2,000x slowdown even though I'm only storing one Float64 and one Int64 for each of the 10k simulations. It seems like asking for storage is serializing the loop over independent simulations, but I'm not sure.

I wonder whether rand_vec is creating excessive memory pressure. I've seen performance fall off a cliff with really large vectors. I recommend generating values on the fly to eliminate the memory pressure. For example, you could do the following:

xx1 = xx1 + v1*dt + sig_dt*randn()
xx2 = xx2 + v2*dt + sig_dt*randn()
xx3 = xx3 + v3*dt + sig_dt*randn()

I think you could do something similar with this line of code:

      stride_ind_params = (start_ind1[sim_ind] , start_ind2[sim_ind] , start_ind3[sim_ind] ,
                          stride1[sim_ind] , stride2[sim_ind] , stride3[sim_ind])

As far as I can tell, it is not necessary to keep all of the values in the start_ind vectors, so you might be able to generate them on the fly as well, as in the sketch below.
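
For example, a sketch using the same ranges as your setup code:

    # inside the simulation loop, replacing the Nsim-length setup vectors
    si1 = rand(1:rand_size); si2 = rand(1:rand_size); si3 = rand(1:rand_size)
    st1 = 2*rand(1:100) - 1; st2 = 2*rand(1:100) - 1; st3 = 2*rand(1:100) - 1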

In general, preallocating large vectors is a good strategy for a language like Matlab because it maximizes the time spent in optimized C++ libraries, but it's not a good strategy for Julia (unless you need to keep all of the values) because it incurs unnecessary memory allocations.

I suspect that will help some. Another thing you can do is use @code_warntype to identify type instabilities: cases in which a variable's type changes or the compiler fails to infer it. Type instabilities prevent the compiler from optimizing the code. If you try those and report back, I (and/or others) can try to help identify more improvements.
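
For example:

@code_warntype Simulate3(params, Nsim, Nstep, dt, rv)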

Thanks for the suggestion. Two things.

  1. The particular problem I work on requires generating sequences of random numbers. If you generate them on the fly as you mention, you need 10^12 - 10^13 of them, which is very slow. A common trick, which I use here, is to generate 10^9 and just draw sequences from that pre-generated random vector. In this way, I can generate 10^3 fewer numbers while still getting sufficiently random sequences for my purpose.

  2. This is not the bottleneck. i) If you reduce the size of that random vector, performance is unchanged. ii) The "fast" and "slow" versions, which differ by 2,000x, differ only in the lines

rt[sim_ind] = rt_keep        # Store output
resp[sim_ind] = resp_keep    # Store output

When you comment these book-keeping lines out, the code is 2,000x faster (200 microseconds versus 400 milliseconds). So this iterative storage of simulation outputs is the bottleneck, not random number generation or the actual simulation specifics.

Is it possible that the compiler is just being smart when you comment out those lines? If those lines are commented, the entire big loop becomes irrelevant to the function's return value, so maybe the compiler just skips it entirely?
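
A small self-contained illustration of that effect (whether the loop is actually eliminated depends on the Julia/LLVM version):

using BenchmarkTools

# if a loop's result never escapes the function, the optimizer may delete the loop entirely
function no_escape(n)
    acc = 0.0
    for i in 1:n
        acc += i * 1.0e-9
    end
    return nothing      # acc is dead: the loop can be removed
end

function escapes(n)
    acc = 0.0
    for i in 1:n
        acc += i * 1.0e-9
    end
    return acc          # acc is live: the loop must execute
end

@btime no_escape(10^7)  # near-constant time if the loop was eliminated
@btime escapes(10^7)    # scales with n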

I removed the use of the rand_vec and observed a large speed up.

Simulate3

BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  718.478 ms … 830.564 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     803.505 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   797.043 ms ±  36.392 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁                                         ▁   █   ▁  ▁      ▁  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁█▁▁█▁▁▁▁▁▁█ ▁
  718 ms           Histogram: frequency by time          831 ms <

Simulate3a

BenchmarkTools.Trial: 36 samples with 1 evaluation.
 Range (min … max):  138.931 ms … 142.801 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     139.602 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   139.873 ms ± 944.747 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  ▃▃█ █ █  █ █            ▃                                   
  █▇▁███▁█▇█▁▇█▇█▇▁▇▇▁▁▁▁▁▁▁▁█▁▁▁▁▇▁▁▁▁▇▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▇ ▁
  139 ms           Histogram: frequency by time          143 ms <

 Memory estimate: 390.89 KiB, allocs estimate: 11.

Please verify in the code below that I did not introduce an error.

Updated Code
function Simulate3a(params,Nsim,Nstep,dt)

    # This function simulates first passage times for a system of three independent stochastic accumulators (x1,x2,x3)
    # defined by simple SDEs. Each accumulator is simulated with standard Euler-Maruyama. When the first accumulator
    # reaches the threshold, its index (1,2,3) and the time are recorded. In order to prevent branching, all simulations
    # are run to a fixed end time independent of the time of first passage.

    # Random numbers are now generated on the fly with randn(). The start_index/stride machinery from the
    # pregenerated-vector version is left in place below, but the computed indices are no longer used.

    # This code only needs to store the passage times and indices, not the full state history of the SDEs.


    rand_size = 100000000 # Size formerly taken from the stored random vector; kept so the start/stride setup below still runs.

    # Initialize output storage vectors. Will need to fill these in on each simulation iteration. These are returned
    rt = Vector{Float64}(undef,Nsim)
    resp = zeros(Int64,Nsim)

    # # Initialize storage vectors. For this, will use push! to extend. 
    # rt = Vector{Float64}(undef,0)
    # resp = Vector{Int64}(undef,0)

    v1 , v2 , v3 , a , sig = params; # Extract parameters

    sig_dt = sig * sqrt(dt) # Set the sigma*sqrt(dt) for the diffusion part of the sde

    # Seed the starting index for each SDE to pull from rand_vec.
    start_ind1 = rand(1:rand_size,Nsim);    start_ind2 = rand(1:rand_size,Nsim);    start_ind3 = rand(1:rand_size,Nsim)

    # Seed the stride (one per simulation) with which to move through rand_vec. Strides are made odd to reduce the chance of index cycling.
    stride1 = rand(1:100,Nsim).*2 .- 1;  stride2 = rand(1:100,Nsim).*2 .- 1;  stride3 = rand(1:100,Nsim).*2 .- 1

    ##### LOOP OVER SIMULATIONS
    @inbounds @simd for sim_ind = 1:Nsim    # Loop over simulation number

        xx1 = 0.0 # Initial condition for SDE simulations
        xx2 = 0.0
        xx3 = 0.0

        bool = 0            # Initialize tracker for whether a boundary has been crossed
        boolt = 0           # Book keeping tracker.
        rt_keep = 100000.0    # Set dummy value in case not assigned. Float64.
        resp_keep = 100000    # Set dummy value in case not assigned. Int64

        stride_ind_params = (start_ind1[sim_ind] , start_ind2[sim_ind] , start_ind3[sim_ind] ,
                            stride1[sim_ind] , stride2[sim_ind] , stride3[sim_ind])

        ##### LOOP OVER SIMULATION TIME STEP
        for i = 1:(Nstep-1) # Loop over simulation time step

            si1 , si2 , si3 , st1 , st2 , st3 = stride_ind_params

            t = dt*i    # Extract current time

            # Leftover from the pregenerated-vector version: these indices are
            # computed but no longer used, since randn() is now called directly.
            ind1 = mod(si1 + i * st1,rand_size)+1
            ind2 = mod(si2 + i * st2,rand_size)+1
            ind3 = mod(si3 + i * st3,rand_size)+1

            # Simple Euler-Maruyama update.
            xx1 = xx1 + v1*dt + sig_dt*randn()
            xx2 = xx2 + v2*dt + sig_dt*randn()
            xx3 = xx3 + v3*dt + sig_dt*randn()

            ##### All code BELOW is intended to prevent branching. #####

            # These lines determine whether a boundary has been crossed by any of the 3 SDE's.
            # If so, the code extracts which SDE did (xx1,xx2,xx3) and at what time.
            # It then sets boolt = 1 if this has occurred and boolt = 0 if not.
            (rtt1,respt1,boolt1) = ifelse(xx1>a && bool==0,(t,1,1),(100000.0,100000,0))
            (rtt2,respt2,boolt2) = ifelse(xx2>a && bool==0,(t,2,1),(100000.0,100000,0))
            (rtt3,respt3,boolt3) = ifelse(xx3>a && bool==0,(t,3,1),(100000.0,100000,0))

            rtt = min(rtt1,rtt2,rtt3)
            respt = min(respt1,respt2,respt3)
            boolt = max(boolt1,boolt2,boolt3)

            # boolt only tells us whether one of the SDE's is "currently" above threshold.
            # Here we make sure bool stays at 1 if passage has occurred at any prior time.
            # We further store the original crossing time (in rt_keep) and the original sde index (in resp_keep)
            bool = max(bool,boolt)
            rt_keep = min(rtt,rt_keep)
            resp_keep = min(respt,resp_keep)

            ##### All code ABOVE is intended to prevent branching. #####

        end

        ########## THIS IS WHAT GRINDS THE CODE TO A HALT ##########
        ########## When uncommented, speed is ~100 milliseconds. When commented, speed is ~100 microseconds ##########
        # Store simulation output in pre-allocated vectors.
        rt[sim_ind] = rt_keep        # Store output
        resp[sim_ind] = resp_keep    # Store output
    end

    
    return resp , rt
end

Also, the use of @inbounds @simd does not appear to be helping performance. I’m not sure when those are useful.

That‘s one FPT in 10ns. It sounds like your compiled function is basically a no-op.


Looks like that may be the case. I've commented code in and out, and I think you are right: the compiler may just be smart enough to skip some inner loops entirely. Lesson learned! That doesn't happen with my Fortran compiler, though I don't use any of its optimization options.


Julia by default runs at -O2 while many Fortran compilers are -O0 by default.

Yes, the code is correct; it just replaces random number shuffling with random number generation. Coming from Fortran and, more recently, Cython, this is unexpected. In those, the cost of random number generation was larger than the cost of a memory read. In Julia, at least as I have it coded, reading from a pre-generated random vector costs more than generating random numbers on the fly.

To your second point, you are correct: @simd and @inbounds don't appear to be doing much. The compiler is probably already handling what @simd would, and eliminating the bounds checks probably saves too little to matter here.
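
For reference, one way to check is to inspect the generated code and look for vector instructions, e.g.:

@code_llvm debuginfo=:none Simulate3a(params, Nsim, Nstep, dt)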

Thanks Oscar. I've worked in compiled languages for much of my career, but I don't know the least bit about compilers themselves. It didn't occur to me that Julia's compiler would take that much liberty. As I said before, lesson learned. And thanks!

I’m not sure how Fortran and Cython work. It would be interesting to know why they are faster for that coding style.

One thing to check is whether the updated version scales with the number of simulations (if you tend to run more than 10,000). It's possible that one version scales better than the other, and performance might depend on input size.
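
For example, something like:

using BenchmarkTools

for n in (10_000, 100_000, 1_000_000)
    @btime Simulate3a($params, $n, $Nstep, $dt)
end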

Particularly since this can be trivially parallelized over Nsim, not having to access the same array of precomputed numbers can be very advantageous.
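
For example, a minimal sketch of the pattern (the per-iteration body here is a stand-in, not the real simulation kernel):

using Base.Threads

function simulate_all!(rt, resp, Nsim)
    @threads for sim_ind in 1:Nsim
        t, i = rand(), rand(1:3)     # stand-in for one full FPT simulation
        rt[sim_ind] = t              # each iteration writes only its own slot,
        resp[sim_ind] = i            # so no synchronization is needed
    end
    return rt, resp
end

rt = Vector{Float64}(undef, 10_000); resp = Vector{Int}(undef, 10_000)
simulate_all!(rt, resp, 10_000)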

Those are all additive-noise SDEs, so why not just use SRA1 or a high weak order method (SDE Solvers · DifferentialEquations.jl) and cut the number of simulations by a good order of magnitude due to faster convergence?
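
A rough, untested sketch of that setup (the specific choices here, such as tspan and how the crossing index is recovered, are illustrative, not a vetted implementation):

using DifferentialEquations

p = [0.8, 1.0, 1.2, 1.5, 1.0]                      # v1, v2, v3, a, sig

f!(du, u, p, t) = (du .= @view p[1:3])             # drift: v_i
g!(du, u, p, t) = (du .= p[5])                     # additive (constant) noise

prob = SDEProblem(f!, g!, zeros(3), (0.0, 5.0), p)

# stop a trajectory when any accumulator reaches the threshold a = p[4]
condition(out, u, t, integrator) = (out .= u .- integrator.p[4])
affect!(integrator, idx) = terminate!(integrator)
cb = VectorContinuousCallback(condition, affect!, 3)

eprob = EnsembleProblem(prob)
sols = solve(eprob, SRA1(), EnsembleThreads(); trajectories = 10_000, callback = cb)

# first passage time and (approximately) which accumulator crossed;
# trajectories that never cross simply run to the end of tspan
fpts = [(s.t[end], findmax(s.u[end])[2]) for s in sols]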

Also, this is super amenable to GPU acceleration, and GitHub - SciML/DiffEqGPU.jl: GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem does the high weak order with GPU.

Then the other thing to do is a variance reduction technique like Multiscale Monte Carlo. There are just so many algorithmic improvements available here that hyper-optimizing an expensive solution method will not get you close to something optimal.

The purpose here is to generate a simulated likelihood function for the stochastic first passage time problem (the probability that any one of N SDEs reaches an absorbing boundary first, and at what time).

Using a better time stepping algorithm and increasing the time step size is the next step. But since I'm coming over from Fortran, I wanted to try something simple first to build some understanding of Julia. I cannot cut the number of simulations, though, as that affects the quality of the likelihood function construction. This is a kernel density estimation issue, not a simulation issue.

I'm not familiar with how multi-scale Monte Carlo would help here. The SDEs I am considering are pretty vanilla. The trick is not solving them, but solving them many times per parameter set to generate first passage time distributions, and then embedding that into MCMC for fitting and inference.
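
For concreteness, the kind of construction I have in mind is something like this (KernelDensity.jl is just one option, and the loglik helper is hypothetical):

using KernelDensity

resp, rt = Simulate3(params, Nsim, Nstep, dt, rv)

# one (defective) FPT density per response, weighted by response probability
dens   = [kde(rt[resp .== i]) for i in 1:3]
prob_i = [count(==(i), resp) / length(resp) for i in 1:3]

loglik(t, i) = log(pdf(dens[i], t) * prob_i[i])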

Regarding GPU, yes, I'm hoping Julia's tools will help with that, though I haven't gotten into it yet.

Thanks all for the input. Am having to learn a lot of new lessons with Julia.

Sorry, multilevel. This is what multilevel Monte Carlo methods help with. If you're estimating moments, they use a coupling between discretization levels to reduce the variance of the estimators, thus requiring fewer samples for an accurate estimate of the moments.

That makes more sense. However, I'm not trying to estimate the moments of a distribution, but rather the full distribution. Moments don't usually provide a set of sufficient statistics unless you use a lot of them, so when used within parameter estimation they can lead to biased parameter estimates. Density estimation doesn't have this issue and has quantifiable bias and variance properties, unless there have been updates I've missed. I'll check out the refs you sent.

BTW, where is the source code for the great solvers within DifferentialEquations.jl? I've already looked through the src files on GitHub, but I'm assuming they're being called from a different package. I'm hoping to eventually use them as a base for solving the FPT problem of interest, since they can likely be adapted readily (even if not called directly).

The solvers are in StochasticDiffEq.jl