I’m relatively new to Julia but it seems fantastic! I am still learning optimization tricks and am hoping someone can help.

More details below, but my basic problem is that I have a simd type of problem; run 10k relatively fast and independent copies of the same stochastic simulation, and store their results. When I run these 10k simulations and don’t store, timing is 100 micro-sec. But when I try to store the output of each (two scalers for each simulation), it grinds to a halt at 100 mili-sec, a 1,000x slowdown just for storing the output (in pre-allocated and typed arrays).

$$$$$ Scientific problem of interest

I am specifically working on first passage time (FPT) problems. Basically, I need to simulate a system of 3 SDEs to determine which one reaches a threshold first. And I need to do it 10,000 times to generate a FPT distribution. Given a set of thee 1D SDEs (dx_i = v_i * dt + sig *dw) I need to simulate the first time at which any of these achieve x_i > threshold and record the time and the appropriate index (which i). That is, for each SDE sim, I need to record two scalers (t,i). I then need to do this 10k times and record 10k pairs (t,i). I do NOT need to store the simulation state history. It is a simple problem but I need it to be as fast as possible as this will be wrapped in Bayesian inference. As such, I have written a code (below) that generates random *enough* numbers for stochastic simulation and avoids branching (no if statements that will break simd).

$$$$$ Coding / performance problem

I have code that can compute 10k FPTs for a given system of SDE’s in about 100 micro-sec, as long as I don’t try to record (t,i) . If I try to record (t,i) for each of the 10k simulations, it slows down to 100 mili-sec, a 1000x slow down. To be clear, this slow down is not a result of calculating (t,i) 10k times, it is only a result of recording it in allocated vectors *rt*, *resp*. I’ve inserted a screenshot of the slow benchmark below (discourse won’t let me insert two images). The only difference between the slow and fast versions is that in the fast, the two lines (127,128) that store the results of each individual simulation in the pre-allocated structure (rt[sim_ind] = rt_keep) are commented out. So in the slow, 10k (t,i) pairs are computed and stored, and in the fast, 10k (t,i) pairs are computed but not stored. This leads to a 1,000x slowdown. Note that both versions have the same number of allocations (from benchmark).

I have tried extracting the loop that performs each single simulation and wrapping it in a “single_sim” function. I’ve then tried to different things with this (naive loop, loop with @inline @simd, use map, @. for loop fusion). None of them help. Also, this is not an allocation issue directly. The same allocations are used for both the slow (with storage) and fast (without storage) versions since the storage vectors are pre-allocated and typed.

Any thoughts you have would be a great help.

$$$$$ Screenshots

$$$$$ Code

```
using BenchmarkTools
using Random
using LoopVectorization
using ProfileVega
using Profile
# Set simulation parameters. v1 , v2 , v3 , a , sig = ();
params = (0.8 , 1.0 , 1.2 , 1.5 , 1.0);
Nsim = 10000; dt = 0.005; Nstep = 1000; # Initialize some simulation specifics
rv = randn(100000000) ; # Initialize a large number of random numbers.
function Simulate3(params,Nsim,Nstep,dt,rand_vec)
# This function simulates first passage times for a system of three independent stochastic accumulators (x1,x2,x3)
# defined by simple SDE's. Each accumulator is simulated with standard Euler-Mayuarama. When the first accumulator
# reaches the threshold, its index (1,2,3) and the time is recorded. In order to prevent branching, all simulations
# are run to a fixed end time independent of the time if first passage.
# Since random number generation is expensive, a vector of 10^9 randn numbers are pregenerated and passed to the
# function. In order to generate pseudo random sequences from this pre-generated vector (rand_vec), each simulation
# and each accumulator specify a random "start_index" and "stride" and indexes to be drawn from rand_vec are
# determined by ind = start_ind + step_num * stride where appropriate modular arithmetic is used to ensure
# these indeces stay within the index bounds of rand_vec.
# This code only needs to store the passage times and indeces, not the full state history of the SDEs.
rand_size = size(rand_vec)[1] # Extract the size of the stored random vector being used. Needed to use @inbounds
# Initialize output storage vectors. Will need to fill these in on each simulation iteration. These are returned
rt = Vector{Float64}(undef,Nsim)
resp = zeros(Int64,Nsim)
# # Initialize storage vectors. For this, will use push! to extend.
# rt = Vector{Float64}(undef,0)
# resp = Vector{Int64}(undef,0)
v1 , v2 , v3 , a , sig = params; # Extract parameters
sig_dt = sig * sqrt(dt) # Set the sigma*sqrt(dt) for the diffusion part of the sde
# Seed the starting index for each SDE to pull from rand_vec.
start_ind1 = rand(1:rand_size,Nsim); start_ind2 = rand(1:rand_size,Nsim); start_ind3 = rand(1:rand_size,Nsim)
# Seed the stride with which to move through the rand_vec. Ensure they are odd to reduce chance of index cycling.
stride1 = rand(1:100).*2 .- 1; stride2 = rand(1:100).*2 .- 1; stride3 = rand(1:100).*2 .- 1
##### LOOP OVER SIMULATIONS
@inbounds @simd for sim_ind = 1:Nsim # Loop over simulation number
xx1 = 0.0 # Initial condition for SDE simulations
xx2 = 0.0
xx3 = 0.0
bool = 0 # Initialize tracker for whether a boundary has been crossed
boolt = 0 # Book keeping tracker.
rt_keep = 100000.0 # Set dummy value in case not assigned. Float64.
resp_keep = 100000 # Set dummy value in case not assigned. Int64
stride_ind_params = (start_ind1[sim_ind] , start_ind2[sim_ind] , start_ind3[sim_ind] ,
stride1[sim_ind] , stride2[sim_ind] , stride3[sim_ind])
##### LOOP OVER SIMULATION TIME STEP
for i = 1:(Nstep-1) # Loop over simulation time step
si1 , si2 , si3 , st1 , st2 , st3 = stride_ind_params
t = dt*i # Extract current time
# The following code extracts random numbers from a pre-generated array of randn numbers.
# These numbers are extracted by indexing from a random start_index in rand_vec and then
# striding through the rand_vec with a randomly chosen stride.
# ind1 = mod(start_ind1[sim_ind] + i * stride1[sim_ind],rand_size)+1
# rval1 = rand_vec[ind1]
# ind2 = mod(start_ind2[sim_ind] + i * stride2[sim_ind],rand_size)+1
# rval2 = rand_vec[ind2]
# ind3 = mod(start_ind3[sim_ind] + i * stride3[sim_ind],rand_size)+1
# rval3 = rand_vec[ind3]
ind1 = mod(si1 + i * st1,rand_size)+1
rval1 = rand_vec[ind1]
ind2 = mod(si2 + i * st2,rand_size)+1
rval2 = rand_vec[ind2]
ind3 = mod(si3 + i * st3,rand_size)+1
rval3 = rand_vec[ind3]
# Simple euler-maruyama update.
xx1 = xx1 + v1*dt + sig_dt*rval1
xx2 = xx2 + v2*dt + sig_dt*rval2
xx3 = xx3 + v3*dt + sig_dt*rval3
##### All code BELOW is intended to prevent branching. #####
# These lines determine whether a boundary has been crossed by any of the 3 SDE's.
# If so, the code extracts which SDE did (xx1,xx2,xx3) and at what time.
# It then sets boolt = 1 if this has occured and boolt = 0 if not.
(rtt1,respt1,boolt1) = ifelse(xx1>a && bool==0,(t,1,1),(100000.0,100000,0))
(rtt2,respt2,boolt2) = ifelse(xx2>a && bool==0,(t,2,1),(100000.0,100000,0))
(rtt3,respt3,boolt3) = ifelse(xx3>a && bool==0,(t,3,1),(100000.0,100000,0))
rtt = min(rtt1,rtt2,rtt3)
respt = min(respt1,respt2,respt3)
boolt = max(boolt1,boolt2,boolt3)
# boolt only tells us whether one of the SDE's is "currently" above threshold.
# Here we make sure bool stays at 1 if passage has occoured at any prior time.
# We further store the original crossing time (in rt_keep) and the original sde index (in resp_keep)
bool = max(bool,boolt)
rt_keep = min(rtt,rt_keep)
resp_keep = min(respt,resp_keep)
##### All code ABOVE is intended to prevent branching. #####
end
########## THIS IS THE WHAT GRINDS THE CODE TO A HALT ##########
########## When uncommented, speed is 100 mili-sec. When commented, speed is 100 micro-sec ##########
# Store simulation output in pre-allocated vectors.
rt[sim_ind] = rt_keep # Store output
resp[sim_ind] = resp_keep # Store output
end
return resp , rt
end
@benchmark Simulate3(params,Nsim,Nstep,dt,rv)
```