Then you should definitely just run each walker separately, i.e. swap the order of your loops. You don't even need arrays of walkers; a single array counting how many walkers died on each step, updated at the end of each walker's loop, suffices.
e.g.
```julia
using LinearAlgebra, StaticArrays

function foo3()
    n_walkers = 1000
    n_steps = 1000
    step_size = 4
    n_died = zeros(Int, n_steps) # number that died on step i
    for _ in 1:n_walkers
        xyz = @SVector zeros(3)
        for i in 1:n_steps
            xyz += normalize(@SVector(randn(3))) * step_size
            if any(xyz .> 10)
                n_died[i] += 1
                break
            end
        end
    end
    return n_died
end
```
(Note that the number alive on each step is simply `n_walkers .- cumsum(n_died)` at the end.) This is about 2x faster than `foo2()` on my machine. (Update: sorry, not 1000x, I misread the numbers.)
(To parallelize this, which would only be worthwhile if you vastly increased the number of walkers, you would need to synchronize the `n_died[i] += 1` statement. This isn't too expensive since it only occurs at the end of the inner loop. You could either have a single `n_died` array and put a mutex lock around the increment, or have per-thread `n_died` arrays and sum them together at the end. Or use MPI.jl and do an `allreduce` on `n_died` at the end.)
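As a rough sketch of the per-thread-accumulator option (the function name `foo3_threaded` is mine, not from the original; `:static` scheduling is used so that `threadid()` is safe to index with inside the loop):

```julia
using Base.Threads, LinearAlgebra, StaticArrays

function foo3_threaded(n_walkers=1000, n_steps=1000, step_size=4)
    # one n_died accumulator per thread, so no lock is needed on the increment
    n_died_per_thread = [zeros(Int, n_steps) for _ in 1:nthreads()]
    @threads :static for _ in 1:n_walkers
        n_died = n_died_per_thread[threadid()]
        xyz = @SVector zeros(3)
        for i in 1:n_steps
            xyz += normalize(@SVector(randn(3))) * step_size
            if any(xyz .> 10)
                n_died[i] += 1
                break
            end
        end
    end
    # elementwise sum of the per-thread arrays
    return reduce(+, n_died_per_thread)
end
```

Run with e.g. `julia -t 4`; with a single thread this reduces to the serial version above.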