Suppose we have a simple random walk simulation, where each walker in a collection takes a step in a random direction at each time step. If some condition is met for a walker at a given step, that walker dies and is removed from the simulation.
I was trying to come up with some efficient Julia code for this, and ended up with the following:
using LinearAlgebra, Random, StaticArrays
function foo()
    n_walkers = 1000
    n_steps = 1000
    xyz = [@MVector zeros(3) for _ in 1:n_walkers]   # coordinates of each walker
    steps = [@MVector zeros(3) for _ in 1:n_walkers] # direction of each step
    step_size = 4
    alive_walkers = trues(length(xyz))
    @time for _ in 1:n_steps
        Threads.@threads for i in findall(alive_walkers)
            normalize!(randn!(steps[i]))  # select a random direction and normalize it
            @. steps[i] = steps[i] * step_size
            @. xyz[i] = xyz[i] + steps[i]
            if any(xyz[i] .> 10)
                alive_walkers[i] = false
            end
        end
    end
end
I was wondering whether there would be any obvious ways to improve the code above and minimize the allocations.
A better option is to use SVector and write the whole update as a single expression, e.g. xyz[i] += step_size * normalize(@SVector randn(3)), which should be fast and non-allocating for SVectors. (For SVector, you don’t have to worry about doing “in-place” operations, in the same way that you wouldn’t worry about in-place operations on scalars.)
Note that this allocates a new array for the result of findall on every outer-loop step. It’s probably faster to do:
Threads.@threads for i in 1:n_walkers
    alive_walkers[i] || continue
    ...
end
You don’t even need to store the steps array at all, since you’re generating an entirely new step each time. Better to just ditch the array and do xyz[i] += step_size * normalize(@SVector randn(3)) directly, unless you need to save the most recent step for some other purpose (and even then, you could save the step only on the last outer iteration, or only when the walker “dies”).
function foo()
    n_walkers = 1000
    n_steps = 1000
    xyz = [@SVector zeros(3) for _ in 1:n_walkers]   # coordinates of each walker
    steps = [@SVector zeros(3) for _ in 1:n_walkers] # direction of each step
    step_size = 4
    alive_walkers = trues(length(xyz))
    @time for _ in 1:n_steps
        Threads.@threads for i in 1:n_walkers
            alive_walkers[i] || continue
            @views begin
                #=normalize!(randn!(steps[i])) # select random direction and normalize it=#
                #=@. steps[i] = steps[i] * step_size=#
                #=@. xyz[i] = xyz[i] + steps[i]=#
                xyz[i] += (steps[i] = step_size * normalize(@SVector randn(3)))
                if any(xyz[i] .> 10)
                    alive_walkers[i] = false
                end
            end
        end
    end
end
And it gives 0.014276 seconds (42.00 k allocations: 4.272 MiB), while the previous version gives 0.011462 seconds (42.00 k allocations: 4.272 MiB), so the allocations are exactly the same.
function foo()
    n_walkers = 1000
    n_steps = 1000
    xyz = [@SVector zeros(3) for _ in 1:n_walkers]   # coordinates of each walker
    step_size = 4
    alive_walkers = fill(true, length(xyz))
    @time for _ in 1:n_steps
        Threads.@threads for i in 1:n_walkers
            alive_walkers[i] || continue
            xyz[i] += step_size * normalize(@SVector randn(3))
            alive_walkers[i] = !any(xyz[i] .> 10)
        end
    end
end
foo()
results in 0.010088 seconds (42.00 k allocations: 4.272 MiB)
I tested with and without the views and it doesn’t seem to make any difference.
Oh that’s interesting, I didn’t know there would be any difference between these two.
Indeed, multithreading this one doesn’t do much, but this is just a toy example.
Normally I’d like to run something similar for 1e8 walkers on 72 threads, where it would make a lot more sense.
Regarding the other tips above, removing the “steps” array does make the code simpler and more elegant, but it seems the only bit which actually reduces allocations (from 44k to 42k) is this one:
function foo()
    n_walkers = 1000
    n_steps = 1000
    xyz = [@MVector zeros(3) for _ in 1:n_walkers]   # coordinates of each walker
    steps = [@MVector zeros(3) for _ in 1:n_walkers] # direction of each step
    step_size = 4
    alive_walkers = trues(length(xyz))
    for _ in 1:n_steps
        for i in 1:n_walkers
            alive_walkers[i] || continue
            normalize!(randn!(steps[i]))  # select a random direction and normalize it
            @. steps[i] = steps[i] * step_size
            @. xyz[i] = xyz[i] + steps[i]
            if any(xyz[i] .> 10)
                alive_walkers[i] = false
            end
        end
    end
end
function foo2()
    n_walkers = 1000
    n_steps = 1000
    xyz = [@SVector zeros(3) for _ in 1:n_walkers]   # coordinates of each walker
    step_size = 4
    alive_walkers = fill(true, length(xyz))
    for _ in 1:n_steps
        for i in 1:n_walkers
            alive_walkers[i] || continue
            xyz[i] += step_size * normalize(@SVector randn(3))
            alive_walkers[i] = !any(xyz[i] .> 10)
        end
    end
end
@btime foo() - 1.006 ms (2007 allocations: 78.41 KiB)
@btime foo2() - 3.535 ms (5 allocations: 24.59 KiB)
I think the particular example might be a bit unfortunate in that there will hardly be any walkers left after 1000 steps, i.e. the inner loop becomes very short. This may (or may not, depending on the real program) make the timings less relevant.
I’m now a bit confused here. How would you save the step in case you have to take a step back if a condition is met?
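With immutable SVectors a “step back” is cheap, because reading xyz[i] gives a value copy rather than a view. A minimal sketch of that idea (step_with_undo! is a hypothetical name; the step size and the 10 threshold come from the toy example above):

```julia
using LinearAlgebra, StaticArrays

# Hypothetical sketch: "saving the step" is just keeping the old position
# (or the step itself) in a local variable, since SVectors are copied by value.
function step_with_undo!(xyz, i, step_size)
    old = xyz[i]                       # value copy, not a view
    step = step_size * normalize(@SVector randn(3))
    xyz[i] = old + step
    if any(xyz[i] .> 10)               # condition met: take the step back
        xyz[i] = old
    end
    return step
end

xyz = [@SVector zeros(3) for _ in 1:10]
step_with_undo!(xyz, 1, 4.0)
```

The same pattern works if the undo condition is evaluated later: as long as you keep the returned step (or the old position) around, reverting is a single assignment.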
Also, if there’s a random number involved in the condition for the walker to die, is there a better alternative to
if rand() < 0.1
alive_walkers[i] = false
end
?
I was thinking it could be nice to repurpose the randn numbers from the step, but not sure whether converting randn to rand would be efficient enough.
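One small variant, in the same style as the alive_walkers[i] = !any(xyz[i] .> 10) line above, is to fold the random death test into a single write (cull! is a hypothetical name; 0.1 is the example probability from the question):

```julia
using Random

# Hypothetical sketch: a dead walker stays dead; a live one survives a given
# step with probability 1 - p_death.
function cull!(alive_walkers, p_death)
    for i in eachindex(alive_walkers)
        alive_walkers[i] &= rand() >= p_death
    end
    return alive_walkers
end

alive = trues(1000)
cull!(alive, 0.1)
```

As for reusing the randn draws from the step: a plain rand() call is already very cheap, and converting a normal draw to a uniform one would require evaluating the normal CDF (an erf call), so it probably wouldn’t pay off.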
I think you should reconsider your parallelization strategy in general. Since parallelization has overhead it usually is best to do it at the highest level possible. In your case all walkers are independent so I would recommend to keep foo() purely single threaded (and optimize as much as possible). Then to parallelize just have multiple threads each running their own foo() and merge the results afterwards. That also is a bit nicer in terms of code organisation because it separates the core logic/computation from the parallelization strategy.
Edit: also, did you check that this approach really is faster than just simulating each particle on its own? Depending on how many n_walkers you have and how quickly they die, the current approach could be very inefficient because you need to scan the whole array in each step.
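A sketch of that per-walker strategy (the names walk and simulate are mine; the threshold and parameters mirror the toy example), keeping the core logic serial and putting the threading only at the top level:

```julia
using LinearAlgebra, StaticArrays

# Serial core logic: simulate a single walker until it dies or runs out of steps
function walk(n_steps, step_size)
    x = @SVector zeros(3)
    for _ in 1:n_steps
        x += step_size * normalize(@SVector randn(3))
        any(x .> 10) && return x       # walker dies; return its final position
    end
    return x
end

# Parallelization layer: walkers are independent, so each thread simply runs
# the serial simulation on its own share of walkers. No alive-mask scanning.
function simulate(n_walkers, n_steps, step_size)
    final = Vector{SVector{3,Float64}}(undef, n_walkers)
    Threads.@threads for i in 1:n_walkers
        final[i] = walk(n_steps, step_size)
    end
    return final
end
```

This avoids scanning the full alive_walkers array on every step entirely, and the serial walk function can be optimized and tested in isolation from the threading.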