Optimizing a simple random walk simulation

aris · January 10, 2025, 6:23pm

Hello everyone,

Suppose we have a simple random walk simulation, where a collection of walkers take a step towards a random direction at each time step. For each walker at each step, if some condition is met, the walker dies and is removed from the simulation.

I was trying to come up with some efficient julia code for this, and ended up with the following:

using LinearAlgebra, Random, StaticArrays

function foo()

    n_walkers = 1000
    n_steps = 1000
    xyz = [@MVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    steps = [@MVector zeros(3) for _ in 1:n_walkers]; # direction of step
    step_size = 4
    alive_walkers = trues(length(xyz))

    @time for i in 1:n_steps

        Threads.@threads for i in findall(alive_walkers)

            normalize!(randn!(steps[i])) # select random direction and normalize it
            @. steps[i] = steps[i] * step_size
            @. xyz[i] = xyz[i] + steps[i]

            if any(xyz[i] .>10)
                alive_walkers[i] = false
            end
        end
    end
end

I was wondering whether there would be any obvious ways to improve the code above and minimize the allocations.

stevengj · January 10, 2025, 6:31pm

No point in using @MVector here, I would just use @SVector, which should be faster.

Then get rid of the @. in your loop, just do:

xyz[i] += (steps[i] = normalize(@SVector(randn(3)) * step_size)

which should be fast and non-allocating for SVectors. (For SVector, you don’t have to worry about doing “in-place” operations, in the same way that you wouldn’t worry about in-place operations on scalars.)

Note that this allocates a new array for the result of findall on every outer-loop step. Probably faster to do:

Threads.@threads for i in 1:n_walkers
    alive_walkers[i] || continue 
    ...
end

stevengj · January 10, 2025, 6:43pm

You don’t even need to store the steps array at all, since you’re generating an entirely new step each time. Better to just ditch the array and do:

xyz[i] += normalize(@SVector(randn(3)) * step_size

unless you need to save the most recent step for some other purpose (and even then, you could save the step only on the last outer iteration, or only when the walker “dies”).

aris · January 10, 2025, 6:44pm

Thank you for the reply Steven,

I tried your version as follows:

function foo()

    n_walkers = 1000
    n_steps = 1000
    xyz = [@SVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    steps = [@SVector zeros(3) for _ in 1:n_walkers]; # direction of step
    step_size = 4
    alive_walkers = trues(length(xyz))

    @time for i in 1:n_steps

        Threads.@threads for i in 1:n_walkers

            alive_walkers[i] || continue 

            @views begin
                #=normalize!(randn!(steps[i])) # select random direction and normalize it=#
                #=@. steps[i] = steps[i] * step_size=#
                #=@. xyz[i] = xyz[i] + steps[i]=#
                xyz[i] += (steps[i] = normalize(@SVector(randn(3)) * step_size))

                if any(xyz[i] .>10)
                    alive_walkers[i] = false
                end
            end
        end
    end
end

And it gives 0.014276 seconds (42.00 k allocations: 4.272 MiB), while the previous version gives 0.011462 seconds (42.00 k allocations: 4.272 MiB), and that’s exactly the same.

Not sure whether I’m missing something here.

aris · January 10, 2025, 6:46pm

this method leads to 0.015833 seconds (343.39 k allocations: 13.470 MiB), which is pretty much the same.

stevengj · January 10, 2025, 6:46pm

Don’t use @views. Just use

xyz[i] += normalize(@SVector(randn(3)) * step_size
alive_walkers[i] = !any(xyz[i] .>10)

Also,

allocates a BitVector, which is slower to read and write element-wise. I would just allocate it as a Vector{Bool}:

alive_walkers = fill(true, length(xyz))

aris · January 10, 2025, 6:52pm

function foo()

    n_walkers = 1000
    n_steps = 1000
    xyz = [@SVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    step_size = 4
    alive_walkers = fill(true, length(xyz))

    @time for i in 1:n_steps

        Threads.@threads for i in 1:n_walkers

            alive_walkers[i] || continue 

            xyz[i] += normalize(@SVector(randn(3)) * step_size)
            alive_walkers[i] = !any(xyz[i] .>10)
        end
    end
end

foo()

results in 0.010088 seconds (42.00 k allocations: 4.272 MiB)

I tested with or without the views and it doesn’t seem to make any difference.

Oh that’s interesting, didn’t know there would be any difference between these two.

stevengj · January 10, 2025, 6:55pm

I think your problem is too small to use threads, and your entire timing is dominated by the book-keeping necessary to set up and synchronize threads.

If I get rid of the @time and the @threads, and just run @btime foo() to time your entire function, it takes only 0.7ms. With @threads, it takes 16ms.

aris · January 10, 2025, 7:02pm

Thank you for the tips.

Indeed multithreading this one doesn’t do much, but this just the toy example.
Normally I’d like to run something similar for 1e8 walkers on 72 threads, where it would make a lot more sense.

Regarding the other tips above, removing the “steps” array does make the code more simple and elegant, but it seems the only bit which actually reduces allocations (from 44k to 42k) is this one

(please correct me if I’m wrong)

stevengj · January 10, 2025, 7:32pm

To the contrary, it does quite a lot — it completely distorts your benchmarking, because the threads overhead dominates your timing.

(Usually you want to optimize your single-threaded code before you worry much about threading, and then for threading you can focus on parallelism.)

aris · January 10, 2025, 7:36pm

Oh that makes sense!

Here are some numbers:

function foo()

    n_walkers = 1000
    n_steps = 1000
    xyz = [@MVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    steps = [@MVector zeros(3) for _ in 1:n_walkers]; # direction of step
    step_size = 4
    alive_walkers = trues(length(xyz))

    for i in 1:n_steps

        for i in 1:n_walkers

            alive_walkers[i] || continue 

            normalize!(randn!(steps[i])) # select random direction and normalize it
            @. steps[i] = steps[i] * step_size
            @. xyz[i] = xyz[i] + steps[i]

            if any(xyz[i] .>10)
                alive_walkers[i] = false
            end
        end
    end
end

function foo2()
    n_walkers = 1000
    n_steps = 1000
    xyz = [@SVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    step_size = 4
    alive_walkers = fill(true, length(xyz))

    for i in 1:n_steps

        for i in 1:n_walkers

            alive_walkers[i] || continue 

            xyz[i] += normalize(@SVector(randn(3)) * step_size)
            alive_walkers[i] = !any(xyz[i] .>10)
        end
    end
end

@btime foo() - function foo()

    n_walkers = 1000
    n_steps = 1000
    xyz = [@MVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    steps = [@MVector zeros(3) for _ in 1:n_walkers]; # direction of step
    step_size = 4
    alive_walkers = trues(length(xyz))

    for i in 1:n_steps

        for i in 1:n_walkers

            alive_walkers[i] || continue 

            normalize!(randn!(steps[i])) # select random direction and normalize it
            @. steps[i] = steps[i] * step_size
            @. xyz[i] = xyz[i] + steps[i]

            if any(xyz[i] .>10)
                alive_walkers[i] = false
            end
        end
    end
end

function foo2()
    n_walkers = 1000
    n_steps = 1000
    xyz = [@SVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    step_size = 4
    alive_walkers = fill(true, length(xyz))

    for i in 1:n_steps

        for i in 1:n_walkers

            alive_walkers[i] || continue 

            xyz[i] += normalize(@SVector(randn(3)) * step_size)
            alive_walkers[i] = !any(xyz[i] .>10)
        end
    end
end

@btime foo() -   1.006 ms (2007 allocations: 78.41 KiB)

@btime foo2() -   3.535 ms (5 allocations: 24.59 KiB)

aris · January 10, 2025, 7:37pm

Not sure whether I fully understand this.
Is there a better way to make the code above run on multiple threads?

stevengj · January 10, 2025, 7:38pm

This is a bug. It should be

xyz[i] += normalize(@SVector(randn(3))) * step_size

After this fix, @btime foo(); @btime foo2(); gives me

  1.513 ms (2007 allocations: 78.41 KiB)
  1.265 ms (5 allocations: 24.59 KiB)

stevengj · January 10, 2025, 7:40pm

My only point is that you need to make single-threaded code fast before you worry about parallelism.

In this particular toy example, you won’t get any benefit from threads until you make it much larger.

aris · January 10, 2025, 7:41pm

That’s clear, I was just wondering whether I was missing something extra.
Thanks for clarifying.

sgaure · January 10, 2025, 7:46pm

I think the particular example might be a bit unfortunate in that there will hardly be any walkers left after 1000 steps. I.e. the inner loop becomes very short. This may (or may not, depending on the real program) make timings less relevant.

aris · January 10, 2025, 7:48pm

That’s the case for the real program as well. It should usually run until there are no walkers left.

aris · January 10, 2025, 7:50pm

I’m now a bit confused here. How would you save the step in case you have to take a step back if a condition is met?

Also, if there’s a random number involved in the condition for the walker to die, is there a better alternative to

if rand() < 0.1
    alive_walkers[i] = false
end

?
I was thinking it could be nice to repurpose the randn numbers from the step, but not sure whether converting randn to rand would be efficient enough.

abraemer · January 10, 2025, 9:25pm

I think you should reconsider your parallelization strategy in general. Since parallelization has overhead it usually is best to do it at the highest level possible. In your case all walkers are independent so I would recommend to keep foo() purely single threaded (and optimize as much as possible). Then to parallelize just have multiple threads each running their own foo() and merge the results afterwards. That also is a bit nicer in terms of code organisation because it separates the core logic/computation from the parallelization strategy.

Edit: also did you check that this approach really is faster than just simulating each particle on its own? Depending on how many n_walkers you have and how quickly they did, the current approach could be very inefficient because you need to scan the whole array in each step.

aris · January 10, 2025, 9:26pm

Makes sense, thanks for the tip!

Topic		Replies	Views
Simple random walk performance tips Performance	6	1399	May 21, 2021
Writing more optimal Julia code New to Julia	10	1789	August 10, 2017
Allocations when creating array Performance parallel , multithreading , memory-allocation , arrays	16	378	June 27, 2023
Specific help regarding allocations and performance Numerics	15	575	April 14, 2020
Performance help for first passage time simulation Performance	24	587	March 22, 2023

Optimizing a simple random walk simulation

Related topics