Optimizing a simple random walk simulation

Hello everyone,

Suppose we have a simple random walk simulation, where a collection of walkers each take a step in a random direction at each time step. For each walker at each step, if some condition is met, the walker dies and is removed from the simulation.

I was trying to come up with some efficient julia code for this, and ended up with the following:

using LinearAlgebra, Random, StaticArrays

function foo()

    n_walkers = 1000
    n_steps = 1000
    xyz = [@MVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    steps = [@MVector zeros(3) for _ in 1:n_walkers]; # direction of step
    step_size = 4
    alive_walkers = trues(length(xyz))

    @time for _ in 1:n_steps

        Threads.@threads for i in findall(alive_walkers)

            normalize!(randn!(steps[i])) # select random direction and normalize it
            @. steps[i] = steps[i] * step_size
            @. xyz[i] = xyz[i] + steps[i]

            if any(xyz[i] .>10)
                alive_walkers[i] = false
            end
        end
    end
end

I was wondering whether there would be any obvious ways to improve the code above and minimize the allocations.

No point in using @MVector here, I would just use @SVector, which should be faster.

Then get rid of the @. in your loop, just do:

xyz[i] += (steps[i] = normalize(@SVector(randn(3)) * step_size))

which should be fast and non-allocating for SVectors. (For SVector, you don’t have to worry about doing “in-place” operations, in the same way that you wouldn’t worry about in-place operations on scalars.)
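A minimal sketch (my illustration, not from the thread) of what this value semantics looks like in practice:

```julia
using StaticArrays, LinearAlgebra

# SVectors behave like scalars: every operation returns a new SVector,
# and small fixed-size results stay on the stack, so none of this
# heap-allocates.
v = @SVector randn(3)
w = normalize(v) * 4.0   # a brand-new SVector; `v` itself is unchanged
norm(w)                  # ≈ 4.0: a unit direction scaled by the step size
```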

Note that this allocates a new array for the result of findall on every outer-loop step. Probably faster to do:

Threads.@threads for i in 1:n_walkers
    alive_walkers[i] || continue 
    ...
end

You don’t even need to store the steps array at all, since you’re generating an entirely new step each time. Better to just ditch the array and do:

xyz[i] += normalize(@SVector(randn(3)) * step_size)

unless you need to save the most recent step for some other purpose (and even then, you could save the step only on the last outer iteration, or only when the walker “dies”).
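One way to sketch "save the step only when the walker dies" (the names here, like `walk_once!` and `last_step`, are illustrative, not from the thread):

```julia
using StaticArrays, LinearAlgebra

# Illustrative sketch: keep the step in a local variable and record it in a
# separate `last_step` array only when the walker dies, instead of updating
# a full `steps` array on every iteration.
function walk_once!(xyz, alive, last_step, i, step_size)
    step = normalize(@SVector(randn(3))) * step_size
    xyz[i] += step
    if any(xyz[i] .> 10)
        alive[i] = false
        last_step[i] = step   # remember only the step that killed the walker
        # xyz[i] -= step      # or undo it, if the walk must not cross the boundary
    end
    return nothing
end
```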

Thank you for the reply Steven,

I tried your version as follows:

function foo()

    n_walkers = 1000
    n_steps = 1000
    xyz = [@SVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    steps = [@SVector zeros(3) for _ in 1:n_walkers]; # direction of step
    step_size = 4
    alive_walkers = trues(length(xyz))

    @time for _ in 1:n_steps

        Threads.@threads for i in 1:n_walkers

            alive_walkers[i] || continue 

            @views begin
                #=normalize!(randn!(steps[i])) # select random direction and normalize it=#
                #=@. steps[i] = steps[i] * step_size=#
                #=@. xyz[i] = xyz[i] + steps[i]=#
                xyz[i] += (steps[i] = normalize(@SVector(randn(3)) * step_size))

                if any(xyz[i] .>10)
                    alive_walkers[i] = false
                end
            end
        end
    end
end

And it gives 0.014276 seconds (42.00 k allocations: 4.272 MiB), while the previous version gives 0.011462 seconds (42.00 k allocations: 4.272 MiB), so the allocations are exactly the same.

Not sure whether I’m missing something here.

this method leads to 0.015833 seconds (343.39 k allocations: 13.470 MiB), which is pretty much the same in time, despite the extra allocations.

Don’t use @views. Just use

xyz[i] += normalize(@SVector(randn(3)) * step_size)
alive_walkers[i] = !any(xyz[i] .>10)

Also,

alive_walkers = trues(length(xyz))

allocates a BitVector, which is slower to read and write element-wise. I would just allocate it as a Vector{Bool}:

alive_walkers = fill(true, length(xyz))
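For intuition (my illustration, not from the thread): a BitVector packs 8 flags per byte, so it is more compact, but each element access must locate and mask a bit inside a shared 64-bit chunk, while a Vector{Bool} spends a whole byte per flag and each access is a plain byte load or store.

```julia
# A BitVector stores 1000 flags as packed bits (~125 bytes, in UInt64
# chunks); a Vector{Bool} uses one byte per flag (1000 bytes).
bits  = trues(1000)        # BitVector
bytes = fill(true, 1000)   # Vector{Bool}

Base.summarysize(bits) < Base.summarysize(bytes)  # compact, but...

# ...bits[i] needs shift-and-mask work on a UInt64 chunk, while
# bytes[i] compiles to a single byte load/store.
bytes[500] = false
bits[500]  = false
```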
function foo()

    n_walkers = 1000
    n_steps = 1000
    xyz = [@SVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    step_size = 4
    alive_walkers = fill(true, length(xyz))

    @time for _ in 1:n_steps

        Threads.@threads for i in 1:n_walkers

            alive_walkers[i] || continue 

            xyz[i] += normalize(@SVector(randn(3)) * step_size)
            alive_walkers[i] = !any(xyz[i] .>10)
        end
    end
end

foo()

results in 0.010088 seconds (42.00 k allocations: 4.272 MiB)

I tested with or without the views and it doesn’t seem to make any difference.

Oh that’s interesting, didn’t know there would be any difference between these two.

I think your problem is too small to use threads, and your entire timing is dominated by the book-keeping necessary to set up and synchronize threads.

If I get rid of the @time and the @threads, and just run @btime foo() to time your entire function, it takes only 0.7ms. With @threads, it takes 16ms.

Thank you for the tips.

Indeed multithreading this one doesn’t do much, but this is just a toy example.
Normally I’d like to run something similar for 1e8 walkers on 72 threads, where it would make a lot more sense.

Regarding the other tips above, removing the “steps” array does make the code simpler and more elegant, but it seems the only bit which actually reduces allocations (from 44k to 42k) is this one

(please correct me if I’m wrong)

To the contrary, it does quite a lot — it completely distorts your benchmarking, because the threads overhead dominates your timing.

(Usually you want to optimize your single-threaded code before you worry much about threading, and then for threading you can focus on parallelism.)


Oh that makes sense!

Here are some numbers:

function foo()

    n_walkers = 1000
    n_steps = 1000
    xyz = [@MVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    steps = [@MVector zeros(3) for _ in 1:n_walkers]; # direction of step
    step_size = 4
    alive_walkers = trues(length(xyz))

    for _ in 1:n_steps

        for i in 1:n_walkers

            alive_walkers[i] || continue 

            normalize!(randn!(steps[i])) # select random direction and normalize it
            @. steps[i] = steps[i] * step_size
            @. xyz[i] = xyz[i] + steps[i]

            if any(xyz[i] .>10)
                alive_walkers[i] = false
            end
        end
    end
end

function foo2()
    n_walkers = 1000
    n_steps = 1000
    xyz = [@SVector zeros(3) for _ in 1:n_walkers]; # Coordinates of each walker
    step_size = 4
    alive_walkers = fill(true, length(xyz))

    for _ in 1:n_steps

        for i in 1:n_walkers

            alive_walkers[i] || continue 

            xyz[i] += normalize(@SVector(randn(3)) * step_size)
            alive_walkers[i] = !any(xyz[i] .>10)
        end
    end
end

@btime foo() -   1.006 ms (2007 allocations: 78.41 KiB)

@btime foo2() -   3.535 ms (5 allocations: 24.59 KiB)

Not sure whether I fully understand this.
Is there a better way to make the code above run on multiple threads?

This is a bug: with step_size inside the normalize call, the normalization cancels the scaling and every step has unit length. It should be

xyz[i] += normalize(@SVector(randn(3))) * step_size

After this fix, @btime foo(); @btime foo2(); gives me

  1.513 ms (2007 allocations: 78.41 KiB)
  1.265 ms (5 allocations: 24.59 KiB)

My only point is that you need to make single-threaded code fast before you worry about parallelism.

In this particular toy example, you won’t get any benefit from threads until you make it much larger.


That’s clear, I was just wondering whether I was missing something extra.
Thanks for clarifying.

I think the particular example might be a bit unfortunate in that there will hardly be any walkers left after 1000 steps. I.e. the inner loop becomes very short. This may (or may not, depending on the real program) make timings less relevant.

That’s the case for the real program as well. It should usually run until there are no walkers left.

I’m now a bit confused here. How would you save the step in case you have to take a step back if a condition is met?

Also, if there’s a random number involved in the condition for the walker to die, is there a better alternative to

if rand() < 0.1
    alive_walkers[i] = false
end

?
I was thinking it could be nice to repurpose the randn numbers from the step, but not sure whether converting randn to rand would be efficient enough.
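One hedged possibility (my sketch, not from the thread; it assumes the SpecialFunctions package for erfc): by the probability integral transform, if Z ~ Normal(0,1) then Φ(Z) is uniform on (0,1), so one component of the normal draw could double as the uniform for the death test. Whether this beats a plain rand() is doubtful, since evaluating the normal CDF is not cheap:

```julia
using SpecialFunctions  # assumed dependency, provides erfc

# Standard normal CDF: Φ(z) = (1/2) * erfc(-z / √2).
Φ(z) = erfc(-z / sqrt(2)) / 2

z = randn()
u = Φ(z)          # u is Uniform(0,1), because Φ is Z's own CDF
dies = u < 0.1    # walker dies with probability 0.1, reusing the normal draw
```

In practice a fresh rand() per walker is a single cheap RNG call and is hard to beat, so I would benchmark before committing to the reuse trick.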

I think you should reconsider your parallelization strategy in general. Since parallelization has overhead, it is usually best to do it at the highest level possible. In your case all walkers are independent, so I would recommend keeping foo() purely single-threaded (and optimizing it as much as possible). Then, to parallelize, just have multiple threads each running their own foo() and merge the results afterwards. That is also a bit nicer in terms of code organisation, because it separates the core logic/computation from the parallelization strategy.
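As a sketch of that strategy (names and chunking are illustrative; it assumes the StaticArrays setup from earlier in the thread): each task runs the plain single-threaded kernel on its own disjoint chunk of walkers, so there is no per-step synchronization, and merging is trivial because every task writes only its own indices:

```julia
using Base.Threads, StaticArrays, LinearAlgebra

# Single-threaded kernel: advance one chunk of walkers through all steps.
function simulate_chunk!(xyz, alive, chunk, n_steps, step_size)
    for _ in 1:n_steps, i in chunk
        alive[i] || continue
        xyz[i] += normalize(@SVector(randn(3))) * step_size
        alive[i] = !any(xyz[i] .> 10)
    end
end

# Top-level parallelism: one task per chunk. Walkers are independent, so
# swapping the step/walker loop order is legal and avoids per-step syncing.
function simulate!(xyz, alive, n_steps, step_size)
    chunksize = cld(length(xyz), nthreads())
    @sync for chunk in Iterators.partition(eachindex(xyz), chunksize)
        @spawn simulate_chunk!(xyz, alive, chunk, n_steps, step_size)
    end
end
```

Note that each task writes disjoint indices of alive, which is safe for a Vector{Bool} (one byte per flag) but would not be safe for a BitVector, where neighboring flags share a 64-bit chunk.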

Edit: also, did you check that this approach really is faster than just simulating each particle on its own? Depending on how many n_walkers you have and how quickly they die, the current approach could be very inefficient because you need to scan the whole array in each step.


Makes sense, thanks for the tip!