Perf help: Random Walks for BoundaryValueProblems

Hey all :wave:
I started working on some code for a class project and I’m proud to say I took it down from my professor’s 300s to sub 6s :smiley: . I am now trying to do the fastest serial version of it before making it parallel, and wanted to throw it here in case anyone has any ideas on any low hanging fruit.

Additionally, if people come up with good parallel GPU strategies that are smarter than a split iteration mapreduce, I’d be interested to hear about it.

The code solves a Boundary Value Problem for the diffusion equation on a rectangular grid by

  1. Starting a random walker in the interior of a grid (with known boundary values) that moves to North/East/South/West at every step in time and keeping track of how many times it visits each interior grid.
  2. Once the boundary is reached, a given function is evaluated at the boundary point and propagated back to all the points visited, as a mulitple of how many times it was visited by a random walker.
  3. Repeating this process until k random walkers have visited every point in the interior grid.

My prof’s approach was to

  1. push! into a vector the indexes that the random walker had visited, and then
  2. Visit the matrix one index at a time and update the corresponding entry
    I decided to go with the non-allocating matrix approach because I wanted to throgh some AVX512 at it and because I didn’t want to find the allocator when going parallel (The original simulation did 4.5G allocations and about 350GBs of total allocations o.0)

My code is available at

And the function which takes ~100% of the compute (and is now non-allocating) is here:

mutable struct RandomEnsemble{T,F} <: AbstractRandomWalkerBVP
    const f::F
    const bitmat::BitMatrix
    # TODO - make these subarrays
    #const view_bitmat::AbstractArray{S, 2}

function trajectory!(re::RandomEnsemble{T,F}, i, j) where {T,F}
    n, m = size(re.sol)
    !valid(re, i, j) && return

    top = i
    bot = i
    left = j
    right = j
    while valid(re, i, j)
        i, j = walk!(i, j)
        top = min(i, top)
        bot = max(i, bot)
        right = max(j, right)
        left = min(j, left)
        @inbounds re.temp_walkers[i, j] += 1

    scalar = re.f(i, j)
    myzero = zero(eltype(T))
    @turbo for j in left:right
        for i in top:bot
            re.walkers[i, j] += re.temp_walkers[i, j]
            re.sol[i, j] += re.temp_walkers[i, j] * scalar
            re.temp_walkers[i, j] = myzero

The one optimization which I managed to squeeze in so far is to keep track of the rectangular region the walker has visited (that’s what the top/bot and min(i, top) is doing) so that only the necessary elements are being updated on the matrix traversal.
Note: The function assumes that zero’d buffers are given to it re.temp_walkers and thus must zero it out again at the end.

To run the benchmark yourself, you will need some local data, the script here

should setup everything accordingly.

1 Like

Right now, I believe all your time will be in the random walk part because changes to i will change the row of temp_walkers which will be a cache miss. If instead, you change your data structure to a blocked layout (e.g. store a matrix of 8x8 blocks), you will be able to walk in both directions without dramatically changing memory location.

You might be able to get a similar effect with prefetching.

1 Like

Nice, blocking and prefetching were next on my list (maybe swapping out the RNG for another one), so it’s nice to hear some confirmation along those lines. Thanks!