Perf help: Random Walks for BoundaryValueProblems

miguelraz · March 14, 2023, 7:45pm

Hey all
I started working on some code for a class project and I’m proud to say I took it down from my professor’s 300s to sub 6s . I am now trying to do the fastest serial version of it before making it parallel, and wanted to throw it here in case anyone has any ideas on any low hanging fruit.

Additionally, if people come up with good parallel GPU strategies that are smarter than a split iteration mapreduce, I’d be interested to hear about it.

The code solves a Boundary Value Problem for the diffusion equation on a rectangular grid by

Starting a random walker in the interior of a grid (with known boundary values) that moves to North/East/South/West at every step in time and keeping track of how many times it visits each interior grid.
Once the boundary is reached, a given function is evaluated at the boundary point and propagated back to all the points visited, as a mulitple of how many times it was visited by a random walker.
Repeating this process until k random walkers have visited every point in the interior grid.

My prof’s approach was to

push! into a vector the indexes that the random walker had visited, and then
Visit the matrix one index at a time and update the corresponding entry
I decided to go with the non-allocating matrix approach because I wanted to throgh some AVX512 at it and because I didn’t want to find the allocator when going parallel (The original simulation did 4.5G allocations and about 350GBs of total allocations o.0)

My code is available at

And the function which takes ~100% of the compute (and is now non-allocating) is here:

mutable struct RandomEnsemble{T,F} <: AbstractRandomWalkerBVP
    const f::F
    const bitmat::BitMatrix
    # TODO - make these subarrays
    #const view_bitmat::AbstractArray{S, 2}
    walkers::Matrix{T}
    temp_walkers::Matrix{T}
    sol::Matrix{T}
end

function trajectory!(re::RandomEnsemble{T,F}, i, j) where {T,F}
    n, m = size(re.sol)
    !valid(re, i, j) && return

    top = i
    bot = i
    left = j
    right = j
    while valid(re, i, j)
        i, j = walk!(i, j)
        top = min(i, top)
        bot = max(i, bot)
        right = max(j, right)
        left = min(j, left)
        @inbounds re.temp_walkers[i, j] += 1
    end

    scalar = re.f(i, j)
    myzero = zero(eltype(T))
    @turbo for j in left:right
        for i in top:bot
            re.walkers[i, j] += re.temp_walkers[i, j]
            re.sol[i, j] += re.temp_walkers[i, j] * scalar
            re.temp_walkers[i, j] = myzero
        end
    end
end

The one optimization which I managed to squeeze in so far is to keep track of the rectangular region the walker has visited (that’s what the top/bot and min(i, top) is doing) so that only the necessary elements are being updated on the matrix traversal.
Note: The function assumes that zero’d buffers are given to it re.temp_walkers and thus must zero it out again at the end.

To run the benchmark yourself, you will need some local data, the script here

github.com

miguelraz/RandomWalkBVP.jl/blob/main/test/runtests.jl

using RandomWalkBVP
using HDF5
using Random
using Test

#using Tullio
#using Folds
#using Random

#@testset "RandomWalkBVP" showtiming=true begin
@testset "Construction" showtiming = true begin
    mat = h5read("hdf5mat", "placa")
    f(x, y) = 1 / (y + 1)
    re = RandomEnsemble(mat, f)
    @test all(==(0), re.temp_walkers)
    @test all(==(0), re.sol)
    @test all(==(0), re.walkers)
end
@testset "Basic functions" showtiming = true begin

This file has been truncated. show original

should setup everything accordingly.

Oscar_Smith · March 14, 2023, 8:11pm

Right now, I believe all your time will be in the random walk part because changes to i will change the row of temp_walkers which will be a cache miss. If instead, you change your data structure to a blocked layout (e.g. store a matrix of 8x8 blocks), you will be able to walk in both directions without dramatically changing memory location.

You might be able to get a similar effect with prefetching.

miguelraz · March 14, 2023, 8:27pm

Nice, blocking and prefetching were next on my list (maybe swapping out the RNG for another one), so it’s nice to hear some confirmation along those lines. Thanks!

Topic		Replies	Views
Optimizing a simple random walk simulation Performance	41	458	January 13, 2025
Simple random walk performance tips Performance	6	1400	May 21, 2021
Parallel implementation is not very effective General Usage question , performance , parallel , multithreading	19	790	October 8, 2022
Diffusion by using CUDA.jl, DifferentialEquations and DiffEqGPU GPU question , diffeq	6	1235	March 29, 2022
How to solve implicitly a 2D Heat diffusion equation / ParallelStencil / DifferentialEquation / Modelling & Simulations question	17	2386	May 13, 2022

Perf help: Random Walks for BoundaryValueProblems

Related topics