Hey all
I started working on some code for a class project and I’m proud to say I took it down from my professor’s 300s to sub 6s . I am now trying to do the fastest serial version of it before making it parallel, and wanted to throw it here in case anyone has any ideas on any low hanging fruit.
Additionally, if people come up with good parallel GPU strategies
that are smarter than a split iteration mapreduce, I’d be interested to hear about it.
The code solves a Boundary Value Problem for the diffusion equation on a rectangular grid by
 Starting a random walker in the interior of a grid (with known boundary values) that moves to North/East/South/West at every step in time and keeping track of how many times it visits each interior grid.
 Once the boundary is reached, a given function is evaluated at the boundary point and propagated back to all the points visited, as a mulitple of how many times it was visited by a random walker.
 Repeating this process until
k
random walkers have visited every point in the interior grid.
My prof’s approach was to

push!
into a vector the indexes that the random walker had visited, and then  Visit the matrix one index at a time and update the corresponding entry
I decided to go with the nonallocating matrix approach because I wanted to throgh some AVX512 at it and because I didn’t want to find the allocator when going parallel (The original simulation did 4.5G allocations and about 350GBs of total allocations o.0)
My code is available at
And the function which takes ~100% of the compute (and is now nonallocating) is here:
mutable struct RandomEnsemble{T,F} <: AbstractRandomWalkerBVP
const f::F
const bitmat::BitMatrix
# TODO  make these subarrays
#const view_bitmat::AbstractArray{S, 2}
walkers::Matrix{T}
temp_walkers::Matrix{T}
sol::Matrix{T}
end
function trajectory!(re::RandomEnsemble{T,F}, i, j) where {T,F}
n, m = size(re.sol)
!valid(re, i, j) && return
top = i
bot = i
left = j
right = j
while valid(re, i, j)
i, j = walk!(i, j)
top = min(i, top)
bot = max(i, bot)
right = max(j, right)
left = min(j, left)
@inbounds re.temp_walkers[i, j] += 1
end
scalar = re.f(i, j)
myzero = zero(eltype(T))
@turbo for j in left:right
for i in top:bot
re.walkers[i, j] += re.temp_walkers[i, j]
re.sol[i, j] += re.temp_walkers[i, j] * scalar
re.temp_walkers[i, j] = myzero
end
end
end
The one optimization which I managed to squeeze in so far is to keep track of the rectangular region the walker has visited (that’s what the top/bot
and min(i, top)
is doing) so that only the necessary elements are being updated on the matrix traversal.
Note: The function assumes that zero’d buffers are given to it re.temp_walkers
and thus must zero it out again at the end.
To run the benchmark yourself, you will need some local data, the script here
should setup everything accordingly.