# Reordering for memory contiguity/memory access

I am simulating a “mass” that obeys a law of motion. The algorithm I use to discretize to a finite grid has the property that each cell is distributed to `K` others after being multiplied by something else. Formally, each x z_k is added to some y_{p_k} according to “patterns” `p` (ie collections of indices).

There is an MWE below using `SVector`, but that’s just for the purposes of an example, generally my type supports the relevant methods for `*`, `zero`, `+`. `K` is known statically.

This is not unlike a generalization of a sparse matrix multiplication, but with a very special index structure. I am looking for suggestions to improve it; specifically, is there a simple way to calculate a permutation of indexes in `xs`/`ys` so that the memory access for `ys` is improved?

``````using StatsBase, StaticArrays, BenchmarkTools

N = 100
K = 4
M = 3

patterns = [SVector{K}(sample(1:N, 4; replace = false)...) for _ in 1:N]
xs = rand(SVector{M,Float64}, N)
ys = similar(xs)
zs = rand(SVector{K,Float64})

function combine_forward!(ys, xs::Vector{T},
patterns::Vector{SVector{K,Int}},
zs::SVector{K})  where {T,K}
N = length(xs)
fill!(ys, zero(T))
@inbounds for (x, pattern) in zip(xs, patterns)
for (j, z) in zip(pattern, zs)
ys[j] += x * z
end
end
ys
end

@btime combine_forward!(ys, xs, patterns, zs);
``````

If you ignore the scaling and represent the accumulation as a sparse matvec, Cuthill McKee gives a bandwidth-reducing permutation of the sparse matrix. It might improve data locality (though this isn’t what it’s designed for).

2 Likes

Thanks, that could improve things (also, I did not realize that the proper term is data locality). I found this topic which mentions some packages implementing Cluthill-McKee, but I am unsure which one of those would make it easiest to extract the permutation itself.

Also, I am wondering if instead of specifying targets in `patterns`, I should instead specify “sources” — ie effectively the sparsity pattern of the transpose. That would allow me to parallelize the outer loop without worrying about concurrent writes.

2 Likes

The “source” pattern would certainly another option. Without knowing more about the sparsity of the access pattern, it’s hard to say which would be more effective for your problem.

I just experimented briefly with CuthillMcKee.jl, but the permutation is pretty easy to extract (and is only a small precomputation if the pattern is fixed in time). .

1 Like