Reordering for memory contiguity/memory access

I am simulating a “mass” that obeys a law of motion. The algorithm I use to discretize it to a finite grid has the property that each cell is distributed to K others after being multiplied by a weight. Formally, each x * z_k is added to y_{p_k} according to a “pattern” p (i.e. a collection of indices).

There is an MWE below using SVector, but that is just for the sake of an example; in general my type supports the relevant methods (*, zero, +). K is known statically.

This is not unlike a generalization of sparse matrix–vector multiplication, but with a very special index structure. I am looking for suggestions to improve it; specifically, is there a simple way to calculate a permutation of the indices in xs/ys so that the memory access pattern for ys is improved?

using StatsBase, StaticArrays, BenchmarkTools

N = 100
K = 4
M = 3

patterns = [SVector{K}(sample(1:N, K; replace = false)...) for _ in 1:N]
xs = rand(SVector{M,Float64}, N)
ys = similar(xs)
zs = rand(SVector{K,Float64})

function combine_forward!(ys, xs::Vector{T},
                          patterns::Vector{SVector{K,Int}},
                          zs::SVector{K}) where {T,K}
    fill!(ys, zero(T))
    @inbounds for (x, pattern) in zip(xs, patterns)
        # scatter x into the K targets given by `pattern`, scaled by zs
        for (j, z) in zip(pattern, zs)
            ys[j] += x * z
        end
    end
    ys
end


@btime combine_forward!($ys, $xs, $patterns, $zs);  # interpolate globals for accurate timing

What about Cuthill-McKee ordering?

If you ignore the scaling and represent the accumulation as a sparse matvec, Cuthill-McKee gives a bandwidth-reducing permutation of the sparse matrix. It might improve data locality (though that isn’t what it’s designed for).
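The idea is simple enough to sketch by hand, if only to show what the permutation looks like. A minimal version, assuming hypothetical helper names `pattern_matrix` and `rcm` (mine, not from any package): build the symmetric sparsity pattern implied by `patterns`, then do a breadth-first traversal that seeds at low-degree vertices, visits low-degree neighbours first, and reverses the resulting order.

```julia
using SparseArrays

# Build the (symmetrized) Boolean sparsity pattern implied by `patterns`:
# entry (i, j) is set when source i writes into target j.
function pattern_matrix(patterns, N)
    I = Int[]; J = Int[]
    for (i, p) in enumerate(patterns), j in p
        push!(I, i); push!(J, j)
    end
    A = sparse(I, J, ones(Bool, length(I)), N, N, |)
    A .| A'   # symmetrize: Cuthill-McKee expects a symmetric pattern
end

# Reverse Cuthill-McKee via BFS; returns a permutation of 1:N.
function rcm(A::SparseMatrixCSC)
    N = size(A, 1)
    deg = [length(nzrange(A, j)) for j in 1:N]
    visited = falses(N)
    order = Int[]
    for start in sortperm(deg)              # seed each component at a low-degree vertex
        visited[start] && continue
        visited[start] = true
        queue = [start]
        while !isempty(queue)
            v = popfirst!(queue)
            push!(order, v)
            nbrs = [u for u in rowvals(A)[nzrange(A, v)] if !visited[u]]
            sort!(nbrs, by = u -> deg[u])   # low-degree neighbours first
            for u in nbrs
                visited[u] = true
                push!(queue, u)
            end
        end
    end
    reverse!(order)                         # the "reverse" in reverse Cuthill-McKee
end
```

This is just to illustrate the mechanism; a package implementation will be more careful about tie-breaking and starting-vertex selection.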


Thanks, that could improve things (also, I did not realize that the proper term is “data locality”). I found this topic, which mentions some packages implementing Cuthill-McKee, but I am unsure which of them would make it easiest to extract the permutation itself.
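Whichever package it comes from, once I have a permutation `perm` of `1:N`, applying it should be just a relabeling. My current sketch (the name `permute_problem` is mine):

```julia
# Relabel the problem with a permutation `perm` (perm[new] = old), e.g. one
# obtained from reverse Cuthill-McKee. Target indices in `patterns` are
# rewritten with the inverse permutation so that writes into `ys` cluster.
function permute_problem(xs, patterns, perm)
    invp = invperm(perm)                    # invp[old] = new
    xs_p = xs[perm]                         # process sources in the new order
    patterns_p = [map(j -> invp[j], p) for p in patterns[perm]]
    xs_p, patterns_p, invp
end
```

After running `combine_forward!` on the permuted data, the result in the original labeling is recovered as `ys_p[invp]`.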

Also, I am wondering if, instead of specifying targets in patterns, I should specify “sources”, i.e. effectively the sparsity pattern of the transpose. That would allow me to parallelize the outer loop without worrying about concurrent writes.


The “source” pattern would certainly be another option. Without knowing more about the sparsity of the access pattern, it’s hard to say which would be more effective for your problem.

I just experimented briefly with CuthillMcKee.jl, but the permutation is pretty easy to extract (and it is only a small precomputation if the pattern is fixed in time).
