Reordering for memory contiguity/memory access

I am simulating a “mass” that obeys a law of motion. The algorithm I use to discretize to a finite grid has the property that each cell is distributed to K others after being multiplied by something else. Formally, each x z_k is added to some y_{p_k} according to “patterns” p (ie collections of indices).

There is an MWE below using SVector, but that’s just for the purposes of an example, generally my type supports the relevant methods for *, zero, +. K is known statically.

This is not unlike a generalization of a sparse matrix multiplication, but with a very special index structure. I am looking for suggestions to improve it; specifically, is there a simple way to calculate a permutation of indexes in xs/ys so that the memory access for ys is improved?

using StatsBase, StaticArrays, BenchmarkTools

N = 100
K = 4
M = 3

patterns = [SVector{K}(sample(1:N, 4; replace = false)...) for _ in 1:N]
xs = rand(SVector{M,Float64}, N)
ys = similar(xs)
zs = rand(SVector{K,Float64})

function combine_forward!(ys, xs::Vector{T},
                          patterns::Vector{SVector{K,Int}},
                          zs::SVector{K})  where {T,K}
    N = length(xs)
    fill!(ys, zero(T))
    @inbounds for (x, pattern) in zip(xs, patterns)
        for (j, z) in zip(pattern, zs)
            ys[j] += x * z
        end
    end
    ys
end


@btime combine_forward!(ys, xs, patterns, zs);

What about CuthillMcKee ordering?

If you ignore the scaling and represent the accumulation as a sparse matvec, Cuthill McKee gives a bandwidth-reducing permutation of the sparse matrix. It might improve data locality (though this isn’t what it’s designed for).

Thanks, that could improve things (also, I did not realize that the proper term is data locality). I found this topic which mentions some packages implementing Cluthill-McKee, but I am unsure which one of those would make it easiest to extract the permutation itself.

Also, I am wondering if instead of specifying targets in patterns, I should instead specify “sources” — ie effectively the sparsity pattern of the transpose. That would allow me to parallelize the outer loop without worrying about concurrent writes.

The “source” pattern would certainly another option. Without knowing more about the sparsity of the access pattern, it’s hard to say which would be more effective for your problem.

I just experimented briefly with CuthillMcKee.jl, but the permutation is pretty easy to extract (and is only a small precomputation if the pattern is fixed in time). .