What about Cuthill-McKee ordering?
If you ignore the scaling and represent the accumulation as a sparse matvec, Cuthill-McKee gives you a bandwidth-reducing permutation of the sparse matrix. Pulling the nonzeros toward the diagonal might improve data locality as a side effect (though this isn’t what it’s designed for).
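
To make that concrete, here is a rough sketch using SciPy's `reverse_cuthill_mckee` (note: this is the *reverse* variant, which is what SciPy ships). The matrix is a made-up stand-in, a tridiagonal matrix with its rows and columns shuffled, not your actual accumulation operator, and the `bandwidth` helper is just something I wrote for illustration. It only shows that the permutation shrinks the bandwidth and that the permuted matvec gives the same result up to reordering; whether that buys any real speedup is something you'd have to measure.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee


def bandwidth(A):
    """Largest |row - col| over the stored nonzeros (hypothetical helper)."""
    coo = A.tocoo()
    return int(np.abs(coo.row - coo.col).max())


rng = np.random.default_rng(0)
n = 50

# Toy stand-in for the accumulation operator: a tridiagonal matrix whose
# rows/columns are shuffled, so the nonzeros are scattered everywhere.
T = sp.diags(
    [np.ones(n - 1), 2.0 * np.ones(n), np.ones(n - 1)],
    [-1, 0, 1],
    format="csr",
)
shuffle = rng.permutation(n)
A = T[shuffle][:, shuffle].tocsr()

# RCM works on the sparsity pattern; the pattern here is symmetric.
perm = reverse_cuthill_mckee(A, symmetric_mode=True)

# Apply the same permutation to rows and columns: B[i, j] = A[perm[i], perm[j]].
B = A[perm][:, perm].tocsr()

# The matvec is unchanged up to reordering: B @ x[perm] == (A @ x)[perm].
x = rng.standard_normal(n)
y = A @ x
y_perm = B @ x[perm]

print("bandwidth before:", bandwidth(A))
print("bandwidth after: ", bandwidth(B))
print("results match:   ", np.allclose(y_perm, y[perm]))
```

You have to permute the input vector going in and un-permute the output coming out (or just keep all your data in the permuted ordering), so the reordering is a one-time setup cost rather than something you pay per matvec.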