Contiguous Read Non-Contiguous Write vs Non-Contiguous Read Contiguous Write Performance

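For an out-of-place transpose like this, neither loop order gives good cache-line utilization: whichever array you walk with a large stride touches a new cache line on almost every access. You generally want to “tile” (block) the loops instead, either by tuning the tile size to your cache or by using a cache-oblivious algorithm.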
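To make the two options concrete, here is a rough sketch of both loop structures, not a tuned implementation. It assumes a row-major matrix stored in a flat list; the names `transpose_tiled`, `transpose_recursive`, `tile`, and `cutoff` are made up for illustration, and the default sizes are placeholders rather than measured values. Python is used only for readability: interpreter overhead will hide any cache effect here, so the point is the loop shape, which carries over to a compiled language.

```python
def transpose_tiled(src, rows, cols, tile=32):
    """Out-of-place transpose of a row-major rows x cols matrix (flat list).

    Loops are blocked into tile x tile pieces so that reads from src and
    writes to dst both stay within a small working set. `tile` is a
    placeholder to be tuned for the cache in question.
    """
    dst = [0] * (rows * cols)
    for i0 in range(0, rows, tile):
        i1 = min(i0 + tile, rows)          # clip the last row block
        for j0 in range(0, cols, tile):
            j1 = min(j0 + tile, cols)      # clip the last column block
            for i in range(i0, i1):
                for j in range(j0, j1):
                    # dst is cols x rows, also row-major
                    dst[j * rows + i] = src[i * cols + j]
    return dst


def transpose_recursive(src, rows, cols, cutoff=16):
    """Cache-oblivious variant: split the longer dimension in half
    recursively until the block is small, then copy it directly.
    No cache size appears in the algorithm; `cutoff` only limits
    recursion overhead.
    """
    cutoff = max(1, cutoff)                # guarantees the recursion terminates
    dst = [0] * (rows * cols)

    def rec(i0, i1, j0, j1):
        di, dj = i1 - i0, j1 - j0
        if di <= cutoff and dj <= cutoff:
            for i in range(i0, i1):
                for j in range(j0, j1):
                    dst[j * rows + i] = src[i * cols + j]
        elif di >= dj:
            mid = i0 + di // 2             # split the row range
            rec(i0, mid, j0, j1)
            rec(mid, i1, j0, j1)
        else:
            mid = j0 + dj // 2             # split the column range
            rec(i0, i1, j0, mid)
            rec(i0, i1, mid, j1)

    rec(0, rows, 0, cols)
    return dst
```

Both functions should give the same result as the naive double loop; only the order in which elements are visited changes.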

(Optimizing transposition is a heavily studied problem, with a fair amount of literature and code out there if you search.)

See also e.g. “Function on matrix transpose and performance” (post #4 by stevengj) and the links in that thread.
