function permutedims_custom(Y, X)
    # Column-major layout: the inner loop walks down a column of Y,
    # so writes are contiguous while reads from X are strided.
    for idx in axes(Y, 2)
        @simd for idy in axes(Y, 1)
            @inbounds Y[idy, idx] = X[idx, idy]
        end
    end
    return Y
end
function permutedims_custom_2(Y, X)
    # Here the inner loop walks across a row of Y, so writes are
    # strided while reads from X are contiguous.
    for idx in axes(Y, 1)
        @simd for idy in axes(Y, 2)
            @inbounds Y[idx, idy] = X[idy, idx]
        end
    end
    return Y
end
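For reference, the comparison was done with something along these lines, using BenchmarkTools.jl (the array sizes here are illustrative, not the exact ones I used):

using BenchmarkTools

# Illustrative sizes only; the actual benchmark dimensions are not shown here.
X = rand(2048, 2048)
Y = Matrix{Float64}(undef, size(X, 2), size(X, 1))

@btime permutedims_custom($Y, $X)
@btime permutedims_custom_2($Y, $X)
@btime permutedims!($Y, $X, (2, 1))   # Base's in-place permutedims, for comparison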
In the above case, permutedims_custom outperforms permutedims_custom_2 by a fair margin (about 3x). The contiguous-write version consistently outperforms the contiguous-read one, even with mild customizations on top (e.g. adding offsets to build a diagonally stacked array). Is this common across all types of RAM?
For an out-of-place transpose like this, to get good cache-line utilization you want to do neither order: you generally want to “tile” the loops, either by tuning to your cache or by using a cache-oblivious algorithm.
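For example, a minimal sketch of a blocked (tiled) transpose might look like the following; the block size of 64 is an illustrative guess that you would tune to the target machine (or replace with cache-oblivious recursion):

# Sketch of a tiled out-of-place transpose. The block size is an
# assumption to be tuned, not a universally good value.
function permutedims_tiled!(Y, X; block=64)
    m, n = size(Y)
    for jb in 1:block:n, ib in 1:block:m
        # Transpose one block at a time, so both the reads from X and
        # the writes to Y stay within a small, cache-resident tile.
        for j in jb:min(jb + block - 1, n)
            @simd for i in ib:min(ib + block - 1, m)
                @inbounds Y[i, j] = X[j, i]
            end
        end
    end
    return Y
end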
(Optimizing transposition is a heavily studied problem, with a fair amount of literature and code out there if you search.)
See also e.g. Function on matrix transpose and performance - #4 by stevengj and the links in that thread.
It depends on the cache architecture, size, and configuration, on how prefetching and instructions are scheduled, and on a whole lot of other details. Without knowing these details, one would guess that contiguous reads are better than contiguous writes: reads must complete before the data can be operated on, whereas writes can be postponed, and memory operations typically occur in whole cache lines, often 64 contiguous bytes. But many details can interfere with this simple picture. In practice one would want to tune the algorithm to the system at hand.
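As an illustration of such tuning, one could time the tiled version sketched earlier in the thread at several block sizes and keep the fastest on the machine in question (sizes and block candidates below are assumptions):

using BenchmarkTools

# Hypothetical tuning loop: benchmark permutedims_tiled! (sketched above)
# at a few block sizes and print the timings for this machine.
X = rand(4096, 4096)
Y = similar(X)   # square, so the transposed shape is the same
for block in (16, 32, 64, 128, 256)
    t = @belapsed permutedims_tiled!($Y, $X; block=$block)
    println("block = $block: $(round(t * 1e3; digits=2)) ms")
end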
Thank you so much. I repeated the experiments with various array sizes.
The 3x speedup turned out to be a special case where size(Y,1) was on the order of the cache-line width and size(Y,2) was much larger, so the layout essentially acted as its own tiling.
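Concretely, the shapes looked something like this (sizes illustrative): with 8 Float64 rows, each column of Y is one 64-byte cache line, so the contiguous-write loop effectively works one cache line at a time.

# Illustrative shape: size(Y, 1) == 8 Float64s == 64 bytes, i.e. about one
# cache line per column of Y, so the contiguous writes of permutedims_custom
# behave as if the loop were already tiled.
X = rand(100_000, 8)
Y = Matrix{Float64}(undef, 8, 100_000)
permutedims_custom(Y, X)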
Swapping the input shapes favored permutedims_custom_2, but permutedims_custom was then only about 1.5x slower (since other optimizations are still possible). The speed of permutedims! is comparable to permutedims_custom in both cases, so I believe its logic is similar (not as optimized as transpose).