function permutedims_custom(Y, X)
    # Column-major layout: the inner loop walks down a column of Y,
    # so writes are contiguous while reads from X are strided.
    for idx in axes(Y, 2)
        @simd for idy in axes(Y, 1)
            @inbounds Y[idy, idx] = X[idx, idy]
        end
    end
    return Y
end
function permutedims_custom_2(Y, X)
    # Here the inner loop walks across a row of Y, so writes are
    # strided while reads from X are contiguous.
    for idx in axes(Y, 1)
        @simd for idy in axes(Y, 2)
            @inbounds Y[idx, idy] = X[idy, idx]
        end
    end
    return Y
end
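For reference, the comparison was done with something along these lines, using BenchmarkTools.jl (the array sizes here are illustrative, not the exact ones I used):

using BenchmarkTools

# Illustrative sizes only; the actual benchmark dimensions are not shown here.
X = rand(2048, 2048)
Y = Matrix{Float64}(undef, size(X, 2), size(X, 1))

@btime permutedims_custom($Y, $X)
@btime permutedims_custom_2($Y, $X)
@btime permutedims!($Y, $X, (2, 1))   # Base's in-place permutedims, for comparison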
In the above case, permutedims_custom outperforms permutedims_custom_2 by a fair margin (about 3x). The contiguous-write version consistently outperforms the contiguous-read one, even with mild customizations on top (e.g. adding offsets to build a diagonally stacked array). Is this common across all types of RAM?
For an out-of-place transpose like this, to get good cache-line utilization you want to do neither order: you generally want to “tile” the loops, either by tuning to your cache or by using a cache-oblivious algorithm.
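For example, a minimal sketch of a blocked (tiled) transpose might look like the following; the block size of 64 is an illustrative guess that you would tune to the target machine (or replace with cache-oblivious recursion):

# Sketch of a tiled out-of-place transpose. The block size is an
# assumption to be tuned, not a universally good value.
function permutedims_tiled!(Y, X; block=64)
    m, n = size(Y)
    for jb in 1:block:n, ib in 1:block:m
        # Transpose one block at a time, so both the reads from X and
        # the writes to Y stay within a small, cache-resident tile.
        for j in jb:min(jb + block - 1, n)
            @simd for i in ib:min(ib + block - 1, m)
                @inbounds Y[i, j] = X[j, i]
            end
        end
    end
    return Y
end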
(Optimizing transposition is a heavily studied problem, with a fair amount of literature and code out there if you search.)
See also e.g. Function on matrix transpose and performance - #4 by stevengj and the links in that thread.
It depends on the cache architecture, size, and configuration, on how prefetching and instructions are scheduled, and on a whole lot of other details. Without knowing these details, one would guess that contiguous reads are better than contiguous writes: reads must complete before the data can be operated on, whereas writes can be postponed, and memory operations typically occur in whole cache lines, often 64 contiguous bytes. But many details can interfere with this simple picture. In practice one would want to tune the algorithm to the system at hand.
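As an illustration of such tuning, one could time the tiled version sketched earlier in the thread at several block sizes and keep the fastest on the machine in question (sizes and block candidates below are assumptions):

using BenchmarkTools

# Hypothetical tuning loop: benchmark permutedims_tiled! (sketched above)
# at a few block sizes and print the timings for this machine.
X = rand(4096, 4096)
Y = similar(X)   # square, so the transposed shape is the same
for block in (16, 32, 64, 128, 256)
    t = @belapsed permutedims_tiled!($Y, $X; block=$block)
    println("block = $block: $(round(t * 1e3; digits=2)) ms")
end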
Thank you so much. I repeated the experiments with various array sizes.
The 3x speedup turned out to be a special case where size(Y,1) was on the order of the cache-line width and size(Y,2) was much larger, so the layout essentially acted as its own tiling.
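Concretely, the shapes looked something like this (sizes illustrative): with 8 Float64 rows, each column of Y is one 64-byte cache line, so the contiguous-write loop effectively works one cache line at a time.

# Illustrative shape: size(Y, 1) == 8 Float64s == 64 bytes, i.e. about one
# cache line per column of Y, so the contiguous writes of permutedims_custom
# behave as if the loop were already tiled.
X = rand(100_000, 8)
Y = Matrix{Float64}(undef, 8, 100_000)
permutedims_custom(Y, X)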
Swapping the input shapes favored permutedims_custom_2, but permutedims_custom was then only about 1.5x slower (since other optimizations are still possible). The speed of permutedims! is comparable to permutedims_custom in both cases, so I believe its logic is similar (not as optimized as transpose).