When converting a column-major array to row-major, it can be tricky to implement a fast conversion once the array dimensions become large. Tricks like those in ENH: Improve performance for copying/relayout of large data · Issue #21655 · numpy/numpy · GitHub could be utilized. Is there an existing Julia package that supports something like this? Or a CLI tool / C library that does?
Doing this conversion in-place for rectangular arrays is a bit complicated. But even for square arrays, doing this “optimally” likely requires doing the operations blockwise to improve cache use. Doing this for arrays with >2 dimensions is even trickier (code-wise, but probably not much worse in concept). I don’t know of a package that does this Optimally™ and in 20 seconds of websearching didn’t find one, but perhaps there is one floating around somewhere.
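To make the blockwise idea concrete, here is a minimal sketch of a blocked out-of-place transpose (the function name and the block size of 64 are my own, purely illustrative; a real implementation would tune the block size to the cache):

```julia
# Minimal sketch of a blockwise out-of-place transpose. Processing the
# matrix one tile at a time keeps both the reads from A and the writes
# to B within a cache-friendly working set.
function blocked_transpose!(B::AbstractMatrix, A::AbstractMatrix, bs::Int=64)
    m, n = size(A)
    size(B) == (n, m) || throw(DimensionMismatch("B must be $n×$m"))
    for jb in 1:bs:n, ib in 1:bs:m
        # transpose one tile of at most bs×bs elements
        for j in jb:min(jb + bs - 1, n), i in ib:min(ib + bs - 1, m)
            @inbounds B[j, i] = A[i, j]
        end
    end
    return B
end
```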
In most cases, you can simply use `PermutedDimsArray` or `LinearAlgebra.transpose` to make a row-major array (`transpose` only supports 2d arrays). For that matter, Julia will happily use arrays in any majorness (including arbitrary permutations) via a `PermutedDimsArray`.
An easy way to make your opposite-majorness copy (that does not currently use a fancy blocking strategy in an attempt to improve cache efficiency – sorry) is something like this:
```julia
julia> x_colmajor = [10i+j for i in 1:3, j in 1:4]
3×4 Matrix{Int64}:
 11  12  13  14
 21  22  23  24
 31  32  33  34

julia> x_rowmajor = PermutedDimsArray(permutedims(x_colmajor), (2, 1))
3×4 PermutedDimsArray(::Matrix{Int64}, (2, 1)) with eltype Int64:
 11  12  13  14
 21  22  23  24
 31  32  33  34

julia> strides(x_colmajor)
(1, 3)

julia> strides(x_rowmajor)
(4, 1)
```
`permutedims` creates a column-major copy with the dimensions reversed (or with an arbitrary reordering, as requested), and then a `PermutedDimsArray` wrapper makes Julia reinterpret that array as row-major instead (implicitly transposing it back).
If you’re building the array in Julia and then passing it to some library that requires row-major, consider starting your array as a `PermutedDimsArray` (with row-major ordering), filling the entries, and then passing that to the library. Similarly, if your array is coming from an external library in row-major, wrap it in a `PermutedDimsArray` and Julia will interpret it properly. Either of these methods requires no copying at all!
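A sketch of that first workflow (here `external_rowmajor_call` is a hypothetical stand-in for whatever library function you would actually be calling):

```julia
# Allocate backing storage, wrap it so Julia indexes it row-major,
# fill it through the wrapper, then hand the raw buffer to the library.
buf = Matrix{Float64}(undef, 4, 3)   # backing storage for a 3×4 row-major array
x = PermutedDimsArray(buf, (2, 1))   # 3×4 view with strides (4, 1), i.e. row-major
for j in 1:4, i in 1:3
    x[i, j] = 10i + j                # fill through the view; no copy anywhere
end
# external_rowmajor_call(pointer(buf), 3, 4)  # hypothetical C call taking row-major data
```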
The main reason I imagine one would need to create a row-major copy is if you had two non-Julia codebases that required different majorness and Julia was being used to pass data between them.
Actually, it appears that `LinearAlgebra.transpose!` (or `copy(transpose(...))`) does the fancy blocking I was talking about. It only supports 2 dimensions, but that seems to be your case of interest. It uses cache-oblivious blocking rather than something hand-tuned, but should still be close to optimal performance.
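“Cache-oblivious” here means recursively splitting the matrix along its longer dimension until the pieces are small enough to transpose directly, which gives good locality at every cache level without a hand-picked block size. A sketch of the idea (not the actual Base code; the cutoff of 32 is arbitrary):

```julia
# Illustrative cache-oblivious out-of-place transpose: recursively halve
# the longer dimension, then transpose small tiles directly.
function co_transpose!(B, A, i0=1, i1=size(A, 1), j0=1, j1=size(A, 2))
    if i1 - i0 < 32 && j1 - j0 < 32    # base case: small tile
        for j in j0:j1, i in i0:i1
            @inbounds B[j, i] = A[i, j]
        end
    elseif i1 - i0 >= j1 - j0          # split the longer dimension in half
        imid = (i0 + i1) >> 1
        co_transpose!(B, A, i0, imid, j0, j1)
        co_transpose!(B, A, imid + 1, i1, j0, j1)
    else
        jmid = (j0 + j1) >> 1
        co_transpose!(B, A, i0, i1, j0, jmid)
        co_transpose!(B, A, i0, i1, jmid + 1, j1)
    end
    return B
end
```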
```julia
julia> using LinearAlgebra, BenchmarkTools

julia> x_colmajor = randn(1000,2000);

julia> @btime x_rowmajor = PermutedDimsArray(permutedims($x_colmajor),(2,1));
  11.329 ms (6 allocations: 15.26 MiB)

julia> @btime x_rowmajor = transpose(copy(transpose($x_colmajor)));
  2.708 ms (2 allocations: 15.26 MiB)
```
I’ll still advocate for a copy-free method (a simple `transpose` or `PermutedDimsArray` wrapper) if it fits your use case and performs well. But sometimes you really will get better performance out of re-packing the data with a copy.
GitHub - haampie/FastTranspose.jl: Looking for a fast matrix transpose was much better than `transpose!`, but it no longer works on Julia >= 1.6. Someone could try updating and then registering it.
I never really added optimizations to LoopVectorization to help with raw transposes, but it’s still better than not using it:
```julia
julia> @time using LoopVectorization
  0.409410 seconds (377.17 k allocations: 20.222 MiB, 4.68% gc time, 4.20% compilation time)

julia> lvtranspose!(B, A) = @turbo B .= A'
lvtranspose! (generic function with 1 method)

julia> lvtranspose(A) = lvtranspose!(similar(A'), A)
lvtranspose (generic function with 1 method)

julia> using LinearAlgebra, BenchmarkTools

julia> x_colmajor = randn(1000,2000);

julia> @btime x_rowmajor = PermutedDimsArray(permutedims($x_colmajor),(2,1));
  7.993 ms (6 allocations: 15.26 MiB)

julia> @btime x_rowmajor = transpose(copy(transpose($x_colmajor)));
  5.382 ms (2 allocations: 15.26 MiB)

julia> @btime transpose(lvtranspose($x_colmajor));
  2.454 ms (2 allocations: 15.26 MiB)
```
It tends to do really well when fusing transposes with other operations. It supports wrappers like `transpose`, `PermutedDimsArray`, etc., so you can just use those and avoid shuffling around your memory layouts more than necessary.
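For example, scaling while transposing costs essentially nothing extra, since the data only streams through memory once (a sketch; `scale_transpose!` is my own name, not an API):

```julia
using LoopVectorization

# Fuse an elementwise operation with the transpose in a single @turbo pass.
scale_transpose!(B, A, α) = @turbo B .= α .* A'

A = randn(1000, 2000); B = similar(A');
scale_transpose!(B, A, 2.0)   # B == 2A', with no separate transpose pass
```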
A benchmark avoiding the problem of memory allocations:
```julia
julia> using LoopVectorization

julia> lvtranspose!(B, A) = @turbo B .= A'
lvtranspose! (generic function with 1 method)

julia> A = randn(1000,2000); B = similar(A');

julia> @benchmark transpose!($B, $A)
BenchmarkTools.Trial: 195 samples with 1 evaluation.
 Range (min … max):  4.929 ms …   8.182 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.007 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.131 ms ± 390.928 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▄▃▁▁▁
  ██████████▆▆▄▁▆▄▁▁▄▁▁▄▁▄▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▄
  4.93 ms      Histogram: log(frequency) by time      7.46 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark lvtranspose!($B, $A)
BenchmarkTools.Trial: 478 samples with 1 evaluation.
 Range (min … max):  2.050 ms …  2.349 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.071 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.082 ms ± 38.207 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▄▆██▇▆▄▁▂ ▁
  ▅███████████▇█▅▆▄▆▆▅▄▆▄▄▁▄▁▆▅▄▄▄▅▁▁▄▄▄▁▁▄▁▄▄▁▁▄▁▁▁▅▄▄▁▁▁▁▄ ▇
  2.05 ms        Histogram: log(frequency) by time     2.27 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
Yes, we’ve had this since 2014, from WIP: cache oblivious linear algebra algorithms by Jutho · Pull Request #6690 · JuliaLang/julia · GitHub
That being said, in principle you can do better at avoiding cache-associativity conflicts by doing tiling with a buffer; FFTW switched to this following Gatlin and Carter (1999). Portability to different cache sizes is less of an issue here because out-of-place matrix transposition has no temporal locality (each memory location is accessed exactly once), so it’s all about optimizing for cache lines (spatial locality).
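Roughly, the buffered-tiling idea is: copy each tile of A into a small contiguous scratch buffer, then write the buffer out transposed, so the strided reads and strided writes never compete for the same cache sets. A rough sketch of that idea (illustrative only; FFTW's actual implementation is in C and far more careful):

```julia
# Tiling with a buffer: stage each tile in contiguous scratch storage
# before writing it to B, to avoid cache-associativity conflicts.
function buffered_transpose!(B::AbstractMatrix{T}, A::AbstractMatrix{T}, bs::Int=64) where {T}
    m, n = size(A)
    buf = Matrix{T}(undef, bs, bs)           # small contiguous scratch tile
    for jb in 1:bs:n, ib in 1:bs:m
        bi = min(ib + bs - 1, m) - ib + 1
        bj = min(jb + bs - 1, n) - jb + 1
        @inbounds for j in 1:bj, i in 1:bi   # stage: copy tile in, following A's column order
            buf[i, j] = A[ib + i - 1, jb + j - 1]
        end
        @inbounds for i in 1:bi, j in 1:bj   # drain: write tile out transposed, following B's
            B[jb + j - 1, ib + i - 1] = buf[i, j]   # column order; buf stays in cache
        end
    end
    return B
end
```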
I wanted to try throwing off the strides, so that they aren’t multiples of cacheline sizes:
```julia
julia> using LinearAlgebra, LoopVectorization, BenchmarkTools

julia> for N = 1000:1008
           A = rand(N, 1000+N); B = similar(A');
           println("N = $N:")
           print("transpose!: ")
           @btime transpose!($B, $A)
           print("@turbo B .= A':")
           @btime @turbo $B .= $A'
       end
N = 1000:
transpose!: 4.904 ms (0 allocations: 0 bytes)
@turbo B .= A': 2.044 ms (0 allocations: 0 bytes)
N = 1001:
transpose!: 4.970 ms (0 allocations: 0 bytes)
@turbo B .= A': 2.551 ms (0 allocations: 0 bytes)
N = 1002:
transpose!: 4.957 ms (0 allocations: 0 bytes)
@turbo B .= A': 2.377 ms (0 allocations: 0 bytes)
N = 1003:
transpose!: 5.098 ms (0 allocations: 0 bytes)
@turbo B .= A': 2.360 ms (0 allocations: 0 bytes)
N = 1004:
transpose!: 5.008 ms (0 allocations: 0 bytes)
@turbo B .= A': 2.147 ms (0 allocations: 0 bytes)
N = 1005:
transpose!: 4.970 ms (0 allocations: 0 bytes)
@turbo B .= A': 2.293 ms (0 allocations: 0 bytes)
N = 1006:
transpose!: 5.050 ms (0 allocations: 0 bytes)
@turbo B .= A': 2.287 ms (0 allocations: 0 bytes)
N = 1007:
transpose!: 5.081 ms (0 allocations: 0 bytes)
@turbo B .= A': 2.348 ms (0 allocations: 0 bytes)
N = 1008:
transpose!: 4.992 ms (0 allocations: 0 bytes)
@turbo B .= A': 1.991 ms (0 allocations: 0 bytes)
```
LV is using extremely small blocks here, with limited unrolling, to try to take advantage of cache lines.
It might be that this works best with AVX512, where the extremely small blocks are larger than they would be on other systems.
Here’s what I see on a Ryzen 5 5600x (no AVX512 support):
```julia
julia> using LinearAlgebra, LoopVectorization, BenchmarkTools

julia> for N = 1000:1008
           A = rand(N, 1000+N); B = similar(A');
           println("N = $N:")
           print("transpose!: ")
           @btime transpose!($B, $A)
           print("@turbo B .= A':")
           @btime @turbo $B .= $A'
       end
N = 1000:
transpose!: 1.172 ms (0 allocations: 0 bytes)
@turbo B .= A': 385.761 μs (0 allocations: 0 bytes)
N = 1001:
transpose!: 1.292 ms (0 allocations: 0 bytes)
@turbo B .= A': 518.490 μs (0 allocations: 0 bytes)
N = 1002:
transpose!: 1.289 ms (0 allocations: 0 bytes)
@turbo B .= A': 484.340 μs (0 allocations: 0 bytes)
N = 1003:
transpose!: 1.227 ms (0 allocations: 0 bytes)
@turbo B .= A': 489.680 μs (0 allocations: 0 bytes)
N = 1004:
transpose!: 1.283 ms (0 allocations: 0 bytes)
@turbo B .= A': 409.850 μs (0 allocations: 0 bytes)
N = 1005:
transpose!: 1.255 ms (0 allocations: 0 bytes)
@turbo B .= A': 501.470 μs (0 allocations: 0 bytes)
N = 1006:
transpose!: 1.309 ms (0 allocations: 0 bytes)
@turbo B .= A': 481.690 μs (0 allocations: 0 bytes)
N = 1007:
transpose!: 1.329 ms (0 allocations: 0 bytes)
@turbo B .= A': 529.760 μs (0 allocations: 0 bytes)
N = 1008:
transpose!: 1.242 ms (0 allocations: 0 bytes)
@turbo B .= A': 401.190 μs (0 allocations: 0 bytes)
```
It seems to like LoopVectorization’s choices even more than your CPU does.