Function on matrix transpose and performance

Here’s a quick example:

using BenchmarkTools

foo(X::Matrix{Float64}) = maximum(X)
bar(X::AbstractMatrix{Float64}) = maximum(X)

X = randn(5000, 200);
Xt = transpose(X);

@btime foo($X) # 1
  600.898 μs (0 allocations: 0 bytes)
4.697355507091318
@btime bar($X) # 2
  601.388 μs (0 allocations: 0 bytes)
4.697355507091318
@btime foo(collect(transpose($X))) # 3
  3.335 ms (3 allocations: 7.63 MiB)
4.697355507091318
@btime bar($Xt) # 4
  8.347 ms (0 allocations: 0 bytes)
4.697355507091318

I think I understand why the first two are much quicker than the third one (column-major access). However, I did not expect the last one to be that slow; I was expecting it to be faster than the third one (I guess it may be if your data is much larger than what I've tried here).

This can arise a fair bit in ML/Stats packages, where developers may have chosen either the p x n or the n x p convention for design matrices (typically p x n in JuliaStats but n x p in DataFrames).
At the moment a number of these algorithms (e.g. kmeans in Clustering.jl) simply error if you give them a transpose, which is not ideal. That's because their signatures are tied to Matrix rather than AbstractMatrix. It's an easy fix and I intend to propose PRs for it, but I was not expecting it to cause a big performance difference.
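
To illustrate what I mean by the fix: loosening the signature is enough to accept a lazy transpose (hypothetical fit_strict/fit_loose names, not the actual Clustering.jl code):

fit_strict(M::Matrix{Float64}) = maximum(M)
fit_strict(Xt)   # ERROR: MethodError (a Transpose is not a Matrix)

fit_loose(M::AbstractMatrix{Float64}) = maximum(M)
fit_loose(Xt)    # works, since Transpose <: AbstractMatrix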

Am I missing something here? Is it expected that passing the transpose is slower than passing collect(transpose)? Thanks!

I can’t reproduce (in Julia 1.1). If I do

X = randn(5000, 200);
Xt = transpose(X);
@btime maximum($X);
@btime maximum($Xt);

I get

 5.617 ms (0 allocations: 0 bytes)
 8.111 ms (0 allocations: 0 bytes)

(A slight slowdown, probably due to poor cache-line utilization in the transposed-access case.)
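
For intuition, reducing over the lazy transpose effectively traverses the column-major parent row by row, which amounts to the strided loop below (a minimal sketch with a hypothetical name); for a 5000×200 parent, each inner read jumps 5000 elements, so cache lines are poorly reused:

# Hypothetical illustration: row-major traversal of a column-major Matrix,
# which is what reducing over its lazy transpose effectively does.
function max_row_major(X::AbstractMatrix)
    m = -Inf
    for i in axes(X, 1), j in axes(X, 2)   # inner loop strides by size(X, 1) elements
        m = max(m, X[i, j])
    end
    return m
end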

Thanks for the reply. Hmm, that's odd; it seems to be a problem with 1.2:

Julia Version 1.2.0-DEV.169
Commit 7ed9e4fb35 (2019-01-14 22:06 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> X = randn(5000, 200);
julia> Xt = transpose(X);
julia> using BenchmarkTools
julia> @btime maximum($X)
  598.870 μs (0 allocations: 0 bytes)
4.88752576667973
julia> @btime maximum($Xt)
  8.211 ms (0 allocations: 0 bytes)
4.88752576667973
julia> @btime maximum(collect(transpose($X)))
  3.329 ms (3 allocations: 7.63 MiB)
4.88752576667973

PS: I think my point is not quite about row- vs column-major here, but rather that I don't understand why adding the collect makes it faster. It seems a bit counterintuitive to me, since it fully allocates a copy of the object.

Edit: yes, it seems to be version-related; I get similar results to yours on 1.1, with no discrepancy in timing between using Xt and collect(transpose(X)). Should I report this somewhere?

Edit2: this does not seem to happen with all functions. I just tried svd and get similar timings for the transpose and collect(transpose(...)); I do get the gap with maximum, norm, sum, …, so it could be a reduction problem?

It should be even faster if you call copy(transpose(X)) — Julia has an optimized copy(A) routine for transposed arrays that has good cache-line utilization. Basically, this is the problem of optimized out-of-place transposition, and has been extensively studied; the optimum is some kind of blocked or recursive cache-oblivious algorithm, and a cache-oblivious transpose routine is implemented in Julia (from #6690, I believe).

It doesn’t look like collect calls this optimized routine, but it probably should.
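
For intuition, a blocked out-of-place transpose looks roughly like the sketch below (illustrative only; Base's copy(::Transpose) method is a more careful cache-oblivious recursive version):

# Each bs×bs tile fits in cache, so both the strided reads from src and the
# contiguous writes to dest stay cache-friendly within a tile.
function blocked_transpose!(dest::AbstractMatrix, src::AbstractMatrix; bs::Int = 64)
    n, m = size(src)
    size(dest) == (m, n) || throw(DimensionMismatch("dest must be $m x $n"))
    for jb in 1:bs:m, ib in 1:bs:n
        for j in jb:min(jb + bs - 1, m), i in ib:min(ib + bs - 1, n)
            @inbounds dest[j, i] = src[i, j]
        end
    end
    return dest
end

# usage: blocked_transpose!(similar(X, reverse(size(X))), X)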

Did it get faster or slower in 1.2? If maximum(X) got much faster, the question is why and whether that can be replicated for other memory layouts. If, on the other hand, it got much slower in 1.2, then you should certainly file a performance issue.

Ok thank you, I shall use copy(transpose(X)).

Re performance, it seems to be significantly faster in 1.2 for the non-transposed array and about the same for transposed arrays (copied or not; cf. the benchmark results above). This is observed with functions such as maximum, sum, and norm, so it may be related to reductions. It does not happen with more complex functions like svd, as far as I can tell.

I have a follow-up question: what would be the recommended way to do A .+= transpose(B)? Everything I tried is slower than A .+= copy(transpose(B)), but I was hoping for an in-place solution.

What size arrays?

julia> A = rand(100,100); B = rand(100,100);

julia> @benchmark $A .+= transpose($B)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  4.395 μs …  7.586 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.412 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.419 μs ± 65.054 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▃▅▇██▇▅▃                                       ▁ ▁       ▂
  ▆██████████▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▁▅▇██████▇▇▅▅ █
  4.39 μs      Histogram: log(frequency) by time     4.56 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark $A .+= copy(transpose($B))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  13.946 μs … 718.973 μs  ┊ GC (min … max): 0.00% … 94.31%
 Time  (median):     14.799 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.697 μs ±  19.752 μs  ┊ GC (mean ± σ):  4.92% ±  3.83%

           ▂▄▆█▇▅▃▂
  ▂▁▁▂▂▂▃▄▆████████▇▆▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  13.9 μs         Histogram: frequency by time         17.7 μs <

 Memory estimate: 78.17 KiB, allocs estimate: 2.

julia> using LoopVectorization  # provides @turbo

julia> @benchmark @turbo $A .+= transpose($B)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.501 μs …  4.613 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.508 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.511 μs ± 40.359 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▇█▂
  ▂▃▇███▄▂▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂ ▂
  1.5 μs         Histogram: frequency by time        1.61 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

@turbo is the fastest for me for 100x100, and the copy is the slowest.

At what sizes? A .+= transpose(B) always seems quicker for me:

julia> N = 10; A = rand(N,N); B = rand(N,N);

julia> @btime $A .+= transpose($B);
  min 53.794 ns, mean 54.080 ns (0 allocations)

julia> @btime $A .+= permutedims($B);
  min 141.019 ns, mean 155.724 ns (1 allocation, 896 bytes. GC mean 6 ns, 3.66%)

julia> @btime $A .+= copy(transpose($B));
  min 125.974 ns, mean 143.013 ns (1 allocation, 896 bytes. GC mean 6 ns, 3.85%)

julia> N = 1000; A = rand(N,N); B = rand(N,N);

julia> @btime $A .+= transpose($B);
  min 623.125 μs, mean 633.971 μs (0 allocations)

julia> @btime $A .+= permutedims($B);
  min 1.104 ms, mean 1.710 ms (2 allocations, 7.63 MiB. GC mean 155 μs, 9.05%)

julia> @btime $A .+= copy(transpose($B));
  min 1.471 ms, mean 1.961 ms (2 allocations, 7.63 MiB. GC mean 176 μs, 8.96%)

But the original question about maximum does not seem solved. Some reductions like sum know to unwrap the transpose, but it appears that maximum does not:

julia> X = randn(5000, 200);

julia> Xt = transpose(X);

julia> @btime maximum($X);
  min 647.042 μs, mean 650.693 μs (0 allocations)

julia> @btime maximum($Xt);
  min 4.016 ms, mean 4.025 ms (0 allocations)

julia> @btime maximum(copy($Xt));
  min 1.558 ms, mean 2.080 ms (2 allocations, 7.63 MiB. GC mean 168 μs, 8.09%)

# compare sum:

julia> @btime sum($X);
  min 179.958 μs, mean 180.624 μs (0 allocations)

julia> @btime sum($Xt);
  min 179.917 μs, mean 180.588 μs (0 allocations)

For maximum, calling parent is obviously better than calling copy, since the maximum of a matrix equals the maximum of its transpose.
Also, LoopVectorization will make maximum much faster: Base's maximum does not SIMD, but LoopVectorization can vectorize it. The easiest way may be LoopVectorization.vreduce(max, X).
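
For example (assuming vreduce accepts a dense matrix, which I believe it does for strided arrays):

using LoopVectorization

maximum(parent(Xt))                          # unwrap: same result, column-major speed
LoopVectorization.vreduce(max, parent(Xt))   # SIMD max reduction on the dense parent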

Realize that non-trivial linear-algebra operations on an $N \times N$ dense matrix are typically $O(N^3)$ (e.g. solving $Ax=b$, least-squares, QR, eigenvalues, etc.), while operations like $A + A^T$ or $\max_{i,j} (A^T)_{ij}$ are only $O(N^2)$; for $N = 5000$ that is on the order of $10^{11}$ versus $2.5 \times 10^7$ operations. Is it really worth investing much optimization effort into the $O(N^2)$ operations for your application?

Sorry, I should have given an example. The following is with N = 5000.

julia> N = 5000; A = randn(N, N); B = copy(A);

julia> @benchmark $A .+= transpose($B)
BenchmarkTools.Trial: 34 samples with 1 evaluation.
 Range (min … max):  146.471 ms … 160.617 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     151.444 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   151.021 ms ±   3.185 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▃ █    ▃          ▃   █  █      ▃
  ▇▁▇█▇█▇▇▇▁█▁▁▇▇▇▁▁▁▁▁█▇▇▁█▇▁█▁▇▁▇▁▇█▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  146 ms           Histogram: frequency by time          161 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark $A .+= copy(transpose($B))
BenchmarkTools.Trial: 46 samples with 1 evaluation.
 Range (min … max):  102.736 ms … 119.048 ms  ┊ GC (min … max):  1.31% … 13.51%
 Time  (median):     107.652 ms               ┊ GC (median):    13.27%
 Time  (mean ± σ):   109.394 ms ±   4.559 ms  ┊ GC (mean ± σ):  13.23% ±  1.92%

            ▆█ ▃     ▁
  ▄▁▁▁▁▁▁▁▁▄██▄█▄▇▄▄▄█▁▄▇▁▄▁▁▄▁▁▁▄▁▁▄▇▁▁▄▁▁▁▁▄▁▁▁▄▁▁▁▁▁▁▄▄▄▇▁▁▇ ▁
  103 ms           Histogram: frequency by time          119 ms <

 Memory estimate: 190.73 MiB, allocs estimate: 2.

julia> versioninfo()
Julia Version 1.8.0-DEV.310
Commit cb30aa7a08 (2021-08-06 03:11 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

@stevengj no, but I was just surprised at the time difference between A += B and A += B'.

OK, at 5000 I sometimes see a gap…

julia> N = 5000; A = rand(N,N); B = rand(N,N);

julia> @btime $A .+= transpose($B);
  48.650 ms (0 allocations: 0 bytes) # julia 1.7 via rosetta on M1 mac
  75.294 ms # on 1.8 native

julia> @btime $A .+= permutedims($B);
  106.922 ms (2 allocations: 190.73 MiB)
  95.070 ms # on 1.8

julia> @btime $A .+= copy(transpose($B));
  83.482 ms (2 allocations: 190.73 MiB)
  50.543 ms # on 1.8

julia> @btime $A += transpose($B); # without the .
  87.188 ms (2 allocations: 190.73 MiB)

julia> using TensorOperations, Strided  # better permutedims than native

julia> @btime @tensor $A[x,y] += $B[y,x];
  20.473 ms (60 allocations: 4.84 KiB)

julia> @btime @strided $A .+= transpose($B);
  17.845 ms (67 allocations: 5.09 KiB)

julia> using Tullio

julia> @btime @tullio $A[x,y] += $B[y,x];  # tiled access + multi-threaded
  15.338 ms (44 allocations: 2.14 KiB)
  28.624 ms (0 allocations: 0 bytes)  # with julia -t1

julia> using LoopVectorization

julia> @btime @turbo $A .+= transpose($B);
  31.416 ms (0 allocations: 0 bytes)

julia> @btime @tturbo $A .+= transpose($B);
  11.874 ms (0 allocations: 0 bytes)

What's going on, BTW, is that broadcasting simply iterates, while copy(transpose(B)) uses a tiled access pattern to be cache-friendly, and IIRC permutedims uses a different one. TensorOperations.jl / Strided.jl have better algorithms for this.
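
Schematically, the broadcast amounts to a plain double loop in A's column-major order, with strided reads from B (an illustrative sketch, not the actual Broadcast machinery):

function add_transpose_naive!(A::AbstractMatrix, B::AbstractMatrix)
    for j in axes(A, 2), i in axes(A, 1)
        @inbounds A[i, j] += B[j, i]   # contiguous write, strided read
    end
    return A
end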

And for maximum: I thought #39513 made all reductions fall through to parent(Xt) (long after the first message above), but apparently this one doesn't.
