I think I understand why the first two are much quicker than the third one (column major). However, I did not expect the last one to be that slow; I was expecting it to be faster/better than the third one (I guess it may be if your data is much larger than what I’ve tried here).

This comes up a fair bit in ML/stats packages, where developers may have chosen either the p x n or the n x p convention for design matrices (typically p x n in JuliaStats but n x p in DataFrames).
At the moment a number of these algorithms (e.g. kmeans in Clustering.jl) simply error if you pass them a transpose, which is not ideal. That’s because their signatures are tied to Matrix instead of AbstractMatrix. It’s an easy fix and I intend to propose PRs for it, but I was not expecting it to cause a big performance difference.
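A minimal sketch of the kind of fix meant here: relaxing Matrix to AbstractMatrix in the signature is usually enough. The function names colsum_strict and colsum_generic below are made up for illustration; they are not from Clustering.jl.

```julia
using LinearAlgebra  # for the lazy Transpose wrapper

# A method restricted to Matrix rejects the Transpose wrapper outright,
# while AbstractMatrix accepts it (along with Adjoint, views, etc.):
colsum_strict(X::Matrix) = vec(sum(X, dims=1))           # MethodError on transpose(X)
colsum_generic(X::AbstractMatrix) = vec(sum(X, dims=1))  # works for any matrix type

X = randn(4, 3)
colsum_generic(transpose(X))   # fine
# colsum_strict(transpose(X))  # would throw a MethodError
```

Whether the generic method is then *fast* on a Transpose is a separate question, which is what the benchmarks below are about.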

Am I missing something here? Is it expected that passing the transpose is slower than passing collect(transpose)? Thanks!

Thanks for the reply. Hmm, that’s odd; it seems to be a problem with 1.2:

Julia Version 1.2.0-DEV.169
Commit 7ed9e4fb35 (2019-01-14 22:06 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin14.5.0)
CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
julia> X = randn(5000, 200);
julia> Xt = transpose(X);
julia> using BenchmarkTools
julia> @btime maximum($X)
598.870 μs (0 allocations: 0 bytes)
4.88752576667973
julia> @btime maximum($Xt)
8.211 ms (0 allocations: 0 bytes)
4.88752576667973
julia> @btime maximum(collect(transpose($X)))
3.329 ms (3 allocations: 7.63 MiB)
4.88752576667973

PS: I think my point is not quite about row- vs column-major here, but rather that I don’t understand why adding the collect makes it faster. It seems counterintuitive to me, since it fully allocates a copy of the object?

Edit: yes, it seems to be version-related. I get results similar to yours on 1.1, with no discrepancy in time between using Xt and collect(transpose(X)). Should I report this somewhere?

Edit 2: this does not seem to happen with all functions. I just tried with svd and get similar timings for transpose and collect/transpose. I do get it with maximum, norm, sum, …, so could it be a reduce problem?

It should be even faster if you call copy(transpose(X)): Julia has an optimized copy(A) routine for transposed arrays with good cache-line utilization. Basically, this is the problem of optimized out-of-place transposition, which has been extensively studied; the optimum is some kind of blocked or recursive cache-oblivious algorithm, and a cache-oblivious transpose routine is implemented in Julia (from #6690, I believe).

It doesn’t look like collect calls this optimized routine, but it probably should.
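To illustrate the distinction: both calls materialize the same plain Matrix, they just take different paths to it (sizes here are kept small, just to show the equivalence):

```julia
using LinearAlgebra

X = randn(100, 50)
Xt = transpose(X)       # lazy wrapper, no data movement yet
A = copy(Xt)            # per the above, dispatches to the blocked/cache-oblivious transpose
B = collect(Xt)         # materializes via the generic AbstractArray path
A == B && A isa Matrix  # identical result; only the access pattern differs
```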

Did it get faster or slower in 1.2? If maximum(X) got much faster, the question is why and whether that can be replicated for other memory layouts. If, on the other hand, it got much slower in 1.2, then you should certainly file a performance issue.

Re performance: it seems to be significantly faster in 1.2 for the non-transposed array, and about the same for transposed arrays (copied or not); cf. the benchmark results above. This is observed with functions such as maximum, sum, and norm, so it may be related to reductions. It does not happen with more complex functions like svd, as far as I can tell.

I have a follow-up question: what would be the recommended way to do A .+= transpose(B)? Everything I tried is slower than A .+= copy(transpose(B)), but I was hoping for an in-place solution.

At what sizes? A .+= transpose(B) always seems quicker for me:

julia> N = 10; A = rand(N,N); B = rand(N,N);
julia> @btime $A .+= transpose($B);
min 53.794 ns, mean 54.080 ns (0 allocations)
julia> @btime $A .+= permutedims($B);
min 141.019 ns, mean 155.724 ns (1 allocation, 896 bytes. GC mean 6 ns, 3.66%)
julia> @btime $A .+= copy(transpose($B));
min 125.974 ns, mean 143.013 ns (1 allocation, 896 bytes. GC mean 6 ns, 3.85%)
julia> N = 1000; A = rand(N,N); B = rand(N,N);
julia> @btime $A .+= transpose($B);
min 623.125 μs, mean 633.971 μs (0 allocations)
julia> @btime $A .+= permutedims($B);
min 1.104 ms, mean 1.710 ms (2 allocations, 7.63 MiB. GC mean 155 μs, 9.05%)
julia> @btime $A .+= copy(transpose($B));
min 1.471 ms, mean 1.961 ms (2 allocations, 7.63 MiB. GC mean 176 μs, 8.96%)

But the original question about maximum does not seem solved. Some reductions, like sum, know to unwrap the transpose, but it appears that maximum does not:

julia> X = randn(5000, 200);
julia> Xt = transpose(X);
julia> @btime maximum($X);
min 647.042 μs, mean 650.693 μs (0 allocations)
julia> @btime maximum($Xt);
min 4.016 ms, mean 4.025 ms (0 allocations)
julia> @btime maximum(copy($Xt));
min 1.558 ms, mean 2.080 ms (2 allocations, 7.63 MiB. GC mean 168 μs, 8.09%)
# compare sum:
julia> @btime sum($X);
min 179.958 μs, mean 180.624 μs (0 allocations)
julia> @btime sum($Xt);
min 179.917 μs, mean 180.588 μs (0 allocations)

For maximum, calling parent is obviously better than calling copy, as the maximum of a matrix equals the maximum of its transpose.
Also, LoopVectorization will make maximum much faster: Base’s maximum is not SIMD, but LoopVectorization can vectorize it. The easiest way may be LoopVectorization.vreduce(max, X).
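For instance (the helper name fastmax is made up for illustration; this relies only on the fact that max is invariant under reordering of the elements):

```julia
using LinearAlgebra

# Hypothetical helper: reduce over the memory-order parent array instead of
# the lazy wrapper, since the maximum doesn't depend on traversal order.
fastmax(A::Union{Transpose,Adjoint}) = maximum(parent(A))
fastmax(A::AbstractMatrix) = maximum(A)

X = randn(5000, 200)
Xt = transpose(X)
fastmax(Xt) == maximum(Xt)  # same answer, but the parent version traverses contiguously
```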

Realize that non-trivial linear-algebra operations on an N\times N dense matrix are typically O(N^3) (e.g. solving Ax=b, least-squares, QR, eigenvalues, etc.) … is it really worth it to invest much optimization effort into O(N^2) operations like A+A^T or \max_{i,j} (A^T)_{ij} for your application?

julia> N = 5000; A = rand(N,N); B = rand(N,N);
julia> @btime $A .+= transpose($B);
48.650 ms (0 allocations: 0 bytes) # julia 1.7 via rosetta on M1 mac
75.294 ms # on 1.8 native
julia> @btime $A .+= permutedims($B);
106.922 ms (2 allocations: 190.73 MiB)
95.070 ms # on 1.8
julia> @btime $A .+= copy(transpose($B));
83.482 ms (2 allocations: 190.73 MiB)
50.543 ms # on 1.8
julia> @btime $A += transpose($B); # without the .
87.188 ms (2 allocations: 190.73 MiB)
julia> using TensorOperations, Strided # better permutedims than native
julia> @btime @tensor $A[x,y] += $B[y,x];
20.473 ms (60 allocations: 4.84 KiB)
julia> @btime @strided $A .+= transpose($B);
17.845 ms (67 allocations: 5.09 KiB)
julia> using Tullio
julia> @btime @tullio $A[x,y] += $B[y,x]; # tiled access + multi-threaded
15.338 ms (44 allocations: 2.14 KiB)
28.624 ms (0 allocations: 0 bytes) # with julia -t1
julia> using LoopVectorization
julia> @btime @turbo $A .+= transpose($B);
31.416 ms (0 allocations: 0 bytes)
julia> @btime @tturbo $A .+= transpose($B);
11.874 ms (0 allocations: 0 bytes)

What’s going on, BTW, is that broadcasting simply iterates, while copy(transpose(B)) uses a tiled access pattern to be cache-friendly, and IIRC permutedims uses a different one. TensorOperations.jl and Strided.jl have better algorithms for this.
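A minimal sketch of what such a tiled access pattern looks like, assuming nothing beyond Base (the helper name add_transpose! is made up, and the block size of 64 is an untuned guess, not a recommendation):

```julia
# Add B' into A tile by tile, so both matrices are touched in
# cache-sized blocks rather than in one long strided sweep.
function add_transpose!(A::AbstractMatrix, B::AbstractMatrix, bs::Int=64)
    n, m = size(A)
    size(B) == (m, n) || throw(DimensionMismatch("B must have the reverse of size(A)"))
    for jb in 1:bs:m, ib in 1:bs:n
        # within a tile, A is walked column-major and B's strided
        # accesses stay inside a cache-sized region
        for j in jb:min(jb + bs - 1, m), i in ib:min(ib + bs - 1, n)
            @inbounds A[i, j] += B[j, i]
        end
    end
    return A
end
```

Real implementations (Strided.jl, the optimized copy mentioned above) use more careful blocking or recursion and threading, but this is the core idea.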

And for maximum, I thought #39513 made all reductions fall through to parent(Xt) (long after the first message above), but apparently this one doesn’t.