Function on matrix transpose and performance

performance

#1

Here’s a quick example:

foo(X::Matrix{Float64}) = maximum(X)
bar(X::AbstractMatrix{Float64}) = maximum(X)

X = randn(5000, 200);
Xt = transpose(X);

@btime foo($X) # 1
600.898 μs (0 allocations: 0 bytes)
4.697355507091318
@btime bar($X) # 2
 601.388 μs (0 allocations: 0 bytes)
4.697355507091318
@btime foo(collect(transpose($X))) # 3
3.335 ms (3 allocations: 7.63 MiB)
4.697355507091318
@btime bar($Xt) # 4
8.347 ms (0 allocations: 0 bytes)
4.697355507091318

I think I understand why the first two are much quicker than the third one (column major). However I did not expect the last one to be that slow and was expecting it to be faster/better than the 3d one (I guess it may be if your data is much larger than the one I’ve tried here).

This can arise a fair bit in ML/Stats packages where devs can have chosen either the p x n or n x p convention for design matrices (typically p x n in JuliaStats but n x p in DataFrames).
At the moment a number of these algorithms (e.g. kmeans in Clustering.jl) just error if you give them a transpose which is not ideal. That’s because they’re tied to Matrix types instead of AbstractMatrix. It’s an easy fix and I intend to propose PRs for this but I was not expecting it to cause a big performance difference.

Am I missing something here? is it expected that passing the transpose is slower than passing collect(transpose)? Thanks!


About the convention for the design matrix (row-as-observation or column-as-observation)
#2

I can’t reproduce (in Julia 1.1). If I do

X = randn(5000, 200);
Xt = transpose(X);
@btime maximum($X);
@btime maximum($Xt);

I get

 5.617 ms (0 allocations: 0 bytes)
 8.111 ms (0 allocations: 0 bytes)

(A slight slowdown, probably due to poor cache-line utilization in the transposed-access case.)


#3

Thanks for the reply, hmm that’s odd. It seems to be a problem with 1.2

Julia Version 1.2.0-DEV.169
Commit 7ed9e4fb35 (2019-01-14 22:06 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)

julia> X = randn(5000, 200);
julia> Xt = transpose(X);
julia> using BenchmarkTools
julia> @btime maximum($X)
  598.870 μs (0 allocations: 0 bytes)
4.88752576667973
julia> @btime maximum($Xt)
  8.211 ms (0 allocations: 0 bytes)
4.88752576667973
julia> @btime maximum(collect(transpose($X)))
  3.329 ms (3 allocations: 7.63 MiB)
4.88752576667973

PS: I think my point is not quite about row vs column major here but rather that I don’t understand why adding the collect makes it faster. It seems a bit counterintuitive to me since it fully allocates the object?

Edit: yes it seems to be version related, I get similar results to yours on 1.1 and no discrepancy in terms of time between using Xt and collect(transpose(X)). Should I report this somewhere?

Edit2: this does not seem to happen with all functions, I just tried with svd and get similar timings for transpose and collect/transpose. I get it with maximum , norm, sum, … so it could be a reduce problem?


#4

It should be even faster if you call copy(transpose(X)) — Julia has an optimized copy(A) routine for transposed arrays that has good cache-line utilization. Basically, this is the problem of optimized out-of-place transposition, and has been extensively studied; the optimum is some kind of blocked or recursive cache-oblivious algorithm, and a cache-oblivious transpose routine is implemented in Julia (from #6690, I believe).

It doesn’t look like collect calls this optimized routine, but it probably should.

Did it get faster or slower in 1.2? If maximum(X) got much faster, the question is why and whether that can be replicated for other memory layouts. If, on the other hand, it got much slower in 1.2, then you should certainly file a performance issue.


#5

Ok thank you, I shall use copy(transpose(X)).

Re performance, it seems to be significantly faster in 1.2 for the non-transposed array and about the same for transposed arrays (copied or not) (cf benchmark results above). This is observed with functions such as maximum, sum, norm, so maybe related to reductions. It does not happen with more complex functions like svd as far as I can tell.