It should be even faster if you call `copy(transpose(X))`: Julia has an optimized `copy(A)` routine for transposed arrays that has good cache-line utilization. Basically, this is the problem of optimized out-of-place transposition, which has been extensively studied; the optimum is some kind of blocked or recursive cache-oblivious algorithm, and a cache-oblivious transpose routine is implemented in Julia (from #6690, I believe).
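To make the idea of a recursive cache-oblivious transpose concrete, here is a minimal sketch. It is not Julia's internal routine: the name `transpose_rec!`, the `blocksize` keyword, and its default of 32 are all made up for illustration, and it assumes ordinary 1-based matrices. It keeps halving the larger dimension until a block is small enough to plausibly fit in cache, then copies that block directly.

```julia
# Hypothetical helper for illustration only (not the routine in Base).
# Writes the transpose of A into B by recursively halving the larger
# dimension of the current block until it is at most blocksize × blocksize.
function transpose_rec!(B::AbstractMatrix, A::AbstractMatrix,
                        rows::UnitRange{Int}=1:size(A, 1),
                        cols::UnitRange{Int}=1:size(A, 2);
                        blocksize::Int=32)
    blocksize >= 1 || throw(ArgumentError("blocksize must be at least 1"))
    size(B) == reverse(size(A)) ||
        throw(DimensionMismatch("B must have the dimensions of A reversed"))
    m, n = length(rows), length(cols)
    if m <= blocksize && n <= blocksize
        # Base case: small block, copy element by element.
        @inbounds for j in cols, i in rows
            B[j, i] = A[i, j]
        end
    elseif m >= n
        # Split the row range in half.
        mid = first(rows) + m ÷ 2
        transpose_rec!(B, A, first(rows):mid-1, cols; blocksize=blocksize)
        transpose_rec!(B, A, mid:last(rows), cols; blocksize=blocksize)
    else
        # Split the column range in half.
        mid = first(cols) + n ÷ 2
        transpose_rec!(B, A, rows, first(cols):mid-1; blocksize=blocksize)
        transpose_rec!(B, A, rows, mid:last(cols); blocksize=blocksize)
    end
    return B
end

X = rand(300, 200)
B = transpose_rec!(similar(X, 200, 300), X)
B == copy(transpose(X))   # same result as the built-in route
```

In practice you would just call `copy(transpose(X))`; the sketch is only meant to show why the recursive splitting keeps both the reads from `A` and the writes to `B` within small blocks.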
It doesn’t look like `collect` calls this optimized routine, but it probably should.
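If you want to check this on your own machine, a rough timing comparison along the following lines should do. The 2000×2000 size and the five repetitions are arbitrary choices for illustration, and the numbers will depend on your Julia version and hardware, so no particular outcome is claimed here.

```julia
X = rand(2000, 2000)
Xt = transpose(X)   # lazy wrapper, no data is moved yet

# Warm up both calls so compilation time is not measured.
copy(Xt)
collect(Xt)

# Take the best of a few runs to reduce noise.
t_copy    = minimum(@elapsed(copy(Xt)) for _ in 1:5)
t_collect = minimum(@elapsed(collect(Xt)) for _ in 1:5)

println("copy(transpose(X)):    ", t_copy, " s")
println("collect(transpose(X)): ", t_collect, " s")
```

Both calls produce the same dense matrix, so any difference in time comes from how the elements are traversed while copying.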
Did it get faster or slower in 1.2? If `maximum(X)` got much faster, the question is why and whether that can be replicated for other memory layouts. If, on the other hand, it got much slower in 1.2, then you should certainly file a performance issue.