Just a random thought. Consider the multiplication of three matrices: A * (B * C).
If A is column-major, would the execution time of the outer multiplication be substantially smaller (when the product B * C and the matrix A together exceed the CPU cache size) if Julia stored the temporary B * C as a row-major matrix?
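To make the question concrete, here is a rough sketch of what I mean (the sizes and the name Tt are just illustrative):

using LinearAlgebra
A = rand(2000, 2000); B = rand(2000, 2000); C = rand(2000, 2000);

# Status quo: the temporary T = B*C is column-major, like every Julia Matrix.
T = B * C
D1 = A * T

# Hypothetical "row-major temporary": a column-major (B*C)' has the same
# memory layout as a row-major B*C, so materialize the transpose and
# multiply through a lazy transpose wrapper.
Tt = copy(transpose(B * C))
D2 = A * transpose(Tt)

D1 ≈ D2  # true; the question is whether computing D2 could be faster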
No. The blocking performed by an optimized BLAS implementation mostly eliminates the difference between row-major and column-major memory layouts. You can easily test this for yourself: a transposed column-major array is equivalent to a row-major array, so computing A*B' in Julia is effectively column-major × row-major. (Julia evaluates A*B' by passing a transpose flag to the underlying BLAS routine, which operates directly on the storage of A and B without materializing the transpose.)
julia> using BenchmarkTools
julia> A = rand(1000,1000); B = copy(A);
julia> f(a,b) = a*b    # column-major × column-major
f (generic function with 1 method)
julia> g(a,b) = a'*b   # row-major × column-major (a' is a lazy transpose)
g (generic function with 1 method)
julia> h(a,b) = a*b'   # column-major × row-major
h (generic function with 1 method)
julia> @btime f($A,$B) evals=1;
13.416 ms (2 allocations: 7.63 MiB)
julia> @btime g($A,$B) evals=1;
13.364 ms (2 allocations: 7.63 MiB)
julia> @btime h($A,$B) evals=1;
13.560 ms (2 allocations: 7.63 MiB)
The timing differences here are less than 2%, which is probably within the noise.
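Under the hood, the transposed cases above never copy any data; they just flip a character flag in the BLAS call. A minimal sketch of the equivalent direct call, using the gemm wrapper from the LinearAlgebra standard library:

using LinearAlgebra
# 'N' = use the matrix as stored, 'T' = treat it as transposed.
# No transposed copy of B is ever created.
C1 = BLAS.gemm('N', 'T', A, B)
C1 ≈ A * B'  # true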
Of course, there are other reasons to support row-major formats, mainly to pass data to/from external libraries without making copies, but this can be handled by add-on packages. (For example, PyCall already provides copy-free views of row-major NumPy arrays.)
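A rough sketch of how that looks with PyCall's PyArray wrapper (assuming PyCall and NumPy are installed; the names here are just illustrative):

using PyCall
np = pyimport("numpy")
# Ask for a PyObject so PyCall doesn't eagerly convert (and copy) the result:
o = pycall(np.reshape, PyObject, np.arange(6.0), (2, 3))
# PyArray wraps the NumPy buffer with its row-major (C-order) strides,
# so no copy is made; mutations are visible on the Python side too.
Anp = PyArray(o)
Anp[1, 2]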