# Effect of Column Major Storage on Multiplication

Hello,
I am multiplying a square matrix by a vector. I thought `A' * x` would be faster than `A * x`, since the former takes inner products of `x` with the columns of `A`, which are stored contiguously in Julia, whereas the latter takes inner products of the rows of `A` with `x`, and the rows are not stored contiguously.
Thanks.

```julia
julia> using BenchmarkTools

julia> N = 10000;

julia> A = randn(N, N);

julia> x = randn(N);

julia> @btime A*x;
  36.162 ms (2 allocations: 78.20 KiB)

julia> @btime A'*x;
  34.903 ms (3 allocations: 78.22 KiB)
```

Your benchmark does not violate your conjecture…? `A'x` executed faster.

The reason there's no significant difference is that `A * x` is of course not implemented as a bunch of row-by-column products. That's just a convenient way of explaining the product intuitively, but there are many ways of organizing the operations.
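To make that concrete, here is a sketch (my own toy functions, not Julia's actual implementation) of two loop orders that both compute `y = A * x`:

```julia
# Two loop orders for y = A*x. Both compute the same result;
# only the access pattern over A differs.
function matvec_rows!(y, A, x)   # one dot product per row: strided reads of A
    for i in axes(A, 1)
        s = zero(eltype(y))
        for j in axes(A, 2)
            s += A[i, j] * x[j]
        end
        y[i] = s
    end
    return y
end

function matvec_cols!(y, A, x)   # one axpy per column: contiguous reads of A
    fill!(y, zero(eltype(y)))
    for j in axes(A, 2), i in axes(A, 1)
        y[i] += A[i, j] * x[j]
    end
    return y
end
```

A library is free to use the column-oriented order (or a blocked variant) regardless of whether you wrote `A * x` or `A' * x`, which is why the "row times column" picture doesn't predict performance.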


Iβm going to focus on matrix-matrix multiplication instead of matrix-vector.

See the LoopVectorization benchmarks here for results across a range of sizes.

Picking one size (144x144), let's look a little closer at a very simple implementation of matrix multiplication, to guarantee that nothing too fancy is going on:

```julia
julia> using LoopVectorization, PaddedMatrices

julia> M = K = N = 144; C = Matrix{Float64}(undef, M, N); A = rand(M, K); B = rand(K, N);

julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(A,1), n ∈ axes(B,2)
               Cmn = zero(eltype(C))
               for k ∈ axes(A,2)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)
```

```julia
julia> @benchmark AmulB!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     47.615 μs (0.00% GC)
  median time:      47.789 μs (0.00% GC)
  mean time:        47.895 μs (0.00% GC)
  maximum time:     129.122 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark AmulB!($C, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     69.159 μs (0.00% GC)
  median time:      69.557 μs (0.00% GC)
  mean time:        69.722 μs (0.00% GC)
  maximum time:     129.797 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

The column-major version was about 45% faster!

Why is that? Because a column-major `A` is actually great for multiplying matrices. In fact, even `A' * B'` is faster than `A' * B`:

```julia
julia> @benchmark AmulB!($C, $A', $B')
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     59.039 μs (0.00% GC)
  median time:      60.140 μs (0.00% GC)
  mean time:        59.920 μs (0.00% GC)
  maximum time:     117.935 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

Because `A' * B'` is like `(B * A)'`.
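That identity is easy to check numerically:

```julia
A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]

# Transposing both operands is the same as computing the product
# in the opposite order and transposing the result once:
@assert A' * B' ≈ (B * A)'
```

So `A' * B'` can be evaluated as a plain column-major product `B * A` followed by a single transpose of the result.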

Why is this? There are two major problems with `A' * B`:

1. Dot products aren't that efficient with respect to maximizing SIMD: you do a bunch of SIMD operations, but then at the end you need to do a bunch of extra operations to turn your SIMD vectors into a scalar before you can store it in `C`.
2. You're moving rapidly across the memory in the arrays, because you're loading SIMD vectors from both of them. You want to minimize memory bandwidth requirements.

For reference, `A * B` is basically doing:
`C[1:8,1] .= A[1:8,1] * B[1,1] .+ A[1:8,2] * B[2,1] .+ A[1:8,3] * B[3,1] .+ ...`
Compared to `A' * B`, this means that the SIMD vectors from `A` can be both loaded and stored directly into `C` without the need for any extra processing, and that we're loading elements from `B` very slowly; memory bandwidth is much less of an issue for matrix `B`.
While for `A' * B`, we have
`C[1,1] = sum( A[1:8,1] .* B[1:8,1] .+ A[9:16,1] .* B[9:16,1] .+ A[17:24,1] .* B[17:24,1] .+ ...)`
requiring the extra `sum` at the end, on top of the more rapid memory churn.
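Those two update patterns can be sketched as plain Julia loops (my own function names; the real kernels work on SIMD registers rather than array slices):

```julia
# Column-major pattern for C = A*B: scale a column of A by a scalar
# from B and accumulate straight into a column of C; no final reduction.
function axpy_mul!(C, A, B)
    fill!(C, zero(eltype(C)))
    for n in axes(B, 2), k in axes(A, 2)
        @views C[:, n] .+= A[:, k] .* B[k, n]
    end
    return C
end

# Dot-product pattern for C = A'*B: elementwise products of two columns
# must be collapsed with a horizontal `sum` before each store into C.
function dot_mul!(C, A, B)
    for n in axes(B, 2), m in axes(A, 2)
        C[m, n] = @views sum(A[:, m] .* B[:, n])
    end
    return C
end
```

The first version never needs a horizontal reduction; the second pays for one `sum` per element of `C`, on top of streaming SIMD loads from both inputs.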

If youβre on Linux, you can follow this example by using the master branch of `LinuxPerf.jl` (the `@pstats` macro isnβt in the latest release). For each `@pstats` below, I ran it twice to make sure nothing was compiling. `A * B`:

```julia
julia> using LinuxPerf

julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf! (generic function with 1 method)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
           foreachf!(AmulB!, 10_000, C, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.97e+09   60.1%  #  4.1 cycles per ns
┌ instructions             6.57e+09   60.2%  #  3.3 insns per cycle
│ branch-instructions      1.40e+08   60.2%  #  2.1% of instructions
└ branch-misses            1.09e+06   60.2%  #  0.8% of branch instructions
┌ task-clock               4.82e+08  100.0%  # 481.9 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

Now `A' * B`:

```julia
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
           foreachf!(AmulB!, 10_000, C, A', B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.84e+09   59.9%  #  4.1 cycles per ns
┌ instructions             8.03e+09   60.0%  #  2.8 insns per cycle
│ branch-instructions      1.96e+08   60.0%  #  2.4% of instructions
└ branch-misses            8.68e+06   60.0%  #  4.4% of branch instructions
┌ task-clock               6.98e+08  100.0%  # 697.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

Note how for `A' * B` we needed `803_000` instructions per call (`8.03e9` to do it 10_000 times), but for `A * B` we only needed `657_000` instructions. That's because, as outlined above, we can store directly from `A` into `C` without any extra work.
On top of that, we ran at 3.3 instructions per clock cycle for `A * B` with 27.4% L1-data cache misses, while `A' * B` only hit 2.8 instructions per clock cycle (on top of needing many more instructions!), with 51.2% L1-data cache misses. That's the cost of the rapid memory churn required.

The library `PaddedMatrices.jl` uses `LoopVectorization.jl` to define an at least half-serious attempt at matrix multiplication. It can calculate `A' * B` much faster than the `AmulB!` we defined above:

```julia
julia> @benchmark PaddedMatrices.jmul!($C, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     55.666 μs (0.00% GC)
  median time:      55.923 μs (0.00% GC)
  mean time:        56.026 μs (0.00% GC)
  maximum time:     125.665 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

How? Itβs actually faster to transpose `A`, copying the memory to a new location, and then doing a column major multiplication. So that is what it does.
The transpose currently is slower than it should be; MKL is much faster:

```julia
julia> using LinearAlgebra; BLAS.set_num_threads(1); BLAS.vendor()
:mkl

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     47.968 μs (0.00% GC)
  median time:      48.065 μs (0.00% GC)
  mean time:        48.155 μs (0.00% GC)
  maximum time:     114.288 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     51.489 μs (0.00% GC)
  median time:      51.596 μs (0.00% GC)
  mean time:        51.682 μs (0.00% GC)
  maximum time:     117.158 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

but Iβm planning on working on this in the not too distant future.

The point I would like to make here is: if you don't know how libraries like MKL and OpenBLAS are implemented, don't put too much stock in comparisons using them. They all copy your data into more efficient data structures, so it doesn't normally matter in the end what the layout was initially.
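That copying step is known as packing. A toy illustration (my own simplified version, assuming nothing about the real OpenBLAS/MKL tile layouts):

```julia
using LinearAlgebra

# Blocked multiply that packs each block of A into a fresh contiguous
# buffer before using it. Because A is only touched through the packed
# copy, it makes little difference whether A itself was a plain Matrix
# or a transposed (effectively row-major) wrapper.
function packed_mul!(C, A, B; mc = 64, kc = 64)
    M, K = size(A)
    fill!(C, zero(eltype(C)))
    for k0 in 1:kc:K, m0 in 1:mc:M
        ms = m0:min(m0 + mc - 1, M)
        ks = k0:min(k0 + kc - 1, K)
        Apack = Matrix(@view A[ms, ks])   # the packing copy
        # C[ms, :] += Apack * B[ks, :]
        @views mul!(C[ms, :], Apack, B[ks, :], true, true)
    end
    return C
end
```

Real BLAS libraries pack into kernel-specific tile layouts (and pack `B` too), so the input layout mostly affects only this cheap copy, not the O(M·K·N) inner kernel.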
