I'm going to focus on matrix-matrix multiplication instead of matrix-vector.
See the LoopVectorization benchmarks here for results across a range of sizes.
Picking one size (144x144), let's look a little closer at a very simple implementation of matrix multiplication, to guarantee that nothing too fancy is going on:
julia> using LoopVectorization, PaddedMatrices
julia> M = K = N = 144; C = Matrix{Float64}(undef, M, N); A = rand(M, K); B = rand(K, N);
julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(A,1), n ∈ axes(B,2)
               Cmn = zero(eltype(C))
               for k ∈ axes(A,2)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)
julia> @benchmark AmulB!($C, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 47.615 μs (0.00% GC)
median time: 47.789 μs (0.00% GC)
mean time: 47.895 μs (0.00% GC)
maximum time: 129.122 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark AmulB!($C, $A', $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 69.159 μs (0.00% GC)
median time: 69.557 μs (0.00% GC)
mean time: 69.722 μs (0.00% GC)
maximum time: 129.797 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
The column-major version was about 45% faster!
Why is that? Because keeping A column major is actually a great way to multiply matrices. In fact, even A' * B' is better than A' * B:
julia> @benchmark AmulB!($C, $A', $B')
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 59.039 μs (0.00% GC)
median time: 60.140 μs (0.00% GC)
mean time: 59.920 μs (0.00% GC)
maximum time: 117.935 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Because A' * B' is like (B * A)'.
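A quick numeric check of that identity (A2 and B2 here are just throwaway matrices for illustration):
julia> A2 = rand(4, 4); B2 = rand(4, 4);
julia> A2' * B2' ≈ (B2 * A2)'
true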
Why is this? There are two major problems with A' * B:
- Dot products aren't that efficient with respect to maximizing SIMD: you do a bunch of SIMD operations, but then at the end you need to do a bunch of extra operations to turn your SIMD vectors into a scalar before you can store it in C.
- You're moving rapidly across the memory in the arrays, because you're loading SIMD vectors from both of them. You want to minimize memory bandwidth requirements.
For reference, A * B is basically doing:
C[1:8,1] .= A[1:8,1] * B[1,1] .+ A[1:8,2] * B[2,1] .+ A[1:8,3] * B[3,1] .+ ...
Compared to A' * B, this means that the SIMD vectors from A can be both loaded and stored directly into C without the need for any extra processing, and that we're loading elements from B very slowly; memory bandwidth is much less of an issue for matrix B.
While for A' * B, we have
C[1,1] = sum( A[1:8,1] .* B[1:8,1] .+ A[9:16,1] .* B[9:16,1] .+ A[17:24,1] .* B[17:24,1] .+ ...)
requiring the extra sum at the end, on top of the more rapid memory churn.
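To make the two access patterns concrete, here are plain scalar versions of both loop orderings (illustrative sketches only; mul_axpy! and mul_dot! are hypothetical names, and these are not the kernels @avx actually generates):

# C = A * B with the m loop innermost: unit-stride loads from A and stores to C,
# one scalar from B per pass down a column, and no reduction needed.
function mul_axpy!(C, A, B)
    fill!(C, zero(eltype(C)))
    for n in axes(B, 2), k in axes(A, 2)
        b = B[k, n]
        for m in axes(A, 1)
            C[m, n] += A[m, k] * b
        end
    end
    return C
end

# C = A' * B with the k loop innermost: unit-stride loads from both A and B,
# but when this loop is vectorized, the SIMD accumulator has to be horizontally
# summed down to a scalar before the store into C.
function mul_dot!(C, A, B)
    for n in axes(B, 2), m in axes(A, 2)
        acc = zero(eltype(C))
        for k in axes(A, 1)
            acc += A[k, m] * B[k, n]
        end
        C[m, n] = acc
    end
    return C
end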
If you're on Linux, you can follow this example by using the master branch of LinuxPerf.jl (the @pstats macro isn't in the latest release). For each @pstats below, I ran it twice to make sure nothing was compiling. A * B:
julia> using LinuxPerf
julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N)) # call f(args...) N times
foreachf! (generic function with 1 method)
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(AmulB!, 10_000, C, A, B)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.97e+09   60.1%  #  4.1 cycles per ns
┌ instructions             6.57e+09   60.2%  #  3.3 insns per cycle
│ branch-instructions      1.40e+08   60.2%  #  2.1% of instructions
└ branch-misses            1.09e+06   60.2%  #  0.8% of branch instructions
┌ task-clock               4.82e+08  100.0%  # 481.9 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    5.72e+08   19.9%  # 27.4% of dcache loads
│ L1-dcache-loads          2.09e+09   19.9%
└ L1-icache-load-misses    2.10e+04   19.9%
┌ dTLB-load-misses         1.51e+01   19.9%  #  0.0% of dTLB loads
└ dTLB-loads               2.09e+09   19.9%
┌ iTLB-load-misses         1.15e+04   40.0%  # 5801.3% of iTLB loads
└ iTLB-loads               1.97e+02   40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Now A' * B:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(AmulB!, 10_000, C, A', B)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.84e+09   59.9%  #  4.1 cycles per ns
┌ instructions             8.03e+09   60.0%  #  2.8 insns per cycle
│ branch-instructions      1.96e+08   60.0%  #  2.4% of instructions
└ branch-misses            8.68e+06   60.0%  #  4.4% of branch instructions
┌ task-clock               6.98e+08  100.0%  # 697.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    7.99e+08   20.1%  # 51.2% of dcache loads
│ L1-dcache-loads          1.56e+09   20.1%
└ L1-icache-load-misses    2.52e+05   20.1%
┌ dTLB-load-misses         2.20e+02   20.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.56e+09   20.0%
┌ iTLB-load-misses         2.97e+03   39.9%  # 736.0% of iTLB loads
└ iTLB-loads               4.04e+02   39.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Note how for A' * B we needed 803_000 instructions per multiplication (8.03e9 to do it 10_000 times), but for A * B we only needed 657_000 instructions. That's because, as outlined above, we can store directly from A into C without any extra work.
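For scale, assuming 8 Float64 lanes per SIMD vector (an assumption about the benchmark machine, e.g. AVX-512, not something reported above), the arithmetic alone accounts for a bit over half of the A * B instruction count:
julia> fmas = 144^3 # one fused multiply-add per (m, k, n) triple
2985984
julia> fmas ÷ 8     # ≈ vector FMA instructions at 8 Float64s per vector
373248
That is roughly 373_000 of the 657_000 instructions per A * B call; the rest are loads, stores, and loop bookkeeping.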
On top of that, we ran at 3.3 instructions per clock cycle for A * B with 27.4% L1-dcache load misses, while A' * B only hit 2.8 instructions per clock cycle (on top of needing many more instructions!), with 51.2% L1-dcache load misses. That's the cost of the rapid memory churn required.
The library PaddedMatrices.jl uses LoopVectorization.jl to define an at least half-serious attempt at matrix multiplication. It can calculate A' * B much faster than the AmulB! we defined above:
julia> @benchmark PaddedMatrices.jmul!($C, $A', $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 55.666 μs (0.00% GC)
median time: 55.923 μs (0.00% GC)
mean time: 56.026 μs (0.00% GC)
maximum time: 125.665 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
How? It's actually faster to transpose A, copying the memory to a new location, and then do a column major multiplication. So that is what it does.
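A minimal sketch of that idea (not PaddedMatrices' actual implementation; buffer is a hypothetical scratch array):

function AtmulB_via_copy!(C, At, B, buffer = similar(C, size(At)...))
    copyto!(buffer, At)   # one strided pass to materialize A' contiguously
    AmulB!(C, buffer, B)  # then reuse the column-major kernel from above
    return C
end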
The transpose is currently slower than it should be; MKL is much faster:
julia> using LinearAlgebra; BLAS.set_num_threads(1); BLAS.vendor()
:mkl
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 47.968 μs (0.00% GC)
median time: 48.065 μs (0.00% GC)
mean time: 48.155 μs (0.00% GC)
maximum time: 114.288 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($C, $A', $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 51.489 μs (0.00% GC)
median time: 51.596 μs (0.00% GC)
mean time: 51.682 μs (0.00% GC)
maximum time: 117.158 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
but I'm planning on working on this in the not-too-distant future.
The point I would like to make here is: if you don't know how libraries like MKL and OpenBLAS are implemented, don't put too much stock in comparisons using them. They all copy your data into more efficient data structures, so it doesn't normally matter in the end what the layout was initially.
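For a rough idea of what that copying looks like (a schematic only, not MKL's or OpenBLAS's actual code, and the 48×48 tile size is just an example), such libraries pack tiles of the inputs into small contiguous buffers before running the inner kernel, so the kernel never sees the caller's original layout:

# Strided reads from X happen once, here; every later read of buf is unit-stride.
function pack_tile!(buf::AbstractMatrix, X::AbstractMatrix, rows, cols)
    @inbounds for (j, c) in enumerate(cols), (i, r) in enumerate(rows)
        buf[i, j] = X[r, c]
    end
    return buf
end

# e.g. pack the top-left 48×48 tile of A':
# buf = Matrix{Float64}(undef, 48, 48); pack_tile!(buf, A', 1:48, 1:48)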