# Effect of Column Major Storage on Multiplication

Hello,
I am multiplying a square matrix by a vector. I thought `A' * x` would be faster than `A * x`, since the former takes inner products of `x` with the columns of `A`, which are stored contiguously in Julia, whereas the latter takes inner products of the rows of `A` with `x`, and the rows are not stored contiguously.
Thanks.

```julia
julia> using BenchmarkTools

julia> N = 10000;

julia> A = randn(N, N);

julia> x = randn(N);

julia> @btime A*x;
  36.162 ms (2 allocations: 78.20 KiB)

julia> @btime A'*x;
  34.903 ms (3 allocations: 78.22 KiB)
```

Your benchmark does not violate your conjecture…? `A'x` executed faster.

The reason there's no significant difference is that `A * x` is of course not implemented as a bunch of row-by-column products. That's just a convenient way of explaining the product intuitively, but there are many ways of organizing the operations.
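To make that concrete, here is a sketch (my own toy functions, not Julia's actual implementation) of two loop orders that both compute `y = A * x`:

```julia
# Two loop orders for y = A*x. Both compute the same result;
# only the access pattern over A differs.
function matvec_rows!(y, A, x)   # one dot product per row: strided reads of A
    for i in axes(A, 1)
        s = zero(eltype(y))
        for j in axes(A, 2)
            s += A[i, j] * x[j]
        end
        y[i] = s
    end
    return y
end

function matvec_cols!(y, A, x)   # one axpy per column: contiguous reads of A
    fill!(y, zero(eltype(y)))
    for j in axes(A, 2), i in axes(A, 1)
        y[i] += A[i, j] * x[j]
    end
    return y
end
```

A library is free to use the column-oriented order (or a blocked variant) regardless of whether you wrote `A * x` or `A' * x`, which is why the "row times column" picture doesn't predict performance.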


Iβm going to focus on matrix-matrix multiplication instead of matrix-vector.

See the LoopVectorization benchmarks here for results across a range of sizes.

Picking one size (144x144), let's look a little closer at a very simple implementation of matrix multiplication, to guarantee that nothing too fancy is going on:

```julia
julia> using LoopVectorization, PaddedMatrices

julia> M = K = N = 144; C = Matrix{Float64}(undef, M, N); A = rand(M, K); B = rand(K, N);

julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(A,1), n ∈ axes(B,2)
               Cmn = zero(eltype(C))
               for k ∈ axes(A,2)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)
```

```julia
julia> @benchmark AmulB!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     47.615 μs (0.00% GC)
  median time:      47.789 μs (0.00% GC)
  mean time:        47.895 μs (0.00% GC)
  maximum time:     129.122 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark AmulB!($C, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     69.159 μs (0.00% GC)
  median time:      69.557 μs (0.00% GC)
  mean time:        69.722 μs (0.00% GC)
  maximum time:     129.797 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

The column-major version was about 45% faster!

Why is that? Because a column-major `A` is actually great for multiplying matrices. In fact, even `A' * B'` is faster than `A' * B`:

```julia
julia> @benchmark AmulB!($C, $A', $B')
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     59.039 μs (0.00% GC)
  median time:      60.140 μs (0.00% GC)
  mean time:        59.920 μs (0.00% GC)
  maximum time:     117.935 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

Because `A' * B'` is like `(B * A)'`.
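That identity is easy to check numerically:

```julia
A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]

# Transposing both operands is the same as computing the product
# in the opposite order and transposing the result once:
@assert A' * B' ≈ (B * A)'
```

So `A' * B'` can be evaluated as a plain column-major product `B * A` followed by a single transpose of the result.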

Why is this? There are two major problems with `A' * B`:

1. Dot products aren't that efficient with respect to maximizing SIMD: you do a bunch of SIMD operations, but then at the end you need to do a bunch of extra operations to turn your SIMD vectors into a scalar before you can store it in `C`.
2. You're moving rapidly across the memory in the arrays, because you're loading SIMD vectors from both of them. You want to minimize memory bandwidth requirements.

For reference, `A * B` is basically doing:
`C[1:8,1] .= A[1:8,1] * B[1,1] .+ A[1:8,2] * B[2,1] .+ A[1:8,3] * B[3,1] .+ ...`
Compared to `A' * B`, this means that the SIMD vectors from `A` can be both loaded and stored directly into `C` without the need for any extra processing, and that we're loading elements from `B` very slowly; memory bandwidth is much less of an issue for matrix `B`.
While for `A' * B`, we have
`C[1,1] = sum( A[1:8,1] .* B[1:8,1] .+ A[9:16,1] .* B[9:16,1] .+ A[17:24,1] .* B[17:24,1] .+ ...)`
requiring the extra `sum` at the end, on top of the more rapid memory churn.
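Those two update patterns can be sketched as plain Julia loops (my own function names; the real kernels work on SIMD registers rather than array slices):

```julia
# Column-major pattern for C = A*B: scale a column of A by a scalar
# from B and accumulate straight into a column of C; no final reduction.
function axpy_mul!(C, A, B)
    fill!(C, zero(eltype(C)))
    for n in axes(B, 2), k in axes(A, 2)
        @views C[:, n] .+= A[:, k] .* B[k, n]
    end
    return C
end

# Dot-product pattern for C = A'*B: elementwise products of two columns
# must be collapsed with a horizontal `sum` before each store into C.
function dot_mul!(C, A, B)
    for n in axes(B, 2), m in axes(A, 2)
        C[m, n] = @views sum(A[:, m] .* B[:, n])
    end
    return C
end
```

The first version never needs a horizontal reduction; the second pays for one `sum` per element of `C`, on top of streaming SIMD loads from both inputs.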

If youβre on Linux, you can follow this example by using the master branch of `LinuxPerf.jl` (the `@pstats` macro isnβt in the latest release). For each `@pstats` below, I ran it twice to make sure nothing was compiling. `A * B`:

```julia
julia> using LinuxPerf

julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf! (generic function with 1 method)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
           foreachf!(AmulB!, 10_000, C, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.97e+09   60.1%  #  4.1 cycles per ns
┌ instructions             6.57e+09   60.2%  #  3.3 insns per cycle
│ branch-instructions      1.40e+08   60.2%  #  2.1% of instructions
└ branch-misses            1.09e+06   60.2%  #  0.8% of branch instructions
┌ task-clock               4.82e+08  100.0%  # 481.9 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

Now `A' * B`:

```julia
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
           foreachf!(AmulB!, 10_000, C, A', B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.84e+09   59.9%  #  4.1 cycles per ns
┌ instructions             8.03e+09   60.0%  #  2.8 insns per cycle
│ branch-instructions      1.96e+08   60.0%  #  2.4% of instructions
└ branch-misses            8.68e+06   60.0%  #  4.4% of branch instructions
┌ task-clock               6.98e+08  100.0%  # 697.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

Note how for `A' * B` we needed `803_000` instructions per call (`8.03e9` to do it 10_000 times), but for `A * B` we only needed `657_000` instructions. That's because, as outlined above, we can store directly from `A` into `C` without any extra work.
On top of that, we ran at 3.3 instructions per clock cycle for `A * B` with 27.4% L1-data cache misses, while `A' * B` only hit 2.8 instructions per clock cycle (on top of needing many more instructions!), with 51.2% L1-data cache misses. That's the cost of the rapid memory churn required.

The library `PaddedMatrices.jl` uses `LoopVectorization.jl` to define an at least half-serious attempt at matrix multiplication. It can calculate `A' * B` much faster than the `AmulB!` we defined above:

```julia
julia> @benchmark PaddedMatrices.jmul!($C, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     55.666 μs (0.00% GC)
  median time:      55.923 μs (0.00% GC)
  mean time:        56.026 μs (0.00% GC)
  maximum time:     125.665 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

How? Itβs actually faster to transpose `A`, copying the memory to a new location, and then doing a column major multiplication. So that is what it does.
The transpose currently is slower than it should be; MKL is much faster:

```julia
julia> using LinearAlgebra; BLAS.set_num_threads(1); BLAS.vendor()
:mkl

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     47.968 μs (0.00% GC)
  median time:      48.065 μs (0.00% GC)
  mean time:        48.155 μs (0.00% GC)
  maximum time:     114.288 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C, $A', $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     51.489 μs (0.00% GC)
  median time:      51.596 μs (0.00% GC)
  mean time:        51.682 μs (0.00% GC)
  maximum time:     117.158 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

but Iβm planning on working on this in the not too distant future.

The point I would like to make here is: if you don't know how libraries like MKL and OpenBLAS are implemented, don't put too much stock in comparisons using them. They all copy your data into more efficient data structures, so it doesn't normally matter in the end what the layout was initially.
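That copying step is known as packing. A toy illustration (my own simplified version, assuming nothing about the real OpenBLAS/MKL tile layouts):

```julia
using LinearAlgebra

# Blocked multiply that packs each block of A into a fresh contiguous
# buffer before using it. Because A is only touched through the packed
# copy, it makes little difference whether A itself was a plain Matrix
# or a transposed (effectively row-major) wrapper.
function packed_mul!(C, A, B; mc = 64, kc = 64)
    M, K = size(A)
    fill!(C, zero(eltype(C)))
    for k0 in 1:kc:K, m0 in 1:mc:M
        ms = m0:min(m0 + mc - 1, M)
        ks = k0:min(k0 + kc - 1, K)
        Apack = Matrix(@view A[ms, ks])   # the packing copy
        # C[ms, :] += Apack * B[ks, :]
        @views mul!(C[ms, :], Apack, B[ks, :], true, true)
    end
    return C
end
```

Real BLAS libraries pack into kernel-specific tile layouts (and pack `B` too), so the input layout mostly affects only this cheap copy, not the O(M·K·N) inner kernel.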
