Performance of matrix division

I have a question regarding the performance of matrix division (/).
During some calculations, I noticed that 3x3 matrix divisions had a huge impact compared to what I expected from a similar MATLAB program. I therefore ran some tests with / as well as \. I noticed significant performance improvements for A/B if I manually step through the rows of A.
Here is the code:

using LinearAlgebra

using BenchmarkTools

function funDiv1(A, B)
    return A / B
end

function funDiv2(A, B)
    C = similar(A)
    LineNum = size(A, 1)
    for iL = 1:LineNum
        C[iL, :] = A[iL, :]' / B
    end
    return C
end

A = convert(Matrix{Float64}, [1 3 7; 17 28 33; 2 0 7])
B = convert(Matrix{Float64}, [2 5 23; 1 45 2; 4 2 12])
Bf = factorize(B)

bench1 = @benchmark funDiv1($A, $Bf)
bench2 = @benchmark funDiv2($A, $Bf)

On my system, the median time is about 13 microseconds for funDiv1 and 0.5 microseconds for funDiv2.
Is there a good reason for the additional computation time or can I use the second function for a performance improvement?
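As a sanity check (a self-contained sketch that redefines the two functions from above), both approaches do agree on the result:

```julia
using LinearAlgebra

function funDiv1(A, B)
    return A / B
end

function funDiv2(A, B)
    C = similar(A)
    for iL = 1:size(A, 1)
        C[iL, :] = A[iL, :]' / B
    end
    return C
end

A = Float64[1 3 7; 17 28 33; 2 0 7]
B = Float64[2 5 23; 1 45 2; 4 2 12]

# A / B is the X solving X*B = A, so X*B must reproduce A.
@assert funDiv1(A, B) ≈ funDiv2(A, B)
@assert funDiv1(A, B) * B ≈ A
```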

For 3x3 arrays, try using StaticArrays.jl:

julia> using StaticArrays

julia> A = SMatrix{3,3, Float64}([1 3 7; 17 28 33; 2 0 7]);

julia> B = SMatrix{3,3, Float64}([2 5 23; 1 45 2; 4 2 12]);

julia> @btime $A / $B
  105.613 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Float64,2,9}:
  0.237344    0.0348455  0.122617
 -1.02761     0.530572   4.63116
  0.0571992  -0.0276134  0.478304

Thanks, this is indeed a huge improvement. I now get:

julia> A = SMatrix{3,3, ComplexF64}([1 3 7; 17 28 33; 2 0 7]);

julia> B = SMatrix{3,3, ComplexF64}([2 5 23; 1 45 2; 4 2 12]);

julia> @benchmark funDiv1($A, $B)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     579.459 ns (0.00% GC)
  median time:      580.541 ns (0.00% GC)
  mean time:        604.950 ns (0.00% GC)
  maximum time:     1.915 μs (0.00% GC)
  samples:          10000
  evals/sample:     185

julia> @benchmark funDiv2($A, $B)
  memory estimate:  160 bytes
  allocs estimate:  1
  minimum time:     353.521 ns (0.00% GC)
  median time:      387.324 ns (0.00% GC)
  mean time:        405.726 ns (0.00% GC)
  maximum time:     1.650 μs (0.00% GC)
  samples:          10000
  evals/sample:     213

So the second function is still faster. Is there a downside to this approach?

Right now the implementation for A/B works by computing (B' \ A')', which adds some overhead for small matrices. It should be straightforward to write a more specialized A/B method, and this would make a good PR.
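That fallback identity is easy to check numerically (a small sketch, independent of StaticArrays):

```julia
using LinearAlgebra

A = rand(3, 3)
B = rand(3, 3) + 3I   # shift the diagonal to keep B comfortably invertible

# A / B is the X with X*B == A; the generic fallback computes it
# as the adjoint of the transposed solve B' \ A'.
@assert A / B ≈ (B' \ A')'
@assert (A / B) * B ≈ A
```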

No, I take that back: A \ B and A / B take about the same time even for 3x3 matrices, which indicates that the transposes aren’t the problem.

For the built-in Matrix type, it could be an OpenBLAS issue — OpenBLAS is optimized for large matrix–matrix operations, so when it sees the triangular solves that arise from A / B it may spend a lot of overhead deciding on the optimal blocking, etcetera.

StaticArrays avoids all of this kind of overhead for small matrices, however, so if A / B is still slower than operating column-by-column with an SMatrix then this might be an issue in StaticArrays.

After looking deeper into the code of StaticArrays, I think I found the cause. solve.jl introduces specialized functions for calculations with vectors up to length 3. But these are not applied when A (in B \ A) is a matrix.

If the dimension of B is 4 or higher, the performance advantage flips and funDiv1 is faster than funDiv2. However, this only holds for StaticArrays: for ordinary Arrays I tested up to dimension 50, and funDiv2 is still faster than funDiv1.
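For anyone wanting to reproduce the crossover for ordinary Arrays, here is a rough sketch using only Base's @elapsed (BenchmarkTools gives more reliable numbers, and the exact crossover point is machine-dependent):

```julia
using LinearAlgebra

funDiv1(A, B) = A / B

function funDiv2(A, B)
    C = similar(A)
    for iL = 1:size(A, 1)
        C[iL, :] = A[iL, :]' / B
    end
    return C
end

for n in (3, 10, 50)
    A = rand(n, n)
    B = rand(n, n) + n * I           # keep B well-conditioned
    funDiv1(A, B); funDiv2(A, B)     # warm up / compile first
    t1 = @elapsed funDiv1(A, B)
    t2 = @elapsed funDiv2(A, B)
    @assert funDiv1(A, B) ≈ funDiv2(A, B)   # results always agree
    println("n = $n: funDiv1 $(t1)s, funDiv2 $(t2)s")
end
```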

Another issue I found with StaticArrays: while it supports LU factorization, the factorized matrix B can only be used as B \ A and not as A / B.

I will try to make a PR for StaticArrays. For the built-in Matrix, I am a bit lost in the code. Should I open an issue for Julia, or even OpenBLAS?

StaticArrays methods (at least) require that the size of the output matrix be knowable from the types of the input arrays. This is not possible if A is a Matrix.

Agree, StaticArrays is missing some methods here. Thanks for trying to make a PR.

By the way, you can do:

A = Float64.(@SMatrix [1 3 7; 17 28 33; 2 0 7])
B = Float64.(@SMatrix [2 5 23; 1 45 2; 4 2 12])

to construct the matrices more efficiently and concisely.

But isn’t the size of the output matrix the same as that of A? So if the size of A is known from its type, shouldn’t the compiler know the size of the return value as well?

Not in general, since \ also applies to non-square matrices:

julia> B = rand(2, 3); A = rand(2, 4);

julia> B \ A
3×4 Array{Float64,2}:
  0.433586  0.609795   0.172143   0.345393
  0.484587  0.609794   0.0237078  0.338466
 -0.315217  0.0503478  1.0358     0.0761907

Also, we were considering the case when A is a Matrix, so the size of A is not known from its type.

OK, I guess I was unclear: I meant that A (as well as B) is a StaticMatrix, not a StaticVector.

You are right, that is a case I didn’t think of. I’m considering the case of B being square and of dimension 3x3 or smaller.
This file includes optimized code for the case that B is 3x3 and A is 3x1, and I think it can be extended to A being 3xn, similar to the function funDiv2.