I have a question regarding the performance of matrix division (/).
During some calculations, I noticed a huge impact of 3x3-Matrix divisions compared to expectations I had from a similar matlab program. Therefore, I made some test with / as well as \. I noticed significant performance improvements for A/B if I manually step through the lines of A.
Here the Code:

using LinearAlgebra
using BenchmarkTools
function funDiv1(A, B)
return C = A/B
end
function funDiv2(A,B)
C = similar(A)
LineNum = size(A, 1)
for iL = 1:LineNum
C[iL, :] = A[iL, :]'/B
end
return C
end
A = convert(Matrix{Float64}, [1 3 7; 17 28 33; 2 0 7])
B = convert(Matrix{Float64}, [2 5 23; 1 45 2; 4 2 12])
Bf = factorize(B)
bench1 = @benchmark funDiv1($A, $Bf)
bench2 = @benchmark funDiv2($A, $Bf)

On my system, the median time is about 13 micro seconds for funDiv1 and .5 for funDiv2.
Is there a good reason for the additional computation time or can I use the second function for a performance improvement?

Right now the implementation for A/B works by computing (B' \ A')', which adds some overhead for small matrices. It should be straightforward to write a more specialized A/B method, and this would make a good PR.

No, I take that back: A \ B and A / B take about the same time even for 3x3 matrices, which indicates that the transposes aren’t the problem.

For the built-in Matrix type, it could be an OpenBLAS issue — OpenBLAS is optimized for large matrix–matrix operations, and so when it sees the tridiagonal solve that arises from A / B it may spend a lot of overhead deciding on the optimal blocking etcetera.

StaticArrays avoids all of this kind of overhead for small matrices, however, so if A / B is still slower than operating column-by-column with an SMatrix then this might be an issue in StaticArrays.

After looking deeper into the code of StaticArrays, I think I found the cause. solve.jl introduces specialized functions for calculations with vectors up to the length of 3. But these are not applied when A (in B\A) is a Matrix.

If the dimensions of B is 4 or higher, the performance advantage changes and funDiv1 is faster than funDiv2. Though, this is only true for StaticArrays. I tested up to a dimension of 50 and the performance of Div2 is still better than Div1 for normal Arrays.

Another issue I found with StaticArrays: While it supports lu-factorization, the factorized matrix B can only be used with B\A and not with A/B.

I will try to make a PR for StaticArrays. For the build-in Matrix, I am a bit lost in the code. Should I open an issue for Julia? or even OpenBLAS?

StaticArrays methods (at least) require that the size of the output matrix be knowable from the types of the input arrays. This is not possible if A is a Matrix.

Agree, StaticArrays is missing some methods here. Thanks for trying to make a PR.

But isn’t the size of the output matrix the same as A and thus, if the size of A is known from its type, the complier can know the return value as well?

Ok, I guess I was unclear, I meant that A is (as well as B) a StaticMatrix and not a StaticVector.

Your are right, that is a case I didn’t think of. I’m considering the case of B beeing square and of the dimension 3x3 or smaller. This file includes optimized code for the case that B is 3x3 and A 3x1. And I think it can be extended to A beeing 3xn similar to the function funDiv2.