Performance of matrix division


#1

I have a question regarding the performance of matrix division (/).
During some calculations, I noticed that 3x3 matrix divisions took much longer than I expected based on a similar Matlab program. Therefore, I made some tests with / as well as \. I noticed significant performance improvements for A/B if I manually step through the rows of A.
Here is the code:

using LinearAlgebra

using BenchmarkTools


# Right division done directly
function funDiv1(A, B)
    return A / B
end

# Right division done row by row: each row of A is divided by B separately
function funDiv2(A, B)
    C = similar(A)
    LineNum = size(A, 1)
    for iL = 1:LineNum
        # use transpose (not ') so complex entries are not conjugated
        C[iL, :] = transpose(A[iL, :]) / B
    end
    return C
end

A = convert(Matrix{Float64}, [1 3 7; 17 28 33; 2 0 7])
B = convert(Matrix{Float64}, [2 5 23; 1 45 2; 4 2 12])
Bf = factorize(B)

bench1 = @benchmark funDiv1($A, $Bf)
bench2 = @benchmark funDiv2($A, $Bf)

On my system, the median time is about 13 microseconds for funDiv1 and about 0.5 microseconds for funDiv2.
Is there a good reason for the additional computation time or can I use the second function for a performance improvement?


#2

For 3x3 matrices, try using StaticArrays.jl:

julia> using StaticArrays

julia> A = SMatrix{3,3, Float64}([1 3 7; 17 28 33; 2 0 7]);

julia> B = SMatrix{3,3, Float64}([2 5 23; 1 45 2; 4 2 12]);

julia> @btime $A / $B
  105.613 ns (0 allocations: 0 bytes)
3×3 SArray{Tuple{3,3},Float64,2,9}:
  0.237344    0.0348455  0.122617
 -1.02761     0.530572   4.63116
  0.0571992  -0.0276134  0.478304

#3

Thanks, this is indeed a huge improvement. I now get:

julia> A = SMatrix{3,3, ComplexF64}([1 3 7; 17 28 33; 2 0 7]);

julia> B = SMatrix{3,3, ComplexF64}([2 5 23; 1 45 2; 4 2 12]);

julia> @benchmark funDiv1($A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     579.459 ns (0.00% GC)
  median time:      580.541 ns (0.00% GC)
  mean time:        604.950 ns (0.00% GC)
  maximum time:     1.915 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     185

julia> @benchmark funDiv2($A, $B)
BenchmarkTools.Trial:
  memory estimate:  160 bytes
  allocs estimate:  1
  --------------
  minimum time:     353.521 ns (0.00% GC)
  median time:      387.324 ns (0.00% GC)
  mean time:        405.726 ns (0.00% GC)
  maximum time:     1.650 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     213

So the second function is still faster. Is there a downside to this approach?


#4

Right now the implementation for A/B works by computing (B' \ A')', which adds some overhead for small matrices. It should be straightforward to write a more specialized A/B method, and this would make a good PR.
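
For reference, here is a quick way to check that identity (with arbitrary random matrices, just for illustration):

using LinearAlgebra
A = rand(3, 3); B = rand(3, 3)   # arbitrary example matrices
A / B ≈ (B' \ A')'               # true: both sides solve X*B = A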

No, I take back the suggestion that a specialized method would help: A \ B and A / B take about the same time even for 3x3 matrices, which indicates that the transposes aren’t the problem.

For the built-in Matrix type, it could be an OpenBLAS issue: OpenBLAS is optimized for large matrix–matrix operations, so when it sees the small triangular solves that arise from A / B it may spend a lot of overhead deciding on the optimal blocking, etcetera.

StaticArrays avoids this kind of overhead for small matrices, however, so if A / B is still slower than operating row by row with an SMatrix then this might be an issue in StaticArrays.


#5

After looking deeper into the code of StaticArrays, I think I found the cause: solve.jl introduces specialized functions for solves with vectors up to length 3, but these are not applied when A (in B\A) is a matrix.
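
As a rough sketch of how that specialized path could still be reached, one can do the right division row by row so that every solve has an SVector right-hand side (funDiv2_static is just a name made up for this example, and I have not benchmarked it, so take it as a sketch rather than a measured fix):

using StaticArrays, LinearAlgebra

# Sketch: compute A / B for 3x3 SMatrices by solving transpose(B) \ (row of A)
# for every row. X*B = A is equivalent to transpose(B)*transpose(X) = transpose(A),
# and the columns of transpose(A) are the rows of A, so each solve has an
# SVector right-hand side and can hit the specialized small-size kernels.
function funDiv2_static(A::SMatrix{3,3}, B::SMatrix{3,3})
    Bt = SMatrix{3,3}(transpose(B))      # materialize transpose(B) as an SMatrix
    x1 = Bt \ A[1, :]                    # A[i, :] is an SVector{3}
    x2 = Bt \ A[2, :]
    x3 = Bt \ A[3, :]
    return SMatrix{3,3}(transpose(hcat(x1, x2, x3)))  # transpose back to get A / B
end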

If the dimension of B is 4 or higher, the performance advantage flips and funDiv1 is faster than funDiv2. However, this is only true for StaticArrays: I tested up to dimension 50, and funDiv2 is still faster than funDiv1 for regular Arrays.

Another issue I found with StaticArrays: while it supports LU factorization, the factorized matrix B can only be used with B\A and not with A/B.

I will try to make a PR for StaticArrays. For the built-in Matrix, I am a bit lost in the code. Should I open an issue for Julia, or even OpenBLAS?


#6

StaticArrays methods (at least) require that the size of the output matrix be knowable from the types of the input arrays. This is not possible if A is a Matrix.
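
To illustrate the point: the size of an SMatrix is encoded in its type, whereas a plain Matrix only knows its size at run time, so a method mixing the two cannot infer a static output size. For example:

using StaticArrays

typeof(@SMatrix zeros(3, 3))  # the size (3, 3) is part of the type
typeof(zeros(3, 3))           # only the element type is in the type; the size is a runtime value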

Agree, StaticArrays is missing some methods here. Thanks for trying to make a PR.

By the way, you can do:

A = Float64.(@SMatrix [1 3 7; 17 28 33; 2 0 7])
B = Float64.(@SMatrix [2 5 23; 1 45 2; 4 2 12])

to construct the matrices more efficiently and concisely.
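
Since the later benchmarks use ComplexF64, the same pattern works there as well, e.g.:

A = ComplexF64.(@SMatrix [1 3 7; 17 28 33; 2 0 7])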


#7

But isn’t the size of the output matrix the same as that of A, so that if the size of A is known from its type, the compiler can know the size of the return value as well?


#8

Not in general, since \ also applies to non-square matrices:

julia> B = rand(2, 3); A = rand(2, 4);

julia> B \ A
3×4 Array{Float64,2}:
  0.433586  0.609795   0.172143   0.345393
  0.484587  0.609794   0.0237078  0.338466
 -0.315217  0.0503478  1.0358     0.0761907

Also, we were considering the case when A is a Matrix, so the size of A is not known from its type.


#9

Ok, I guess I was unclear: I meant that A (as well as B) is a StaticMatrix, not a StaticVector.

You are right, that is a case I didn’t think of. I’m considering the case of B being square and of dimension 3x3 or smaller.
This file includes optimized code for the case that B is 3x3 and A is 3x1, and I think it can be extended to A being 3xn, similar to the function funDiv2.
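
A rough sketch of what such an extension could look like (ldiv_columnwise is just a placeholder name for this example, not an existing StaticArrays function):

using StaticArrays

# Sketch: extend the 3x3 \ 3-vector kernel to a 3xN right-hand side by solving
# column by column, so every column goes through the specialized vector solve,
# and then reassembling the result into an SMatrix.
function ldiv_columnwise(B::SMatrix{3,3}, A::SMatrix{3,N}) where {N}
    cols = ntuple(j -> B \ A[:, j], Val(N))   # each A[:, j] is an SVector{3}
    return hcat(cols...)                      # SMatrix{3,N} result
end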