Performance inv & solve in Julia & R

How can it be:

Matrix inv(A) where A 150x150 matrix twice slower (2.643 ms) than in R (1.049 ms). But invercing by blocks in R ( 3.275 ms) slower than in Julia (58.917 μs) many times. Second not surprising, but first… How peroformance like in R can be achived?

using RCall
using BlockDiagonals
using LinearAlgebra
using BenchmarkTools

blocks  = [rand(3,3) for i in 1:50]
BDm     = BlockDiagonal(blocks)
V       = Matrix(BDm)
sBDm    = BlockDiagonal(sblocks)
@rput V
@rput blocks

b1 = @benchmark inv(V)
b2 = @benchmark inv.(blocks)
b3 = @benchmark inv(BDm)
b4 = @benchmark R"solve(V)"
b5 = @benchmark R"for (i in 1:50) {solve(blocks[[i]])}"

#=
julia> b1
BenchmarkTools.Trial:
  memory estimate:  252.41 KiB
  allocs estimate:  6
  --------------
  minimum time:     1.297 ms (0.00% GC)
  median time:      1.610 ms (0.00% GC)
  mean time:        2.643 ms (0.29% GC)
  maximum time:     305.198 ms (0.00% GC)
  --------------
  samples:          1888
  evals/sample:     1

julia> b2
BenchmarkTools.Trial:
  memory estimate:  99.73 KiB
  allocs estimate:  253
  --------------
  minimum time:     32.200 μs (0.00% GC)
  median time:      39.999 μs (0.00% GC)
  mean time:        58.917 μs (9.61% GC)
  maximum time:     4.830 ms (98.32% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> b3
BenchmarkTools.Trial:
  memory estimate:  99.73 KiB
  allocs estimate:  253
  --------------
  minimum time:     31.900 μs (0.00% GC)
  median time:      39.299 μs (0.00% GC)
  mean time:        56.484 μs (9.13% GC)
  maximum time:     4.496 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> b4
BenchmarkTools.Trial:
  memory estimate:  1.06 KiB
  allocs estimate:  30
  --------------
  minimum time:     637.901 μs (0.00% GC)
  median time:      783.800 μs (0.00% GC)
  mean time:        1.049 ms (0.00% GC)
  maximum time:     32.893 ms (0.00% GC)
  --------------
  samples:          4713
  evals/sample:     1

julia> b5
BenchmarkTools.Trial:
  memory estimate:  1.06 KiB
  allocs estimate:  30
  --------------
  minimum time:     1.916 ms (0.00% GC)
  median time:      2.508 ms (0.00% GC)
  mean time:        3.275 ms (0.00% GC)
  maximum time:     26.751 ms (0.00% GC)
  --------------
  samples:          1520
  evals/sample:     1

julia>
=#
1 Like

This isn’t an answer, but it is worth noting that inv it’s almost always the wrong tool to use.

1 Like

Hm… what should i use to get inverted matrix? :upside_down_face:

In general, you should use b\a instead of inv(a)*b. If you need to store it for repeated use, you should use a factorization. This will typically be faster, but also leads to more accurate results.

I’m using inverted matrix in operations:
A *inv(B)*C
A *inv(B)*y
r’*inv(B)*r

For A\B i have following results: less memory more time…

A = [1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]
B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10]
function f1(A,B)
    A\B
end
function f2(A,B)
    inv(A)*B
end
#=
ulia> @benchmark f1(A,B)
BenchmarkTools.Trial:
  memory estimate:  560 bytes
  allocs estimate:  4
  --------------
  minimum time:     19.501 μs (0.00% GC)
  median time:      30.900 μs (0.00% GC)
  mean time:        36.059 μs (0.00% GC)
  maximum time:     824.400 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f2(A,B)
BenchmarkTools.Trial:
  memory estimate:  3.13 KiB
  allocs estimate:  11
  --------------
  minimum time:     1.120 μs (0.00% GC)
  median time:      1.280 μs (0.00% GC)
  mean time:        2.035 μs (13.55% GC)
  maximum time:     572.610 μs (98.95% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia>
=#

Try B=lu!(A). That stores the factorization in A and the permutation vector in part of the LU structure B. The you can do your solves with B\b and reuse the factorization. Your function does A\b everytime and therefore repeats the factorization. B and A share storage for the matrix.

julia> A=[1.0 2.0; 3.0 4.0];
julia> B=lu!(A);
julia> typeof(B)
LU{Float64,Array{Float64,2}}
julia> typeof(A)
Array{Float64,2}

This paradigm is exactly the way LINPACK did it in Fortran 77.

4 Likes

I made some variants. Variant with m1 = B.U\(B.L\A) may be little better, but I surpriesed that ldiv!(m1, B, A) have lack of performance.

using BenchmarkTools, LinearAlgebra
"""
A' * B * A -> +θ 
A' * B * C -> +β
"""
function mulθβinc!(θ, β, A::AbstractMatrix, B::AbstractMatrix, C::Vector)
    q = size(B, 1)
    p = size(A, 2)
    c = zeros(eltype(B), q)
    for i = 1:p
        c .= 0
        for n = 1:q
            for m = 1:q
                @inbounds c[n] += B[m, n] * A[m, i]
            end
            @inbounds β[i] += c[n] * C[n]
        end
        for n = 1:p
            for m = 1:q
                @inbounds θ[i, n] += A[m, n] * c[m]
            end
        end
    end
    θ, β
end

function mulsimple!(θ, β, A::AbstractMatrix, B, C::Vector)
    q = size(B, 1)
    p = size(A, 2)
    m1 = B.U\(B.L\A)
    m2 = B.U\(B.L\C)
    for m = 1:p
        for n = 1:p
            for i = 1:q
                @inbounds θ[m, n] += A[i, m] * m1[i, n]
            end
        end
        for i = 1:q
            @inbounds β[m] += A[i, m] * m2[i]
        end
    end
    θ, β
end

function mulsimple2!(θ, β, A::AbstractMatrix, B, C::Vector)
    q = size(B, 1)
    p = size(A, 2)
    m1 = zeros(q, p)
    m2 = zeros(q)
    ldiv!(m1, B, A)
    ldiv!(m2, B, C)
    for m = 1:p
        for n = 1:p
            for i = 1:q
                @inbounds θ[m, n] += A[i, m] * m1[i, n]
            end
        end
        for i = 1:q
            @inbounds β[m] += A[i, m] * m2[i]
        end
    end
    θ, β
end


A = [1 3 4 20 3; 2 3 2 3 4; 3 2 5 6 6; 3 4 5 23.5 6]
C = [3, 4, 5, 23.5]
B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10.0]

function f1(θ,β,A,B,C)
    mulθβinc!(θ, β, A, inv(B), C)
end
function f2(θ,β,A,B,C)
    mulsimple!(θ, β, A, lu(B, Val(false)), C)
end
function f3(θ,β,A,B,C)
    mulsimple2!(θ, β, A, lu(B, Val(false)), C)
end

θ = zeros(5, 5)
β = zeros(5)
b1 =  @benchmark f1(θ,β,A,B,C)
b2 =  @benchmark f2(θ,β,A,B,C)
b3 =  @benchmark f3(θ,β,A,B,C)

result:

julia> b1
BenchmarkTools.Trial:
  memory estimate:  2.64 KiB
  allocs estimate:  7
  --------------
  minimum time:     1.050 μs (0.00% GC)
  median time:      1.110 μs (0.00% GC)
  mean time:        2.578 μs (17.18% GC)
  maximum time:     1.136 ms (99.62% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> b2
BenchmarkTools.Trial:
  memory estimate:  1.88 KiB
  allocs estimate:  12
  --------------
  minimum time:     1.540 μs (0.00% GC)
  median time:      2.090 μs (0.00% GC)
  mean time:        2.951 μs (5.30% GC)
  maximum time:     780.390 μs (98.97% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> b3
BenchmarkTools.Trial:
  memory estimate:  960 bytes
  allocs estimate:  10
  --------------
  minimum time:     21.999 μs (0.00% GC)
  median time:      30.800 μs (0.00% GC)
  mean time:        35.822 μs (0.00% GC)
  maximum time:     977.000 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 

It’s easier than that. You don’t have to do B.L … All that you need is

B\b

and Julia manages the factors fast and with less allocation.

1 Like

Most of these benchmarks use really small matrices which have different performance characteristics then larger ones. I would suggest testing with much larger matrices matrices.

If on the other hand these matrix sizes are representative of what you are trying to do I would suggest using either SMatrix or MMatrix from StaticArrays.jl

8 Likes

I’m using small matrices in work… and make tests for samll matreces, but in bigger results not so helpful… if i make inv(V) one time and use it at least twise it is better than make A\V each time. LU decomposition help a little, and I think i should use it, but I don’t see big performance increase. One strange thing:

Seems R somehow know about structure and make optimizations… or something else… Why in Julia inv() not works faster without direct structuring?

Am = rand(150, 150)
Bm = rand(150, 150)
@rput Am
@rput Bm
blocks  = [rand(3,3) for i in 1:50]
BDm     = BlockDiagonal(blocks)
V       = Matrix(BDm)
@rput V
b1 = @benchmark R"solve(V)"
b2 = @benchmark R"solve(Bm)"
julia> b1
BenchmarkTools.Trial:
  memory estimate:  1.08 KiB
  allocs estimate:  31
  --------------
  minimum time:     632.999 μs (0.00% GC)
  median time:      795.801 μs (0.00% GC)
  mean time:        1.109 ms (0.00% GC)
  maximum time:     31.604 ms (0.00% GC)
  --------------
  samples:          4464
  evals/sample:     1

julia> b2
BenchmarkTools.Trial:
  memory estimate:  1.08 KiB
  allocs estimate:  31
  --------------
  minimum time:     2.414 ms (0.00% GC)
  median time:      3.076 ms (0.00% GC)
  mean time:        3.751 ms (0.00% GC)
  maximum time:     28.036 ms (0.00% GC)
  --------------
  samples:          1329
  evals/sample:     1

julia>
1 Like

Besides helping @PharmCat to find the right approach in general let’s also try to answer the actual question: Why is inv slower than R’s solve for these matrices? Am I right that @PharmCat is basically profiling different LAPACK implementations and R might be using a fast one?

4 Likes

What version of Julia are you using? You have almost 20x performance difference here, between f1 and f2, that doesn’t seem right. I get only a 30% difference.

Also, when benchmarking, make sure to interpolate the variables in the benchmarking expressions:

@benchmark f1($A, $B)
@benchmark f2($A, $B)

Finally, if you are consistently using small matrices and vectors, where you know the sizes beforehand, you should definitely look at the StaticArray package, like @WschW suggests. You will probably get dramatic speedups that way, at least 10x I believe.

2 Likes

I think these are equivalent to f1(A,B) and f2(A,B) above, but are much closer together for me, and using MKL.jl helps a lot. (Not sure about R but this is python numpy’s default.)

julia> @btime A\B setup = begin A = [1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]; B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10] end;
  295.248 ns (3 allocations: 528 bytes)

julia> @btime inv(A)*B setup = begin A = [1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]; B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10] end;
  610.280 ns (10 allocations: 1.08 KiB)

julia> using LinearAlgebra; BLAS.vendor()
:mkl

vs

julia> @btime A\B setup = begin A = [1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]; B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10] end;
  1.322 μs (4 allocations: 560 bytes)

julia> @btime inv(A)*B setup = begin A = [1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]; B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10] end;
  1.127 μs (11 allocations: 3.13 KiB)

julia> using LinearAlgebra; BLAS.vendor()
:openblas64

Edit: with StaticArrays for completeness:

julia> using StaticArrays

julia> @btime A[]\B[] setup = begin  A = Ref(SA[1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]); B = Ref(SA[1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10]) end;
  111.477 ns (0 allocations: 0 bytes)

julia> @btime inv(A[]) * B[] setup = begin  A = Ref(SA[1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]); B = Ref(SA[1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10]) end;
  28.541 ns (0 allocations: 0 bytes)
3 Likes

Thanks to all!
@mcabbott, @WschW - yes, StaticArrays can help whe matrices if it smaller than 14x14 (and can’t be used for 150x150 matrices). For this matrices fastest method is inv(SMatrix{m,m}(M)). I don’t know why, but if I dont’t use Matrix(inv(SMatrix{m,m}(M))) I have big (2 - 3 times) memory consumption in subsequent calculations.
Big block-diagonal matrices is still inversing slower than in R. Maybe MKL.jl will helps.

Can you make a block matrix with StaticArrays for blocks?

julia> blocks  = [rand(3,3) for i in 1:50];

julia> B = BlockDiagonal(blocks);

julia> blocks  = [@SMatrix rand(3,3) for i in 1:50];

julia> Bs = BlockDiagonal(blocks);

julia> @benchmark inv($B)
BenchmarkTools.Trial: 
  memory estimate:  98.14 KiB
  allocs estimate:  201
  --------------
  minimum time:     41.856 μs (0.00% GC)
  median time:      42.481 μs (0.00% GC)
  mean time:        46.913 μs (4.77% GC)
  maximum time:     1.552 ms (95.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark inv($Bs)
BenchmarkTools.Trial: 
  memory estimate:  3.69 KiB
  allocs estimate:  1
  --------------
  minimum time:     854.489 ns (0.00% GC)
  median time:      916.574 ns (0.00% GC)
  mean time:        1.051 μs (10.00% GC)
  maximum time:     43.197 μs (94.81% GC)
  --------------
  samples:          10000
  evals/sample:     47
1 Like

@DNF, from one side I can do this. And in most cases it will be best solution.
In my case I’m doing this just with Vector of matrices - it’s work with same performance as BlockDiagonal. I don’t check how solve.block in R works… may be it works not worse than inverce BlockDiagonal in Julia.