Performance inv & solve in Julia & R

PharmCat · August 9, 2020, 4:22pm

How can it be:

Matrix inv(A) where A 150x150 matrix twice slower (2.643 ms) than in R (1.049 ms). But invercing by blocks in R ( 3.275 ms) slower than in Julia (58.917 μs) many times. Second not surprising, but first… How peroformance like in R can be achived?

using RCall
using BlockDiagonals
using LinearAlgebra
using BenchmarkTools

blocks  = [rand(3,3) for i in 1:50]
BDm     = BlockDiagonal(blocks)
V       = Matrix(BDm)
sBDm    = BlockDiagonal(sblocks)
@rput V
@rput blocks

b1 = @benchmark inv(V)
b2 = @benchmark inv.(blocks)
b3 = @benchmark inv(BDm)
b4 = @benchmark R"solve(V)"
b5 = @benchmark R"for (i in 1:50) {solve(blocks[[i]])}"

#=
julia> b1
BenchmarkTools.Trial:
  memory estimate:  252.41 KiB
  allocs estimate:  6
  --------------
  minimum time:     1.297 ms (0.00% GC)
  median time:      1.610 ms (0.00% GC)
  mean time:        2.643 ms (0.29% GC)
  maximum time:     305.198 ms (0.00% GC)
  --------------
  samples:          1888
  evals/sample:     1

julia> b2
BenchmarkTools.Trial:
  memory estimate:  99.73 KiB
  allocs estimate:  253
  --------------
  minimum time:     32.200 μs (0.00% GC)
  median time:      39.999 μs (0.00% GC)
  mean time:        58.917 μs (9.61% GC)
  maximum time:     4.830 ms (98.32% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> b3
BenchmarkTools.Trial:
  memory estimate:  99.73 KiB
  allocs estimate:  253
  --------------
  minimum time:     31.900 μs (0.00% GC)
  median time:      39.299 μs (0.00% GC)
  mean time:        56.484 μs (9.13% GC)
  maximum time:     4.496 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> b4
BenchmarkTools.Trial:
  memory estimate:  1.06 KiB
  allocs estimate:  30
  --------------
  minimum time:     637.901 μs (0.00% GC)
  median time:      783.800 μs (0.00% GC)
  mean time:        1.049 ms (0.00% GC)
  maximum time:     32.893 ms (0.00% GC)
  --------------
  samples:          4713
  evals/sample:     1

julia> b5
BenchmarkTools.Trial:
  memory estimate:  1.06 KiB
  allocs estimate:  30
  --------------
  minimum time:     1.916 ms (0.00% GC)
  median time:      2.508 ms (0.00% GC)
  mean time:        3.275 ms (0.00% GC)
  maximum time:     26.751 ms (0.00% GC)
  --------------
  samples:          1520
  evals/sample:     1

julia>
=#

Oscar_Smith · August 9, 2020, 4:27pm

This isn’t an answer, but it is worth noting that inv it’s almost always the wrong tool to use.

PharmCat · August 9, 2020, 4:32pm

Hm… what should i use to get inverted matrix?

Oscar_Smith · August 9, 2020, 4:38pm

In general, you should use b\a instead of inv(a)*b. If you need to store it for repeated use, you should use a factorization. This will typically be faster, but also leads to more accurate results.

PharmCat · August 9, 2020, 5:11pm

I’m using inverted matrix in operations:
A *inv(B)*C
A *inv(B)*y
r’*inv(B)*r

For A\B i have following results: less memory more time…

A = [1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]
B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10]
function f1(A,B)
    A\B
end
function f2(A,B)
    inv(A)*B
end
#=
ulia> @benchmark f1(A,B)
BenchmarkTools.Trial:
  memory estimate:  560 bytes
  allocs estimate:  4
  --------------
  minimum time:     19.501 μs (0.00% GC)
  median time:      30.900 μs (0.00% GC)
  mean time:        36.059 μs (0.00% GC)
  maximum time:     824.400 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f2(A,B)
BenchmarkTools.Trial:
  memory estimate:  3.13 KiB
  allocs estimate:  11
  --------------
  minimum time:     1.120 μs (0.00% GC)
  median time:      1.280 μs (0.00% GC)
  mean time:        2.035 μs (13.55% GC)
  maximum time:     572.610 μs (98.95% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia>
=#

ctkelley · August 9, 2020, 5:56pm

Try B=lu!(A). That stores the factorization in A and the permutation vector in part of the LU structure B. The you can do your solves with B\b and reuse the factorization. Your function does A\b everytime and therefore repeats the factorization. B and A share storage for the matrix.

julia> A=[1.0 2.0; 3.0 4.0];
julia> B=lu!(A);
julia> typeof(B)
LU{Float64,Array{Float64,2}}
julia> typeof(A)
Array{Float64,2}

This paradigm is exactly the way LINPACK did it in Fortran 77.

PharmCat · August 9, 2020, 10:16pm

I made some variants. Variant with m1 = B.U\(B.L\A) may be little better, but I surpriesed that ldiv!(m1, B, A) have lack of performance.

using BenchmarkTools, LinearAlgebra
"""
A' * B * A -> +θ 
A' * B * C -> +β
"""
function mulθβinc!(θ, β, A::AbstractMatrix, B::AbstractMatrix, C::Vector)
    q = size(B, 1)
    p = size(A, 2)
    c = zeros(eltype(B), q)
    for i = 1:p
        c .= 0
        for n = 1:q
            for m = 1:q
                @inbounds c[n] += B[m, n] * A[m, i]
            end
            @inbounds β[i] += c[n] * C[n]
        end
        for n = 1:p
            for m = 1:q
                @inbounds θ[i, n] += A[m, n] * c[m]
            end
        end
    end
    θ, β
end

function mulsimple!(θ, β, A::AbstractMatrix, B, C::Vector)
    q = size(B, 1)
    p = size(A, 2)
    m1 = B.U\(B.L\A)
    m2 = B.U\(B.L\C)
    for m = 1:p
        for n = 1:p
            for i = 1:q
                @inbounds θ[m, n] += A[i, m] * m1[i, n]
            end
        end
        for i = 1:q
            @inbounds β[m] += A[i, m] * m2[i]
        end
    end
    θ, β
end

function mulsimple2!(θ, β, A::AbstractMatrix, B, C::Vector)
    q = size(B, 1)
    p = size(A, 2)
    m1 = zeros(q, p)
    m2 = zeros(q)
    ldiv!(m1, B, A)
    ldiv!(m2, B, C)
    for m = 1:p
        for n = 1:p
            for i = 1:q
                @inbounds θ[m, n] += A[i, m] * m1[i, n]
            end
        end
        for i = 1:q
            @inbounds β[m] += A[i, m] * m2[i]
        end
    end
    θ, β
end


A = [1 3 4 20 3; 2 3 2 3 4; 3 2 5 6 6; 3 4 5 23.5 6]
C = [3, 4, 5, 23.5]
B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10.0]

function f1(θ,β,A,B,C)
    mulθβinc!(θ, β, A, inv(B), C)
end
function f2(θ,β,A,B,C)
    mulsimple!(θ, β, A, lu(B, Val(false)), C)
end
function f3(θ,β,A,B,C)
    mulsimple2!(θ, β, A, lu(B, Val(false)), C)
end

θ = zeros(5, 5)
β = zeros(5)
b1 =  @benchmark f1(θ,β,A,B,C)
b2 =  @benchmark f2(θ,β,A,B,C)
b3 =  @benchmark f3(θ,β,A,B,C)

result:

julia> b1
BenchmarkTools.Trial:
  memory estimate:  2.64 KiB
  allocs estimate:  7
  --------------
  minimum time:     1.050 μs (0.00% GC)
  median time:      1.110 μs (0.00% GC)
  mean time:        2.578 μs (17.18% GC)
  maximum time:     1.136 ms (99.62% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> b2
BenchmarkTools.Trial:
  memory estimate:  1.88 KiB
  allocs estimate:  12
  --------------
  minimum time:     1.540 μs (0.00% GC)
  median time:      2.090 μs (0.00% GC)
  mean time:        2.951 μs (5.30% GC)
  maximum time:     780.390 μs (98.97% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> b3
BenchmarkTools.Trial:
  memory estimate:  960 bytes
  allocs estimate:  10
  --------------
  minimum time:     21.999 μs (0.00% GC)
  median time:      30.800 μs (0.00% GC)
  mean time:        35.822 μs (0.00% GC)
  maximum time:     977.000 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia>

ctkelley · August 9, 2020, 11:20pm

It’s easier than that. You don’t have to do B.L … All that you need is

B\b

and Julia manages the factors fast and with less allocation.

WschW · August 9, 2020, 11:27pm

Most of these benchmarks use really small matrices which have different performance characteristics then larger ones. I would suggest testing with much larger matrices matrices.

If on the other hand these matrix sizes are representative of what you are trying to do I would suggest using either SMatrix or MMatrix from StaticArrays.jl

PharmCat · August 10, 2020, 8:12am

I’m using small matrices in work… and make tests for samll matreces, but in bigger results not so helpful… if i make inv(V) one time and use it at least twise it is better than make A\V each time. LU decomposition help a little, and I think i should use it, but I don’t see big performance increase. One strange thing:

Seems R somehow know about structure and make optimizations… or something else… Why in Julia inv() not works faster without direct structuring?

Am = rand(150, 150)
Bm = rand(150, 150)
@rput Am
@rput Bm
blocks  = [rand(3,3) for i in 1:50]
BDm     = BlockDiagonal(blocks)
V       = Matrix(BDm)
@rput V
b1 = @benchmark R"solve(V)"
b2 = @benchmark R"solve(Bm)"

julia> b1
BenchmarkTools.Trial:
  memory estimate:  1.08 KiB
  allocs estimate:  31
  --------------
  minimum time:     632.999 μs (0.00% GC)
  median time:      795.801 μs (0.00% GC)
  mean time:        1.109 ms (0.00% GC)
  maximum time:     31.604 ms (0.00% GC)
  --------------
  samples:          4464
  evals/sample:     1

julia> b2
BenchmarkTools.Trial:
  memory estimate:  1.08 KiB
  allocs estimate:  31
  --------------
  minimum time:     2.414 ms (0.00% GC)
  median time:      3.076 ms (0.00% GC)
  mean time:        3.751 ms (0.00% GC)
  maximum time:     28.036 ms (0.00% GC)
  --------------
  samples:          1329
  evals/sample:     1

julia>

mschauer · August 10, 2020, 8:47am

Besides helping @PharmCat to find the right approach in general let’s also try to answer the actual question: Why is inv slower than R’s solve for these matrices? Am I right that @PharmCat is basically profiling different LAPACK implementations and R might be using a fast one?

DNF · August 10, 2020, 10:20am

PharmCat:

function f1(A,B)
    A\B
end
function f2(A,B)
    inv(A)*B
end
#=
ulia> @benchmark f1(A,B)
BenchmarkTools.Trial:
  memory estimate:  560 bytes
  allocs estimate:  4
  --------------
  minimum time:     19.501 μs (0.00% GC)
[...]
julia> @benchmark f2(A,B)
BenchmarkTools.Trial:
  memory estimate:  3.13 KiB
  allocs estimate:  11
  --------------
  minimum time:     1.120 μs (0.00% GC)

What version of Julia are you using? You have almost 20x performance difference here, between f1 and f2, that doesn’t seem right. I get only a 30% difference.

Also, when benchmarking, make sure to interpolate the variables in the benchmarking expressions:

@benchmark f1($A, $B)
@benchmark f2($A, $B)

Finally, if you are consistently using small matrices and vectors, where you know the sizes beforehand, you should definitely look at the StaticArray package, like @WschW suggests. You will probably get dramatic speedups that way, at least 10x I believe.

mcabbott · August 10, 2020, 12:04pm

I think these are equivalent to f1(A,B) and f2(A,B) above, but are much closer together for me, and using MKL.jl helps a lot. (Not sure about R but this is python numpy’s default.)

julia> @btime A\B setup = begin A = [1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]; B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10] end;
  295.248 ns (3 allocations: 528 bytes)

julia> @btime inv(A)*B setup = begin A = [1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]; B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10] end;
  610.280 ns (10 allocations: 1.08 KiB)

julia> using LinearAlgebra; BLAS.vendor()
:mkl

vs

julia> @btime A\B setup = begin A = [1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]; B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10] end;
  1.322 μs (4 allocations: 560 bytes)

julia> @btime inv(A)*B setup = begin A = [1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]; B = [1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10] end;
  1.127 μs (11 allocations: 3.13 KiB)

julia> using LinearAlgebra; BLAS.vendor()
:openblas64

Edit: with StaticArrays for completeness:

julia> using StaticArrays

julia> @btime A[]\B[] setup = begin  A = Ref(SA[1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]); B = Ref(SA[1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10]) end;
  111.477 ns (0 allocations: 0 bytes)

julia> @btime inv(A[]) * B[] setup = begin  A = Ref(SA[1 3 4 20; 2 3 2 3; 3 2 5 6; 3 4 5 23]); B = Ref(SA[1 3 4 3; 2 3 2 3; 3 4 20 6; 3 4 5 10]) end;
  28.541 ns (0 allocations: 0 bytes)

PharmCat · August 11, 2020, 6:07pm

Thanks to all!
@mcabbott, @WschW - yes, StaticArrays can help whe matrices if it smaller than 14x14 (and can’t be used for 150x150 matrices). For this matrices fastest method is inv(SMatrix{m,m}(M)). I don’t know why, but if I dont’t use Matrix(inv(SMatrix{m,m}(M))) I have big (2 - 3 times) memory consumption in subsequent calculations.
Big block-diagonal matrices is still inversing slower than in R. Maybe MKL.jl will helps.

DNF · August 11, 2020, 8:05pm

Can you make a block matrix with StaticArrays for blocks?

julia> blocks  = [rand(3,3) for i in 1:50];

julia> B = BlockDiagonal(blocks);

julia> blocks  = [@SMatrix rand(3,3) for i in 1:50];

julia> Bs = BlockDiagonal(blocks);

julia> @benchmark inv($B)
BenchmarkTools.Trial: 
  memory estimate:  98.14 KiB
  allocs estimate:  201
  --------------
  minimum time:     41.856 μs (0.00% GC)
  median time:      42.481 μs (0.00% GC)
  mean time:        46.913 μs (4.77% GC)
  maximum time:     1.552 ms (95.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark inv($Bs)
BenchmarkTools.Trial: 
  memory estimate:  3.69 KiB
  allocs estimate:  1
  --------------
  minimum time:     854.489 ns (0.00% GC)
  median time:      916.574 ns (0.00% GC)
  mean time:        1.051 μs (10.00% GC)
  maximum time:     43.197 μs (94.81% GC)
  --------------
  samples:          10000
  evals/sample:     47

PharmCat · August 11, 2020, 9:26pm

@DNF, from one side I can do this. And in most cases it will be best solution.
In my case I’m doing this just with Vector of matrices - it’s work with same performance as BlockDiagonal. I don’t check how solve.block in R works… may be it works not worse than inverce BlockDiagonal in Julia.

Topic		Replies	Views
Computing Inverse of a stack of matrices General Usage performance , linearalgebra	19	3608	August 29, 2020
Non-allocating matrix inversion Performance matrices	13	3580	April 23, 2022
How to call blas getrf, getri properly ? i want to create a benchmark inverse matrix using gauss, crout, native julia inv(A) and BLAS direct New to Julia question , blas , mkl , benchmark , openblas	5	128	May 7, 2025
Block diagonal and sparse matrices Numerics linearalgebra	13	1830	October 30, 2023
LinearAlgebra.inv matrix inversion slower than MATLAB? Performance performance	21	2149	October 25, 2021

Performance inv & solve in Julia & R

Related topics