# Independent LU factorization of small matrices not faster with threads

My problem is that I want to solve many small 3x3 linear systems, in parallel, using LU factorization.

```julia
using LinearAlgebra
using TimerOutputs
using Random

reset_timer!()

function f_lu(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "Inverse LAPACK lu" begin
        for n = 1:N
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_lu_inbounds(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "Inverse LAPACK lu inbounds" begin
        @inbounds for n = 1:N
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_lu_threads(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "Inverse LAPACK lu threads" begin
        Threads.@threads for n = 1:N
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_lu_julia(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    x = zeros(3,N)
    @timeit "Inverse lu julia" begin
        for n = 1:N
            @views ldiv!(x[:,n], lu!(A[:,:,n]), b[:,n])
        end
    end
    x
end

N = 4000*10^4
fs = [f_lu_inbounds, f_lu, f_lu_julia, f_lu_threads]
res = map(f -> f(N), fs)
for n = 2:length(fs)
    @assert res[n] ≈ res[1]
end

print_timer()
```

yields

```
julia> include("test_lapack.jl")
 ─────────────────────────────────────────────────────────────────────────────────────
                                              Time                   Allocations
                                      ──────────────────────   ───────────────────────
      Tot / % measured:                    291s / 37.1%           45.3GiB / 60.5%

 Section                      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────────
 Inverse lu julia                  1    34.8s  32.2%   34.8s   14.9GiB  54.3%  14.9GiB
 Inverse LAPACK lu threads         1    28.1s  26.0%   28.1s   4.17GiB  15.2%  4.17GiB
 Inverse LAPACK lu                 1    25.4s  23.5%   25.4s   4.17GiB  15.2%  4.17GiB
 Inverse LAPACK lu inbounds        1    19.7s  18.3%   19.7s   4.17GiB  15.2%  4.17GiB
 ─────────────────────────────────────────────────────────────────────────────────────
2
```

So first, I was surprised that calling the LAPACK routine directly saved 30% in time. Secondly, even though I used two threads to solve the same problem, the computation time increased instead of decreasing, so I have probably misunderstood how threads work.

How can I speed up this type of calculation?

Set the number of BLAS threads to 1: `BLAS.set_num_threads(1)`.
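The point is that OpenBLAS's internal threading competes with Julia's own threads, and for 3x3 solves it only adds overhead. A minimal sketch of combining single-threaded BLAS with a `Threads.@threads` loop (run Julia with e.g. `julia -t 2`; `N` here is an arbitrary illustrative size):

```julia
using LinearAlgebra

# Pin BLAS to one thread so it does not fight with Julia's threads.
BLAS.set_num_threads(1)

N = 10^4
A = randn(3, 3, N)
b = randn(3, N)
A0 = copy(A); b0 = copy(b)   # keep originals to check the solutions

# Each 3x3 solve is independent, so the loop parallelizes trivially.
Threads.@threads for n = 1:N
    @views LAPACK.gesv!(A[:, :, n], b[:, n])  # b[:,n] is overwritten with the solution
end

# Spot-check: the original systems are solved.
@assert all(A0[:, :, n] * b[:, n] ≈ b0[:, n] for n = 1:N)
```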

```
 ─────────────────────────────────────────────────────────────────────────────────────
                                              Time                   Allocations
                                      ──────────────────────   ───────────────────────
      Tot / % measured:                   14.6s / 71.5%           3.49GiB / 47.9%

 Section                      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────────
 Inverse LAPACK lu inbounds        1    4.84s  46.3%   4.84s    427MiB  25.0%   427MiB
 Inverse lu julia                  1    2.42s  23.2%   2.42s    427MiB  25.0%   427MiB
 Inverse LAPACK lu                 1    1.80s  17.2%   1.80s    427MiB  25.0%   427MiB
 Inverse LAPACK lu threads         1    1.40s  13.4%   1.40s    429MiB  25.1%   429MiB
 ─────────────────────────────────────────────────────────────────────────────────────
```

This is on an anemic Surface 4 with 2 cores, 2 threads.

BTW: if the matrices are so small, I'd consider StaticArrays.


Use StaticArrays.jl instead. For such small matrices, LAPACK has a huge overhead: you really just want to unroll the whole computation.


Piggybacking off of @stevengj, we've found that StaticArrays is the fastest for < 16x16 or so, then RecursiveFactorization.jl (https://github.com/YingboMa/RecursiveFactorization.jl) is the fastest until about 200x200 (unless MKL), and then OpenBLAS is finally acceptable. So essentially, avoid OpenBLAS as much as you can here.
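For the mid-size range, a sketch of using the package, under the assumption that `RecursiveFactorization.lu!` is a drop-in replacement for `LinearAlgebra.lu!` (same LU object, usable with `ldiv!`); the 100x100 size is just an illustrative choice:

```julia
using LinearAlgebra
import RecursiveFactorization

A = randn(100, 100)
b = randn(100)
x = similar(b)

# Pure-Julia recursive LU; copy(A) because lu! factorizes in place.
F = RecursiveFactorization.lu!(copy(A))
ldiv!(x, F, b)   # x now solves A * x == b

@assert A * x ≈ b
```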


Yes indeed. I guess for 3x3 matrix inversion, it is reasonable to just calculate the inverse explicitly.

```julia
using LinearAlgebra
using TimerOutputs
using Random
using StaticArrays

function f_1(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "LAPACK LU" begin
        @inbounds for n = 1:N
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_2(N)
    Random.seed!(1)
    A = [@SMatrix(randn(3,3)) for n = 1:N]
    b = [@SVector(randn(3)) for n = 1:N]
    x = [@MVector(zeros(3)) for n = 1:N]
    @timeit "StaticArrays" begin
        @inbounds for n = 1:N
            x[n] .= A[n] \ b[n]
        end
    end
    xr = zeros(3,N)
    for n = 1:N
        xr[:,n] = convert(Array, x[n])
    end
    xr
end

reset_timer!()
println("Compile")
fs = [f_1, f_2]
foreach(f -> f(1), fs)

reset_timer!()
println("Run")
N = 10^7
res = map(f -> f(N), fs)

for n = 2:length(fs)
    @assert res[n] ≈ res[1]
end

print_timer()
```

```
julia> include("test_inverting_3x3.jl")
Compile
Run
 ───────────────────────────────────────────────────────────────────────
                                  Time                   Allocations
                          ──────────────────────   ───────────────────────
  Tot / % measured:            7.76s / 44.1%           4.71GiB / 22.1%

 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────
 LAPACK LU           1    3.29s  96.1%   3.29s   1.04GiB  100%   1.04GiB
 StaticArrays        1    132ms  3.86%   132ms     0.00B  0.00%    0.00B
 ───────────────────────────────────────────────────────────────────────
julia>
```

Can be SVector, not MVector. And you don't need `.=`, just `=`. The reason is that you don't need to mutate the entries of `x[n]`, and can instead replace the whole vector.
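A minimal sketch of the suggested change to the `f_2` loop above (stripped of the timing harness; `N` is arbitrary):

```julia
using StaticArrays

# With SVector, `\` returns a brand-new immutable vector, so plain
# rebinding with `=` replaces x[n] wholesale; no mutation needed.
N = 10^3
A = [@SMatrix(randn(3, 3)) for _ = 1:N]
b = [@SVector(randn(3)) for _ = 1:N]
x = Vector{SVector{3,Float64}}(undef, N)

@inbounds for n = 1:N
    x[n] = A[n] \ b[n]    # `=`, not `.=`
end

@assert all(A[n] * x[n] ≈ b[n] for n = 1:N)
```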


Iβm curious how well will MKL be for repeating operations using its `JIT` interface with `Direct Call`.
Iβd bet it can be very competitive with `RecursiveFactorization.jl` and maybe even `StaticArrays.jl`?

If we did things right then it'll match. If we didn't, then we'll benchmark it and find out how to make it the same. So yes, it's worth benchmarking.

Many thanks for all the answers!