Independent LU factorization of small matrices not faster with threads

My problem is that I want to solve many small linear systems of size 3x3 in parallel, using LU factorization.

using LinearAlgebra
using TimerOutputs
using Random

reset_timer!()

function f_lu(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "Inverse LAPACK lu" begin
        for n = 1:N
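            # gesv! factorizes A[:,:,n] in place and overwrites b[:,n] with the solution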
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_lu_inbounds(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "Inverse LAPACK lu inbounds" begin
        @inbounds for n = 1:N
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_lu_threads(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "Inverse LAPACK lu threads" begin
        Threads.@threads for n = 1:N
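            # each solve is independent, so iterations are split across Julia threads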
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_lu_julia(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    x = zeros(3,N)
    @timeit "Inverse lu julia" begin
        for n = 1:N
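            # pure-Julia path: lu! factorizes in place, ldiv! writes the solution into x[:,n]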
            @views ldiv!(x[:,n], lu!(A[:,:,n]), b[:,n])
        end
    end
    x
end

N = 4000*10^4
fs = [f_lu_inbounds, f_lu, f_lu_julia, f_lu_threads]
res = map(f -> f(N), fs)
for n = 2:length(fs)
    @assert res[n] β‰ˆ res[1]
end

print_timer()

yields

julia> include("test_lapack.jl")
 ─────────────────────────────────────────────────────────────────────────────────────
                                              Time                   Allocations      
                                      ──────────────────────   ───────────────────────
           Tot / % measured:                291s / 37.1%           45.3GiB / 60.5%    

 Section                      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────────
 Inverse lu julia                  1    34.8s  32.2%   34.8s   14.9GiB  54.3%  14.9GiB
 Inverse LAPACK lu threads         1    28.1s  26.0%   28.1s   4.17GiB  15.2%  4.17GiB
 Inverse LAPACK lu                 1    25.4s  23.5%   25.4s   4.17GiB  15.2%  4.17GiB
 Inverse LAPACK lu inbounds        1    19.7s  18.3%   19.7s   4.17GiB  15.2%  4.17GiB
 ─────────────────────────────────────────────────────────────────────────────────────
julia> Threads.nthreads()
2

So, first, I was surprised that calling the LAPACK routine directly saved 30% in time. Secondly, even though I used two threads to solve the same problem, the computational time increased instead of decreasing, so I have probably misunderstood how threads work.

How would I go about speeding up this type of calculation?

Set the number of BLAS threads to 1: BLAS.set_num_threads(1). Otherwise every gesv! call goes through OpenBLAS’s own threading machinery, whose overhead swamps a 3x3 solve and competes with Julia’s @threads loop.
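For reference, a minimal sketch of where the call goes (the rest of the script above stays unchanged):

using LinearAlgebra

# Keep OpenBLAS single-threaded so each tiny gesv! call avoids the BLAS
# threading overhead and Threads.@threads supplies all the parallelism.
BLAS.set_num_threads(1)

With that one change, the same benchmark yields: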

 ─────────────────────────────────────────────────────────────────────────────────────
                                              Time                   Allocations
                                      ──────────────────────   ───────────────────────
           Tot / % measured:               14.6s / 71.5%           3.49GiB / 47.9%

 Section                      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────────
 Inverse LAPACK lu inbounds        1    4.84s  46.3%   4.84s    427MiB  25.0%   427MiB
 Inverse lu julia                  1    2.42s  23.2%   2.42s    427MiB  25.0%   427MiB
 Inverse LAPACK lu                 1    1.80s  17.2%   1.80s    427MiB  25.0%   427MiB
 Inverse LAPACK lu threads         1    1.40s  13.4%   1.40s    429MiB  25.1%   429MiB
 ─────────────────────────────────────────────────────────────────────────────────────

This is on an anemic Surface 4 with 2 cores and 2 threads.

BTW: if the matrices are that small, I’d consider StaticArrays.

Use StaticArrays.jl instead. For such small matrices, LAPACK has a huge overhead β€” you really just want to unroll the whole computation.
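For example (a minimal sketch; with the dimensions encoded in the type, the 3x3 solve compiles to straight-line code):

using StaticArrays

A = @SMatrix randn(3, 3)  # size is part of the type
b = @SVector randn(3)
x = A \ b                 # fully unrolled solve: no allocation, no LAPACK call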

Piggybacking off of @stevengj: we’ve found that StaticArrays is the fastest up to about 16x16, then RecursiveFactorization.jl (https://github.com/YingboMa/RecursiveFactorization.jl) is the fastest until about 200x200 (unless you use MKL), and only then does OpenBLAS become acceptable. So essentially, avoid OpenBLAS as much as you can here.
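If I understand its API correctly, RecursiveFactorization.lu! is meant as a drop-in replacement for LinearAlgebra.lu!, so the swap for a mid-size system would look roughly like this:

using LinearAlgebra
using RecursiveFactorization

A = randn(100, 100)
b = randn(100)
F = RecursiveFactorization.lu!(copy(A))  # same kind of LU object as LinearAlgebra.lu!
x = F \ b                                # solve with the returned factorization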

Yes indeed. I guess for 3x3 matrix inversion it is reasonable to just calculate the inverse explicitly.

using LinearAlgebra
using TimerOutputs
using Random
using StaticArrays

function f_1(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "LAPACK LU" begin
        @inbounds for n = 1:N
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_2(N)
    Random.seed!(1)
    A = [@SMatrix randn(3,3) for n = 1:N]
    b = [@SVector randn(3) for n = 1:N]
    x = [@MVector zeros(3) for n = 1:N]
    @timeit "StaticArrays" begin
        @inbounds for n = 1:N
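            # \ on a 3x3 SMatrix is fully unrolled at compile time; no LAPACK call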
            x[n] .= A[n] \ b[n]
        end
    end
    xr = zeros(3,N)
    for n = 1:N
        xr[:,n] = convert(Array, x[n])
    end
    xr
end


reset_timer!()
println("Compile")
fs = [f_1, f_2]
foreach(f -> f(1), fs)

reset_timer!()
println("Run")
N = 10^7
res = map(f -> f(N), fs)

for n = 2:length(fs)
    @assert res[n] β‰ˆ res[1]
end

print_timer()

julia> include("test_inverting_3x3.jl")
Compile
Run
 ───────────────────────────────────────────────────────────────────────
                                Time                   Allocations      
                        ──────────────────────   ───────────────────────
    Tot / % measured:        7.76s / 44.1%           4.71GiB / 22.1%    

 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────
 LAPACK LU           1    3.29s  96.1%   3.29s   1.04GiB  100%   1.04GiB
 StaticArrays        1    132ms  3.86%   132ms     0.00B  0.00%    0.00B
 ───────────────────────────────────────────────────────────────────────
julia> 

It can be SVector, not MVector. And you don’t need .=, just =. The reason is that you don’t need to mutate the entries of x[n]; you can instead replace the whole vector.
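That is, a minimal sketch of the change to f_2:

x = [@SVector zeros(3) for n = 1:N]  # immutable SVector instead of MVector
# ... inside the loop:
x[n] = A[n] \ b[n]  # rebinds the slot to a new SVector; nothing is mutated in place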

I’m curious how well MKL would do for repeated operations, using its JIT interface with Direct Call.
I’d bet it could be very competitive with RecursiveFactorization.jl and maybe even StaticArrays.jl.

If we did things right then it’ll match. If we didn’t, then we’ll benchmark it and find out how to make it the same. So yes, it’s worth benchmarking.

Many thanks for all the answers!