Independent LU factorization of small matrices not faster with threads

hcarlsso · October 3, 2020, 2:18pm

My problem is that I want to solve many small linear systems of size 3x3, and to do it in parallel, using LU factorization.

using LinearAlgebra
using TimerOutputs
using Random

reset_timer!()

function f_lu(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "Inverse LAPACK lu" begin
        for n = 1:N
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_lu_inbounds(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "Inverse LAPACK lu inbounds" begin
        @inbounds for n = 1:N
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_lu_threads(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "Inverse LAPACK lu threads" begin
        Threads.@threads for n = 1:N
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_lu_julia(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    x = zeros(3,N)
    @timeit "Inverse lu julia" begin
        for n = 1:N
            @views ldiv!(x[:,n],  lu!(A[:,:,n]), b[:,n])
        end
    end
    x
end

N = 4000*10^4
fs = [f_lu_inbounds, f_lu, f_lu_julia, f_lu_threads]
res = map(f ->f(N), fs)
for n = 2:length(fs)
    @assert res[n] ≈ res[1]
end

print_timer()

yields

julia> include("test_lapack.jl")
 ─────────────────────────────────────────────────────────────────────────────────────
                                              Time                   Allocations      
                                      ──────────────────────   ───────────────────────
           Tot / % measured:                291s / 37.1%           45.3GiB / 60.5%    

 Section                      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────────
 Inverse lu julia                  1    34.8s  32.2%   34.8s   14.9GiB  54.3%  14.9GiB
 Inverse LAPACK lu threads         1    28.1s  26.0%   28.1s   4.17GiB  15.2%  4.17GiB
 Inverse LAPACK lu                 1    25.4s  23.5%   25.4s   4.17GiB  15.2%  4.17GiB
 Inverse LAPACK lu inbounds        1    19.7s  18.3%   19.7s   4.17GiB  15.2%  4.17GiB
 ─────────────────────────────────────────────────────────────────────────────────────
julia> Threads.nthreads()
2

So first I was surprised by calling Lapack routing directly saved 30% in time. Secondly, even though I used two threads to solve the same problem, the computational time increased instead of decreasing, so probably I have misunderstood how threads work.

How would I do to speed up this type of calculations?

PetrKryslUCSD · October 3, 2020, 2:52pm

Set the number of BLAS threads to 1. BLAS.set_num_threads(1)

PetrKryslUCSD · October 3, 2020, 2:59pm

 ─────────────────────────────────────────────────────────────────────────────────────
                                              Time                   Allocations
                                      ──────────────────────   ───────────────────────
           Tot / % measured:               14.6s / 71.5%           3.49GiB / 47.9%

 Section                      ncalls     time   %tot     avg     alloc   %tot      avg
 ─────────────────────────────────────────────────────────────────────────────────────
 Inverse LAPACK lu inbounds        1    4.84s  46.3%   4.84s    427MiB  25.0%   427MiB
 Inverse lu julia                  1    2.42s  23.2%   2.42s    427MiB  25.0%   427MiB
 Inverse LAPACK lu                 1    1.80s  17.2%   1.80s    427MiB  25.0%   427MiB
 Inverse LAPACK lu threads         1    1.40s  13.4%   1.40s    429MiB  25.1%   429MiB
 ─────────────────────────────────────────────────────────────────────────────────────

Anemic Surface 4 with 2 cores. 2 threads.

PetrKryslUCSD · October 3, 2020, 4:40pm

BTW: if the matrices are so small, I’d consider StaticArrays.

stevengj · October 3, 2020, 4:41pm

Use StaticArrays.jl instead. For such small matrices, LAPACK has a huge overhead — you really just want to unroll the whole computation.

ChrisRackauckas · October 3, 2020, 4:55pm

Piggybacking off of @stevengj, we’ve found that StaticArrays is the fastest for < 16x16 or so, then RecursiveFactorization.jl (https://github.com/YingboMa/RecursiveFactorization.jl) is the fastest until about 200x200 (unless MKL), and then OpenBLAS is finally acceptable. So essentially, avoid OpenBLAS as much as you can here.

hcarlsso · October 4, 2020, 5:01pm

Yes indeed. I guess for 3x3 matrix inversion, it is just reasonable to just calculate the inverse explicitly.

using LinearAlgebra
using TimerOutputs
using Random
using StaticArrays

function f_1(N)
    Random.seed!(1)
    A = randn(3,3,N)
    b = randn(3,N)
    @timeit "LAPACK LU" begin
        @inbounds for n = 1:N
            @views LAPACK.gesv!(A[:,:,n], b[:,n])
        end
    end
    b
end
function f_2(N)
    Random.seed!(1)
    A = [@SMatrix randn(3,3) for n = 1:N]
    b = [@SVector randn(3) for n = 1:N]
    x = [@MVector zeros(3) for n = 1:N]
    @timeit "StaticArrays" begin
        @inbounds for n = 1:N
            x[n] .= A[n] \ b[n]
        end
    end
    xr = zeros(3,N)
    for n = 1:N
        xr[:,n]  = convert(Array, x[n])
    end
    xr
end


reset_timer!()
println("Compile")
fs = [f_1, f_2]
foreach(f -> f(1), fs)

reset_timer!()
println("Run")
N = 10^7
res = map(f -> f(N), fs)

for n = 2:length(fs)
    @assert res[n] ≈ res[1]
end

print_timer()

julia> include("test_inverting_3x3.jl")
Compile
Run
 ───────────────────────────────────────────────────────────────────────
                                Time                   Allocations      
                        ──────────────────────   ───────────────────────
    Tot / % measured:        7.76s / 44.1%           4.71GiB / 22.1%    

 Section        ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────
 LAPACK LU           1    3.29s  96.1%   3.29s   1.04GiB  100%   1.04GiB
 StaticArrays        1    132ms  3.86%   132ms     0.00B  0.00%    0.00B
 ───────────────────────────────────────────────────────────────────────
julia>

stevengj · October 4, 2020, 5:10pm

Can be SVector, not MVector. And you don’t need .=, just =. The reason is that you don’t need to mutate the entries of x[n], and can instead replace the whole vector.

RoyiAvital · October 4, 2020, 6:04pm

I’m curious how well will MKL be for repeating operations using its JIT interface with Direct Call.
I’d bet it can be very competitive with RecursiveFactorization.jl and maybe even StaticArrays.jl?

ChrisRackauckas · October 4, 2020, 6:10pm

If we did things right then it’ll match. If we didn’t, then we’ll benchmark it and find out how to make it the same. So yes, it’s worth benchmarking.

hcarlsso · October 5, 2020, 4:06pm

Many thanks for all the answers!

Topic		Replies	Views
Performance gotcha in linear algebra lu() General Usage performance , linearalgebra	33	3722	February 11, 2020
Operations on small matrices and BLAS in 1.6.0 RC1 New to Julia	2	432	March 2, 2021
We can write an optimized BLAS library in pure Julia (please skip OP and jump to post 4) Numerics	17	13667	October 30, 2019
Alternate BLAS libraries? General Usage blas	22	3029	July 4, 2020
@inbounds: is the compiler now so smart that this is no longer necessary? Performance	33	3012	July 16, 2018

Independent LU factorization of small matrices not faster with threads

Related topics