# [ANN] RecursiveFactorization.jl

RecursiveFactorization allows you to perform any factorization you’d like so long as it is `lu`.

If you like LU, DifferentialEquations.jl, or recursion, you’ll love RecursiveFactorization.jl.

DifferentialEquations.jl uses it as the default LU factorization for matrices smaller than 500 x 500. This package will greatly boost the speed of implicit stiff differential equation solvers, as most of their runtime is spent on the matrix factorization (the only O(n^3) operation in a stiff solver).
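For context, the package exposes a drop-in replacement for `LinearAlgebra.lu`. A minimal sketch of the kind of solve a stiff solver performs (assuming the package is installed; the matrix `W` here is just a well-conditioned stand-in for an implicit solver's iteration matrix):

```julia
using LinearAlgebra
using RecursiveFactorization  # third-party; install with ] add RecursiveFactorization

# Implicit solvers repeatedly solve linear systems with an iteration
# matrix; factorizing it is the expensive O(n^3) step.
W = rand(100, 100) + 10I   # illustrative stand-in, diagonally dominant
b = rand(100)

F = RecursiveFactorization.lu(W)  # same interface as LinearAlgebra.lu
x = F \ b

norm(W * x - b)  # should be ≈ 0 up to rounding
```

The returned factorization supports the usual `\` solve, so it can be swapped in wherever `LinearAlgebra.lu` is used.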

For the following plots, the reference is `LinearAlgebra.generic_lufact!` (a triple-loop LU factorization with partial pivoting), `OpenBLAS` is invoked via `LinearAlgebra.lu!`, and `RF` denotes `RecursiveFactorization`. “RF fully iterative” means only the base case of the recursion is used, not to be confused with an iterative solver. “RF fully recursive” means the recursive method is used for all sizes above 10, with the base case calling the optimized triple-loop LU for blocks smaller than 16x16. The `threshold` is the size cutoff for switching to the recursive algorithm. (Sorry for the confusing labels!)

(Plot: OpenBLAS 0.3.9; CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz)

(Plot: OpenBLAS 0.3.9; CPU: Intel(R) Core™ i9-10980XE CPU @ 3.00GHz)

The first two examples use optimally tuned meta-parameters.

(Plot: OpenBLAS 0.3.9; CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz)

This example has a badly tuned meta-parameter. The default threshold here is 192, but one can see that the optimal value is around 80. You can override it via:

```julia
RecursiveFactorization.RECURSION_THRESHOLD[] = 80
```

These plots were generated via:

```julia
using Pkg
using RecursiveFactorization
include(joinpath(pkgdir(RecursiveFactorization), "perf", "lu.jl"))
```

The script saves the plot to `joinpath(homedir(), "Pictures", "lu_float64.png")`. You want the threshold to be the size at which the fully recursive version overtakes the fully iterative one.

If you run this, please post the results and your CPU info here, as this will help us develop a better default algorithm!

### Acknowledgment

This package uses @Elrod’s `LoopVectorization.jl`.


Very nice work! Does the package also work for `Matrix{ComplexF64}`?

Yes, see https://github.com/YingboMa/RecursiveFactorization.jl/blob/a7901459875c5d1ad39e969ce9e16ad764de9557/test/runtests.jl#L19 . It is tested for `Float64, Float32, ComplexF64, ComplexF32, Real`. Though I haven’t benchmarked `Matrix{ComplexF64}`, I imagine it will be slower.

FWIW, I ran this script on my laptop:

```
:openblas64
Julia Version 1.5.0-DEV.854
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
```

and desktop:

```
:mkl
CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
```

Edit: same desktop on Julia 1.4.1, without MKL:

I have run into this before, but I’m surprised just how large the gap from OpenBLAS is here:

```julia
julia> rr = rand(50,50);

julia> @btime RecursiveFactorization.lu($rr); # on Julia 1.5
8.313 μs (4 allocations: 20.16 KiB)

julia> @btime lu($rr); # MKL, 1.5
12.471 μs (4 allocations: 20.16 KiB)

julia> @btime lu($rr); # openblas, 1.4.1
1.460 ms (4 allocations: 20.16 KiB)

julia> 1460/8.3
175.9036144578313
```

On my laptop (with 1.4 or 1.5) it’s around `20 μs`, which is more reasonable.


On an old desktop

```
julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df (2020-04-14 17:20 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, sandybridge)
Environment:
JULIA_EDITOR = atom  -a
JULIA_CUDA_VERBOSE = true
```

```
julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
```

```
julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
```

```
julia> versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen Threadripper 2950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
```

```
julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
```

Thank you for running the benchmark!

Note that `RecursiveFactorization` calls `ldiv!`, which in turn invokes the `*trtrs` routine provided by your BLAS, so RecursiveFactorization’s performance is highly dependent on how well OpenBLAS performs. Unfortunately, the benchmarks above suggest that the optimal `threshold` value depends strongly on the performance of `*trtrs`, since only the recursive algorithm calls `ldiv!`. This makes `threshold` almost impossible to tune for all computers. The next step might be to implement a fast `*trtrs`/`*trsm` in Julia.
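For reference, `*trtrs` is LAPACK's triangular solve. A pure-Julia replacement would start from something like the following unoptimized forward-substitution sketch for the lower-triangular case (the name `naive_trtrs!` is made up for illustration; a competitive version would need blocking and vectorization, e.g. via LoopVectorization.jl):

```julia
using LinearAlgebra

# Naive column-oriented forward substitution: overwrite b with the
# solution of L * x = b for lower-triangular L. Only a sketch of what
# an optimized *trtrs would compute.
function naive_trtrs!(b::AbstractVector, L::LowerTriangular)
    n = length(b)
    @inbounds for j in 1:n
        b[j] /= L[j, j]
        xj = b[j]
        for i in j+1:n
            b[i] -= L[i, j] * xj
        end
    end
    return b
end

L = LowerTriangular(rand(5, 5) + 5I)  # well-conditioned lower-triangular
b = rand(5)
x = naive_trtrs!(copy(b), L)
norm(L * x - b)  # should be ≈ 0 up to rounding
```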


I think it’s also worth pointing out that the primary goal was to perform better under “default settings” for sizes < 100, which it looks like it is doing fairly consistently, often by large margins.

“Default settings” included not setting `BLAS.set_num_threads(1)`. OpenBLAS is making very poor use of threads here: making it use only a single thread substantially boosted its performance on the computers I tested!
This is also probably a factor in why OpenBLAS did worse on the higher core-count systems, but did fairly well on laptops and old desktops.
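For anyone re-running the benchmark, pinning OpenBLAS to a single thread is a one-liner using the stock `LinearAlgebra.BLAS` API:

```julia
using LinearAlgebra

# Restrict the loaded BLAS to one thread before benchmarking small
# matrices; OpenBLAS's threading heuristics hurt at these sizes.
BLAS.set_num_threads(1)
```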

MKL, on the other hand, makes very good use of threads. My tests were consistent with what mcabbott found:

Also, because RecursiveFactorization is using `*trtrs`, you can see that it actually does better when paired with MKL itself (around 55 vs 45 GFLOPS in the two plots above).

Setting 1 thread will probably shrink the gap at the upper end of the size range in all cases: MKL will no longer be quite so far ahead, and (on systems where it was tripping over its own threads) OpenBLAS won’t be quite so far behind.

Aside from making performance more predictable (so that setting the recursion threshold becomes easier), I think a Julia `*trtrs` will further the package's primary objective: improving small-size (<100) performance over OpenBLAS, and even being competitive with MKL over that range, regardless of threading and without the DiffEq ecosystem having to mess with the user's settings.


Here are my results on a MacBook Pro Ice Lake.

```
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, goldmont)
```

And I repeated it with 1.6:

```
Julia Version 1.6.0-DEV.306
Commit 59b8dde7c1 (2020-06-26 09:21 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, icelake-client)
```

```
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
JULIA_EDITOR = atom.cmd -a
```