[ANN] RecursiveFactorization.jl

RecursiveFactorization allows you to perform any factorization you’d like, so long as it is LU.

If you like LU, DifferentialEquations.jl, or recursion, you’ll love RecursiveFactorization.jl.

DifferentialEquations.jl uses it as the default method for factorizing matrices smaller than 500 x 500. This package will greatly boost the speed of implicit stiff differential equation solvers, as most of the runtime of implicit solvers is spent on the matrix factorization (the only O(n^3) operation in stiff solvers).

For the following plots, the reference is LinearAlgebra.generic_lufact! (a triple-loop LU factorization with partial pivoting), OpenBLAS is invoked via LinearAlgebra.lu!, and RF denotes RecursiveFactorization. “RF fully iterative” means that only the base case of the recursion is used (not to be confused with an iterative solver). “RF fully recursive” uses the recursive method for all sizes greater than 10, with the optimized triple-loop LU as the base case for blocks smaller than 16x16. The threshold is the matrix size above which the recursive algorithm is used. (Sorry for the confusing labels!)
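To make the recursion concrete, here is a heavily simplified sketch of the idea in plain Julia: factor the top-left block, triangular-solve the two off-diagonal panels, update the Schur complement, and recurse, falling back to a triple-loop kernel below the base-case size. This is illustration only, not the package’s actual implementation — notably it omits pivoting (so it is only safe on, e.g., diagonally dominant matrices), and the function name and base-case size are made up.

```julia
using LinearAlgebra

# Simplified recursive LU *without pivoting* (illustration only; the
# package uses partial pivoting and an optimized base-case kernel).
function recursive_lu!(A::AbstractMatrix, basecase::Int = 16)
    n = size(A, 1)
    if n <= basecase
        # Base case: naive triple-loop LU, analogous to generic_lufact!
        # but without the pivot search.
        @inbounds for k in 1:n-1, i in k+1:n
            A[i, k] /= A[k, k]
            for j in k+1:n
                A[i, j] -= A[i, k] * A[k, j]
            end
        end
        return A
    end
    m = n ÷ 2
    A11 = view(A, 1:m, 1:m);    A12 = view(A, 1:m, m+1:n)
    A21 = view(A, m+1:n, 1:m);  A22 = view(A, m+1:n, m+1:n)
    recursive_lu!(A11, basecase)             # A11 = L11 * U11
    ldiv!(UnitLowerTriangular(A11), A12)     # A12 = L11 \ A12  (a trsm)
    rdiv!(A21, UpperTriangular(A11))         # A21 = A21 / U11  (a trsm)
    mul!(A22, A21, A12, -1.0, 1.0)           # Schur complement (a gemm)
    recursive_lu!(A22, basecase)
    return A
end

A = rand(64, 64) + 64I   # diagonally dominant, so no pivoting needed
F = recursive_lu!(copy(A))
L, U = UnitLowerTriangular(F), UpperTriangular(F)
@assert opnorm(L * U - A) / opnorm(A) < 1e-10
```

The point of the structure is that almost all of the work lands in the trsm and gemm calls on large blocks, which is where BLAS-like kernels shine.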

OpenBLAS 0.3.9
CPU: Intel® Xeon® Silver 4114 CPU @ 2.20GHz

OpenBLAS 0.3.9
CPU: Intel® Core™ i9-10980XE CPU @ 3.00GHz

The first two examples have an optimally tuned meta-parameter.

OpenBLAS 0.3.9
CPU: Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz

This example has a badly tuned meta-parameter. The default threshold value is 192 here, but one can see that the optimal threshold value is around 80. You can override it via

RecursiveFactorization.RECURSION_THRESHOLD[] = 80

These plots were generated via:

using Pkg
Pkg.add(["RecursiveFactorization", "VegaLite", "DataFrames", "BenchmarkTools"])
using RecursiveFactorization
include(joinpath(pkgdir(RecursiveFactorization), "perf", "lu.jl"))

The script saves the plot to joinpath(homedir(), "Pictures", "lu_float64.png"). You want the threshold to be the size at which the fully recursive version overtakes the fully iterative version.

If you run this, please post the results and your CPU info here, as this will help us develop a better default algorithm!

Acknowledgment

This package uses @Elrod’s LoopVectorization.jl.

Very nice work :) Does the package also work for Matrix{ComplexF64}?

Yes, see https://github.com/YingboMa/RecursiveFactorization.jl/blob/a7901459875c5d1ad39e969ce9e16ad764de9557/test/runtests.jl#L19 . It is tested for Float64, Float32, ComplexF64, ComplexF32, and Real. Though I haven’t benchmarked Matrix{ComplexF64}, I imagine it will be slower.

FWIW, I ran this script on my laptop:

:openblas64
Julia Version 1.5.0-DEV.854
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  JULIA_NUM_THREADS = 4

and desktop:

:mkl
  CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
  JULIA_NUM_THREADS = 12

Edit: same desktop on Julia 1.4.1, without MKL:

I have run into this before, but I’m surprised just how large the gap from openblas is here:

julia> rr = rand(50,50);

julia> @btime RecursiveFactorization.lu($rr); # on Julia 1.5
  8.313 μs (4 allocations: 20.16 KiB)

julia> @btime lu($rr); # MKL, 1.5
  12.471 μs (4 allocations: 20.16 KiB)

julia> @btime lu($rr); # openblas, 1.4.1
  1.460 ms (4 allocations: 20.16 KiB)

julia> 1460/8.3
175.9036144578313

On my laptop (with 1.4 or 1.5) it’s around 20 μs which is more reasonable.

On an old desktop

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, sandybridge)
Environment:
  JULIA_EDITOR = atom  -a
  JULIA_NUM_THREADS = 4
  JULIA_CUDA_VERBOSE = true

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 2

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

julia> versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 2950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, znver1)

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

Thank you for running the benchmark!

Note that RecursiveFactorization calls ldiv!, which in turn invokes the *trtrs routines provided by your BLAS. Hence, RecursiveFactorization’s performance is highly dependent on how well OpenBLAS performs. Unfortunately, from the benchmarks above, it seems that the threshold value is highly dependent on the performance of *trtrs, as only the recursive algorithm calls ldiv!. This makes the threshold almost impossible to tune for all computers. The next step might be to implement fast *trtrs/*trsm in Julia.
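For reference, the *trsm operation being discussed is just a triangular solve with multiple right-hand sides. A naive, unblocked Julia version — nothing like a tuned kernel, just a sketch of what would need optimizing — might look like this (function name is made up):

```julia
using LinearAlgebra

# Naive column-oriented forward substitution: overwrites B with L \ B,
# the operation that dtrsm/dtrtrs perform. A real kernel would block
# and vectorize this.
function naive_trsm!(L::AbstractMatrix, B::AbstractMatrix)
    n, m = size(B)
    @inbounds for j in 1:m
        for k in 1:n
            B[k, j] /= L[k, k]
            x = B[k, j]
            for i in k+1:n
                B[i, j] -= L[i, k] * x
            end
        end
    end
    return B
end

L = LowerTriangular(rand(8, 8) + 8I)
B = rand(8, 3)
X = naive_trsm!(Matrix(L), copy(B))
@assert X ≈ L \ B
```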

I think it’s also worth pointing out that the primary goal was to perform better under “default settings” for sizes < 100, which it looks like it is doing fairly consistently, often by large margins.

“Default settings” meant not calling BLAS.set_num_threads(1). OpenBLAS makes very poor use of threads here: restricting it to a single thread substantially boosted its performance on the computers I tested!
This is also probably a factor in why OpenBLAS did worse on the higher core-count systems, but did fairly well on laptops and old desktops.
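For anyone re-running the benchmarks, pinning OpenBLAS to one thread is a one-liner with the standard LinearAlgebra API:

```julia
using LinearAlgebra

# Restrict the BLAS to a single thread so its threading overhead
# doesn't dominate small-matrix factorizations.
BLAS.set_num_threads(1)

A = rand(50, 50)
F = lu(A)                      # OpenBLAS now runs single-threaded
@assert F.L * F.U ≈ A[F.p, :]  # sanity check: P*A = L*U
```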

MKL, on the other hand, makes very good use of threads. My tests were consistent with what mcabbott found:

Also, because RecursiveFactorization is using *trtrs, you can see that it actually does better when paired with MKL itself (around 55 vs 45 GFLOPS in the two plots above).

Setting 1 thread will probably shrink the gap at the upper end of the size range in all cases: MKL will no longer be quite so far ahead, and (on systems where it was tripping over its own threads) OpenBLAS won’t be quite so far behind.

Aside from making performance more predictable, so that setting the recursion threshold will be easier, I do think a Julia *trtrs will further help the package’s primary objective: increasing small-size (<100) performance over OpenBLAS, and even being competitive with MKL over that range, regardless of threading and without the DiffEq ecosystem having to mess with the user’s settings.

Here are my results on a MacBook Pro Ice Lake.

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, goldmont)

And I repeated it with 1.6:

Julia Version 1.6.0-DEV.306
Commit 59b8dde7c1 (2020-06-26 09:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, icelake-client)

Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = atom.cmd -a
  JULIA_NUM_THREADS = 6
  JULIA_PKG_SERVER = pkg.julialang.org