How to time julia's choleskey factorization?

vkv · October 2, 2020, 5:06am

Hi,

I’m trying to compare timings of julia’s cholesky factorization and LAPACK.potrf! function.

The timings of julia’s cholesky function is varying from 50 \mu s to 100 \mu s and LAPACK.potrf! is blazingly fast varying from 110 ns to 113 ns.

Why are cholesky function timings varying widely? Am I timing it correctly?

Here is the code I’ve used:

using LinearAlgebra
using BenchmarkTools

x = rand(1000, 1000);
pd = x * x'

@btime cholesky($pd);
@btime LAPACK.potrf!('L', $pd);

Elrod · October 2, 2020, 5:50am

potrf! is probably exiting early due to the mutated matrix no longer being pd and is returning an error code.
Meanwhile, cholesky is not in place. It performs a symmetry check and then successfully factorizes.

vkv · October 2, 2020, 6:04am

You are correct. potrf! is exiting with info = 2.

Sukera · October 2, 2020, 7:14am

For what it’s worth, if I change the benchmark to use setup, I get these timings:

julia> @btime LAPACK.potrf!('L', y) setup=(y = copy(pd));
  9.175 ms (1 allocation: 32 bytes)

julia> @btime cholesky(y) setup=(y = copy(pd));
  12.415 ms (5 allocations: 7.63 MiB)

vkv · October 2, 2020, 7:39am

I’ve used the @time macro. potrf! is as fast as cholesky, I’ve expected potrf! to be faster. In your opinion is cholesky function as fast as possible or are there any opportunities to speed it up?

Sukera · October 2, 2020, 7:59am

Maybe - since you’re using the non-mutating version, there’s some overhead for creating the new array. See this example with larger sizes, to see how it scales:

julia> size(pd)
(10000, 10000)

julia> @btime cholesky(y) setup=(y = copy(pd));
  6.689 s (5 allocations: 762.94 MiB)

julia> @btime LAPACK.potrf!('U', y) setup=(y = copy(pd));
  5.210 s (1 allocation: 32 bytes)

Whereas the mutating version has this for the same size:

julia> @btime cholesky!(y) setup=(y = copy(pd));
  5.701 s (3 allocations: 80 bytes)

And for the smaller size:

julia> @btime cholesky!(y) setup=(y = copy(pd)) evals=1;
  11.267 ms (3 allocations: 80 bytes)

I’ve used evals=1 here to prevent the same array being used twice.

There also seems to be some redundant checks in that specific code path, so there might be some gains there as well (checking for squareness twice, for example…).

Elrod · October 2, 2020, 2:22pm

I would suggest using Symmetric or Hermitian to skip the symmetry check.

julia> using LinearAlgebra

julia> BLAS.vendor()
:openblas64

julia> pd = randn(10_000, 12_000) |> x -> x * x';

julia> BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2) # 18

julia> @time cholesky(pd);
  0.696583 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.499170 seconds (6 allocations: 762.940 MiB, 1.21% gc time)

julia> @time cholesky(pd);
  0.678071 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.497463 seconds (6 allocations: 762.940 MiB, 1.13% gc time)

julia> @time cholesky(pd);
  0.679534 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.500402 seconds (6 allocations: 762.940 MiB, 1.06% gc time)

julia> @time LAPACK.potrf!('U', copy(pd));
  0.495150 seconds (3 allocations: 762.940 MiB)

julia> @time LAPACK.potrf!('U', copy(pd));
  0.496955 seconds (3 allocations: 762.940 MiB, 0.83% gc time)

Another option to improve performance is to use MKL.jl:

julia> using LinearAlgebra

julia> BLAS.vendor()
:mkl

julia> pd = randn(10_000, 12_000) |> x -> x * x';

julia> BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2) # 18

julia> @time cholesky(pd);
  0.812514 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.467005 seconds (6 allocations: 762.940 MiB, 1.92% gc time)

julia> @time cholesky(pd);
  0.640539 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.465069 seconds (6 allocations: 762.940 MiB, 1.39% gc time)

julia> @time cholesky(pd);
  0.640831 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.465175 seconds (6 allocations: 762.940 MiB, 1.28% gc time)

julia> @time LAPACK.potrf!('U', copy(pd));
  0.458559 seconds (3 allocations: 762.940 MiB)

julia> @time LAPACK.potrf!('U', copy(pd));
  0.462918 seconds (3 allocations: 762.940 MiB, 0.87% gc time)

Although the performance is close after setting them to both use 1 thread per physical core. OpenBLAS defaulted to only using 8 on my 18 core computer. Most of MKL’s advantage seemed to be in using more cores by default.

rdeits · October 2, 2020, 2:37pm

BTW, note that you need to set evals=1 otherwise your mutating potrf! will be called multiple times per setup, which could skew your results. See How to benchmark append!? - #7 by rdeits and the other conversations linked from there.

stevengj · October 2, 2020, 2:44pm

You do realize that Julia’s cholesky function calls LAPACK potrf! (for Matrix{Float64} types), right? They are not separate implementations.

Once you fix the bugs your benchmark code, you’re mainly just measuring the overhead of allocating a new array versus working in-place (same as cholesky vs. cholesky!).

Topic		Replies	Views
Why the time consumed of LAPACK.potrf! increase hugely when the dimension is bigger than a special number Numerics	4	612	June 7, 2022
Slow Cholesky factorization for sparse matrices? Performance linearalgebra , sparse	6	925	September 19, 2020
Linear solver \(A, B) performance vs Matlab A\b General Usage	32	7822	May 21, 2017
Performance gotcha in linear algebra lu() General Usage performance , linearalgebra	33	3722	February 11, 2020
Slow lower triangular solves when compared to full Cholesky Numerics numerics	6	740	September 14, 2021

How to time julia's choleskey factorization?

Related topics