How to time julia's choleskey factorization?

Hi,

I’m trying to compare timings of julia’s cholesky factorization and LAPACK.potrf! function.

The timings of julia’s cholesky function is varying from 50 \mu s to 100 \mu s and LAPACK.potrf! is blazingly fast varying from 110 ns to 113 ns.

Why are cholesky function timings varying widely? Am I timing it correctly?

Here is the code I’ve used:

using LinearAlgebra
using BenchmarkTools

x = rand(1000, 1000);
pd = x * x'

@btime cholesky($pd);
@btime LAPACK.potrf!('L', $pd);

potrf! is probably exiting early due to the mutated matrix no longer being pd and is returning an error code.
Meanwhile, cholesky is not in place. It performs a symmetry check and then successfully factorizes.

2 Likes

You are correct. potrf! is exiting with info = 2.

For what it’s worth, if I change the benchmark to use setup, I get these timings:

julia> @btime LAPACK.potrf!('L', y) setup=(y = copy(pd));
  9.175 ms (1 allocation: 32 bytes)

julia> @btime cholesky(y) setup=(y = copy(pd));
  12.415 ms (5 allocations: 7.63 MiB)
1 Like

I’ve used the @time macro. potrf! is as fast as cholesky, I’ve expected potrf! to be faster. In your opinion is cholesky function as fast as possible or are there any opportunities to speed it up?

Maybe - since you’re using the non-mutating version, there’s some overhead for creating the new array. See this example with larger sizes, to see how it scales:

julia> size(pd)
(10000, 10000)

julia> @btime cholesky(y) setup=(y = copy(pd));
  6.689 s (5 allocations: 762.94 MiB)

julia> @btime LAPACK.potrf!('U', y) setup=(y = copy(pd));
  5.210 s (1 allocation: 32 bytes)

Whereas the mutating version has this for the same size:

julia> @btime cholesky!(y) setup=(y = copy(pd));
  5.701 s (3 allocations: 80 bytes)

And for the smaller size:

julia> @btime cholesky!(y) setup=(y = copy(pd)) evals=1;
  11.267 ms (3 allocations: 80 bytes)

I’ve used evals=1 here to prevent the same array being used twice.

There also seems to be some redundant checks in that specific code path, so there might be some gains there as well (checking for squareness twice, for example…).

I would suggest using Symmetric or Hermitian to skip the symmetry check.

julia> using LinearAlgebra

julia> BLAS.vendor()
:openblas64

julia> pd = randn(10_000, 12_000) |> x -> x * x';

julia> BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2) # 18

julia> @time cholesky(pd);
  0.696583 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.499170 seconds (6 allocations: 762.940 MiB, 1.21% gc time)

julia> @time cholesky(pd);
  0.678071 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.497463 seconds (6 allocations: 762.940 MiB, 1.13% gc time)

julia> @time cholesky(pd);
  0.679534 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.500402 seconds (6 allocations: 762.940 MiB, 1.06% gc time)

julia> @time LAPACK.potrf!('U', copy(pd));
  0.495150 seconds (3 allocations: 762.940 MiB)

julia> @time LAPACK.potrf!('U', copy(pd));
  0.496955 seconds (3 allocations: 762.940 MiB, 0.83% gc time)

Another option to improve performance is to use MKL.jl:

julia> using LinearAlgebra

julia> BLAS.vendor()
:mkl

julia> pd = randn(10_000, 12_000) |> x -> x * x';

julia> BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2) # 18

julia> @time cholesky(pd);
  0.812514 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.467005 seconds (6 allocations: 762.940 MiB, 1.92% gc time)

julia> @time cholesky(pd);
  0.640539 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.465069 seconds (6 allocations: 762.940 MiB, 1.39% gc time)

julia> @time cholesky(pd);
  0.640831 seconds (5 allocations: 762.940 MiB)

julia> @time cholesky(Symmetric(pd));
  0.465175 seconds (6 allocations: 762.940 MiB, 1.28% gc time)

julia> @time LAPACK.potrf!('U', copy(pd));
  0.458559 seconds (3 allocations: 762.940 MiB)

julia> @time LAPACK.potrf!('U', copy(pd));
  0.462918 seconds (3 allocations: 762.940 MiB, 0.87% gc time)

Although the performance is close after setting them to both use 1 thread per physical core. OpenBLAS defaulted to only using 8 on my 18 core computer. Most of MKL’s advantage seemed to be in using more cores by default.

2 Likes

BTW, note that you need to set evals=1 otherwise your mutating potrf! will be called multiple times per setup, which could skew your results. See How to benchmark append!? and the other conversations linked from there.

You do realize that Julia’s cholesky function calls LAPACK potrf! (for Matrix{Float64} types), right? They are not separate implementations.

Once you fix the bugs your benchmark code, you’re mainly just measuring the overhead of allocating a new array versus working in-place (same as cholesky vs. cholesky!).

5 Likes