Ldiv for Cholesky is slower than two substitutions

I’ve found that ldiv for Cholesky is unexpectedly slower than performing the two triangular substitutions myself.

Here is a test problem with a random 1000-by-1000 Matrix{Float64}.

julia> using BenchmarkTools, LinearAlgebra

julia> n = 1000; A = rand(n, n); A = A' * A; b = rand(n);

Results for ldiv on the factorization object:

julia> Achol = cholesky(A);

julia> @benchmark $Achol \ $b
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  315.875 μs … 531.959 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     317.833 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   319.034 μs ±   4.421 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄██▇▇▇
  ▃██████▇▃▁▂▄▆▅▇█▆▄▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  316 μs           Histogram: frequency by time          336 μs <

 Memory estimate: 8.00 KiB, allocs estimate: 1.

Results for two substitutions (the factors of Achol are UpperTriangular/LowerTriangular):

julia> norm(Achol.U \ (Achol.L \ b) - Achol \ b, Inf)
1.7053025658242404e-12

julia> @benchmark U \ (L \ $b) setup=(L=$Achol.L; U=$Achol.U)
BenchmarkTools.Trial: 4178 samples with 1 evaluation.
 Range (min … max):  189.958 μs … 341.667 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     204.875 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   205.505 μs ±   7.942 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▅▇▆██▆▇▇▅▄▆▅▄▄▄▄▆▄▂▆▃▁▅▅▁▃▅▂▃▂▂▄▁▁
  ▂▁▂▃▂▃▃▄▆▆████████████████████████████████████▆▆▄▄▃▃▂▃▁▂▁▂▁▂▁ ▅
  190 μs           Histogram: frequency by time          224 μs <

 Memory estimate: 16.00 KiB, allocs estimate: 2.

Performance of the two-substitution version above is close to that of ldiv for LU.

julia> Alu = lu(A);

julia> @benchmark $Alu \ $b
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  176.708 μs … 451.208 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     178.750 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   188.390 μs ±  32.983 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▂             ▃▃▃                                           ▁
  ████▇█▇▆▄▃▃▁▁▃▁▁███▇▃▃▅▁▄▃▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆▆▃▅▆▇ █
  177 μs        Histogram: log(frequency) by time        400 μs <

 Memory estimate: 8.00 KiB, allocs estimate: 1.

I ran the examples above on my MacBook. Here is the corresponding versioninfo.

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M2 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 6 virtual cores)

But I’ve run into a similar situation with the following setups (can’t provide versioninfo for them):

  • Julia 1.10, Ubuntu 22.04, build for Generic Linux on x86 (so OpenBLAS should be under the hood)
  • Julia 1.11 on Windows 10, 64-bit installer

I expect Achol \ b to be faster than Alu \ b. Am I measuring performance wrong, or is Achol \ b buggy?

Using the code below, I get that the former (Achol \ b) is much faster than the latter for n = 1000. The difference is that you’re using the same matrix every time.

function cholera(n)
    A = rand(n, n)
    C = cholesky(A'A)
    b = rand(n)
    return C, b
end

f1(Achol, b) = Achol \ b

f2(Achol, b) = Achol.U \ (Achol.L \ b)

@benchmark f1(C, b) setup=((C, b) = cholera(n);)

@benchmark f2(C, b) setup=((C, b) = cholera(n);)
BenchmarkTools.Trial: 293 samples with 1 evaluation per sample.
 Range (min … max):  574.176 μs … 751.006 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     601.859 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   606.899 μs ±  27.591 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▂▃  ▅▇█▃▃▂▂                                                
  ▃▁▃████▇███████▆▅▂▃▃▃▃▃▂▂▁▁▁▂▁▂▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▁▂▂▁▃ ▃
  574 μs           Histogram: frequency by time          739 μs <
BenchmarkTools.Trial: 246 samples with 1 evaluation per sample.
 Range (min … max):  2.490 ms … 8.196 ms  ┊ GC (min … max):  0.00% …  0.00%
 Time  (median):     3.565 ms             ┊ GC (median):     0.00%
 Time  (mean ± σ):   4.447 ms ± 1.719 ms  ┊ GC (mean ± σ):  11.05% ± 16.52%

  ▅▂      ▂▆ █                   ▁▂                      ▂▄  
  ██▁▁▁▁▁▁██▆██▄▄▁▁▁▁▁▁▁▁▄▄▁▁▁▇▆███▄▆▇▆▄▁▁▁▁▇▇█▁▁▁▁▁▄▁▄▄▆██ ▆
  2.49 ms     Histogram: log(frequency) by time     7.79 ms <

Note that this allocates a matrix copy (in Julia 1.11) to obtain the Achol.L matrix each time, and it also allocates additional intermediate vectors, which will slow things down.

The Cholesky ldiv! method calls the LAPACK dpotrs routine, whereas the UpperTriangular and LowerTriangular methods for ldiv! call the LAPACK dtrtrs function.
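For reference, both LAPACK paths can be exercised directly through Julia’s wrappers (a minimal sketch; LAPACK.potrs! and LAPACK.trtrs! are the existing LinearAlgebra wrappers, and the SPD test matrix here is made up for illustration):

```julia
using LinearAlgebra

n = 100
A = rand(n, n); A = A' * A + n * I   # well-conditioned SPD test matrix
b = rand(n)
F = cholesky(A)                      # stores the factor with F.uplo == 'U'

# One dpotrs call solves A x = b directly from the stored factor.
x1 = LAPACK.potrs!(F.uplo, F.factors, copy(b))

# Two dtrtrs calls do the same solve as explicit substitutions:
y  = LAPACK.trtrs!('U', 'T', 'N', F.factors, copy(b))  # solve U' y = b
x2 = LAPACK.trtrs!('U', 'N', 'N', F.factors, y)        # solve U  x = y
```

Both should give the same solution as A \ b up to rounding; only the kernels they reach inside the BLAS differ.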

(You might also want to turn off BLAS multi-threading, as that tends to make benchmark results more confusing.)


Oh, it seems that the LAPACK routines are optimized for the matrix-matrix system AX = B rather than the matrix-vector system Ax = b. Performance for the matrix-matrix case looks correct.

julia> using BenchmarkTools, LinearAlgebra

julia> Threads.nthreads()
1

julia> n = 1000; A = rand(n, n); A = A' * A; B = rand(n, n);

julia> Achol = cholesky(A); Alu = lu(A);

julia> @btime $Achol \ $B;
  8.391 ms (2 allocations: 7.63 MiB)

julia> @btime $Alu \ $B;
  8.499 ms (2 allocations: 7.63 MiB)

julia> @btime U \ (L \ $B) setup=(L=$Achol.L; U=$Achol.U);
  8.774 ms (4 allocations: 15.26 MiB)

To elaborate on what @stevengj said: You’re pulling the most costly part of the computation out into the setup here, so you’re only measuring a small part of the actual cost of the operation. In a fair benchmark, Achol \ b wins handily. Here’s what I find:

julia> @benchmark $Achol \ $b
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  248.917 μs … 388.625 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     261.416 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   263.493 μs ±   9.988 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▂   ▃██▄▂▃▃▃▂▁
  ▃██▇▆▆████████████▆▇▇▇▇▆▆▆▇▆▇▇▇▇▇▇▇▇▇▆▅▅▅▆▅▅▄▄▄▄▃▃▂▂▂▃▂▂▂▂▂▂▂ ▄
  249 μs           Histogram: frequency by time          289 μs <

 Memory estimate: 8.00 KiB, allocs estimate: 1.

julia> @benchmark U \ (L \ $b) setup=(U = $Achol.U; L = $Achol.L)  # Unfair
BenchmarkTools.Trial: 5858 samples with 1 evaluation per sample.
 Range (min … max):  132.750 μs … 227.459 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     145.292 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   145.427 μs ±   5.723 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▁▁▄▄▄▄▄▄▃▂▄▄▄▄▆▅█▆▅▄▆▄▄▅▅▂▃
  ▁▁▁▁▂▃▄▅▄▇▇███████████████████████████▇▇▅▆▄▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁ ▅
  133 μs           Histogram: frequency by time          162 μs <

 Memory estimate: 16.00 KiB, allocs estimate: 2.

julia> myldiv(chol, x) = chol.U \ (chol.L \ x)
myldiv (generic function with 1 method)

julia> @benchmark myldiv($Achol, $b)  # Fair, but slower
BenchmarkTools.Trial: 5796 samples with 1 evaluation per sample.
 Range (min … max):  659.125 μs …   2.103 ms  ┊ GC (min … max): 0.00% … 57.96%
 Time  (median):     765.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   862.006 μs ± 209.153 μs  ┊ GC (mean ± σ):  9.99% ± 14.68%

    ▁▂▂▆██▆▄▄▄▄▃▂▂▂▁▁                   ▁▁▂▂▃▃▃▂▂▂▁▁▁           ▂
  ▆▆█████████████████▇▆▇▅▁▄▄▃▃▁▁▁▁▁▁▁▁▄▆████████████████▇█▆▆▆▅▄ █
  659 μs        Histogram: log(frequency) by time        1.5 ms <

 Memory estimate: 7.65 MiB, allocs estimate: 4.

Let me clarify the context. For the actual solver I have an “infinite” time budget for preparation but a limited time budget for real-time computations (offline/online computations). E.g., for me it’s fair to compute cholesky(A) or Achol.L/Achol.U and store it on a hard drive. I’m interested only in the performance of the real-time computation of A \ b. That’s why putting Achol.L and Achol.U into setup is valid and zero-cost for me.
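For that offline/online split, the factorization can be computed once and persisted; a minimal sketch using the Serialization stdlib (an in-memory buffer stands in for the file on disk here):

```julia
using LinearAlgebra, Serialization

# Offline: factor once (the "infinite" time budget) and persist the result.
n = 100
A = rand(n, n); A = A' * A + n * I   # well-conditioned SPD test matrix
io = IOBuffer()                      # stands in for a file on disk
serialize(io, cholesky(A))

# Online: load the stored factorization and solve within the real-time budget.
seekstart(io)
F = deserialize(io)
b = rand(n)
x = F \ b                            # no refactorization needed
```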

To be clear, the source code of the reference dpotrs routine simply calls dtrsm twice, whereas dtrtrs also calls dtrsm. So, in principle, there should be no advantage to doing the triangular solves yourself.

Turns out it’s faster to avoid the extra preparation anyway. You can use an adjoint wrapper instead of materializing L, and the resulting computations are even faster:

julia> function myldiv(chol, x)
           if chol.uplo == 'U'
               return ldiv!(chol.U, chol.U' \ x)
           elseif chol.uplo == 'L'
               return ldiv!(chol.L', chol.L \ x)
           else
               error("unreachable")
           end
       end
myldiv (generic function with 1 method)

julia> @benchmark myldiv($Achol, $b)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  120.500 μs … 276.959 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     129.959 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   130.370 μs ±   8.761 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▁                 █
  ██▆▄▅▅▅▄▄▄▃▃▃▃▃▃▃▄▄██▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  120 μs           Histogram: frequency by time          166 μs <

 Memory estimate: 8.00 KiB, allocs estimate: 1.

Compare this to my timings above for the U \ (L \ $b) version on the same computer:

 Range (min … max):  132.750 μs … 227.459 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     145.292 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   145.427 μs ±   5.723 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▁▁▄▄▄▄▄▄▃▂▄▄▄▄▆▅█▆▅▄▆▄▄▅▅▂▃
  ▁▁▁▁▂▃▄▅▄▇▇███████████████████████████▇▇▅▆▄▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁ ▅
  133 μs           Histogram: frequency by time          162 μs <

 Memory estimate: 16.00 KiB, allocs estimate: 2.

(Note that I used ldiv! to avoid an extra allocation in the new version, but that doesn’t significantly affect the timings, so the comparison is still valid. The O(N) allocation cost is dwarfed by the O(N^2) computation cost at this size.)
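For completeness, the same adjoint-wrapper trick can be made fully in-place with a caller-provided buffer (a sketch; myldiv! is a hypothetical name, not a LinearAlgebra API):

```julia
using LinearAlgebra

# Fully in-place variant of the adjoint-wrapper solve (hypothetical sketch).
function myldiv!(out::AbstractVector, chol::Cholesky, x::AbstractVector)
    copyto!(out, x)
    if chol.uplo == 'U'
        ldiv!(chol.U', out)   # solve U'y = x; the adjoint wrapper avoids copying the factor
        ldiv!(chol.U, out)    # solve Ux = y
    else
        ldiv!(chol.L, out)    # solve Ly = x
        ldiv!(chol.L', out)   # solve L'x = y
    end
    return out
end

n = 100
A = rand(n, n); A = A' * A + n * I   # well-conditioned SPD test matrix
F = cholesky(A)
b = rand(n)
x = myldiv!(similar(b), F, b)
```

Reusing the output buffer across solves avoids even the single intermediate vector of the out-of-place version.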


Thanks for this snippet! It works for me. In your environment, is Achol \ b slower than myldiv(Achol, b)?

Considering the answer of @danielwe:

Should LinearAlgebra have a method ldiv!(::Cholesky, ::StridedVector)?
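A sketch of what such a method might look like, built on two BLAS-2 trsv! calls instead of the BLAS-3 path taken by dpotrs (chol_ldiv! is an illustrative name, not an actual LinearAlgebra implementation):

```julia
using LinearAlgebra

# Hypothetical in-place vector solve for a Cholesky factorization,
# overwriting x with A \ x via two triangular BLAS-2 solves.
function chol_ldiv!(C::Cholesky{Float64}, x::StridedVector{Float64})
    if C.uplo == 'U'
        BLAS.trsv!('U', 'T', 'N', C.factors, x)   # solve U'y = b
        BLAS.trsv!('U', 'N', 'N', C.factors, x)   # solve Ux = y
    else
        BLAS.trsv!('L', 'N', 'N', C.factors, x)   # solve Ly = b
        BLAS.trsv!('L', 'T', 'N', C.factors, x)   # solve L'x = y
    end
    return x
end

n = 100
A = rand(n, n); A = A' * A + n * I   # well-conditioned SPD test matrix
F = cholesky(A)
b = rand(n)
x = chol_ldiv!(F, copy(b))
```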

Yes, on mine (for the 1000x1000 test case), benchmarking with 1 thread:

julia> BLAS.set_num_threads(1)

julia> @btime $Achol \ $b;
  314.125 μs (3 allocations: 8.06 KiB)

julia> @btime myldiv($Achol, $b);
  147.167 μs (3 allocations: 8.06 KiB)

Seems like an issue in OpenBLAS.


I think the difference here is between OpenBLAS’s trsv and trsm. As @stevengj pointed out, Achol \ b dispatches to LAPACK.potrs!, which calls trsm. While reference LAPACK’s trtrs simply calls trsm, this is not the case for OpenBLAS’s trtrs, which branches on n == 1. So the main performance difference between the two benchmarked solutions seems to come from differences between OpenBLAS’s trsv and trsm, which is confirmed below. I ran this on an Intel Mac. It would be interesting to see these numbers for MKL.

julia> @benchmark BLAS.trsv!('U', 'N', 'N', $(Achol).factors, copy($b))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  77.639 μs … 311.568 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     83.043 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   87.832 μs ±  14.097 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▃▆▇█▇▅▃▂▄▄▄▂▁▃▃▂▁         ▁▁  ▁            ▁                ▂
  ███████████████████▇▆▇▅▆▇▇▇██▇██▇▇▅▅█▆▅▆▇▄▃▆██▅▆██▆▅▅▄▃▄▃▄▄▄ █
  77.6 μs       Histogram: log(frequency) by time       144 μs <

 Memory estimate: 8.06 KiB, allocs estimate: 3.

julia> @benchmark BLAS.trsm!('L', 'U', 'N', 'N', 1.0, $(Achol).factors, copy($B))
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  245.243 μs … 855.663 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     267.437 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   276.065 μs ±  40.048 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▃█▆▅▃▇▆▆▄▁▁▃▃▂▁▁▂▃▂▂▃▂▁▂  ▁  ▂                               ▂
  ████████████████████████████████▆▇▇▇▇█▆▇▆▆▇▄▅▆▅▅▇▅▅▆▅▄▆▃▃▄▃▅▄ █
  245 μs        Histogram: log(frequency) by time        445 μs <

 Memory estimate: 8.08 KiB, allocs estimate: 3.

(I should mention that B = reshape(b, length(b), 1).)


Yes. Here are my timings for Achol \ b (copied from my original post above). myldiv is about twice as fast.

 Range (min … max):  248.917 μs … 388.625 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     261.416 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   263.493 μs ±   9.988 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%