Slow lower triangular solves when compared to full Cholesky

Can somebody here help me understand why, in the code below, solving with the L from A = LL^T is an order of magnitude slower than just solving with A? Here is an MWE:

using BenchmarkTools, LinearAlgebra

if !(@isdefined Mf) # because I lazily re-run in a REPL.
  const A  = randn(5_000, 5_000)
  const Mf = cholesky!(Symmetric(A'A))
end

# With:
# BLAS.get_num_threads() == 4
# BLAS.vendor() == :mkl
# versioninfo() == (v1.6.2, official binary, etc)

print("Full solve:") # 5.050 ms, 0 alloc.
@btime ldiv!($Mf, x) (setup=x=randn(5_000))
print("Cholesky factor solve:") # 59.8 ms, 3 alloc, 190 MiB.
@btime ldiv!($Mf.L, x) (setup=x=randn(5_000))

At larger matrix sizes, the difference is even more pronounced. For a 22k x 22k matrix, it was 0.1 seconds vs 1.5 seconds. Looking at the which(LinearAlgebra.ldiv!, ...) outputs, it looks like with the triangular Mf.L Julia is calling trtrs!, whereas for Mf it is calling potrs!. I am sure that potrs! is doing something smarter than just two sequential triangular solves, but it seems a little surprising that this ldiv! makes big allocations and is that much slower. Intuitively, shouldn’t the solve with a triangular factor be faster, if anything?
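For reference, this is roughly how I checked what gets dispatched (in the REPL, so InteractiveUtils’ @which is available); the comments are just my reading of the methods it reports:

x = randn(5_000)
@which ldiv!(Mf, x)    # the Cholesky method, which goes through LAPACK.potrs!
@which ldiv!(Mf.L, x)  # the triangular method, which goes through LAPACK.trtrs!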

If it matters, I have an Intel CPU with AVX-512, and as the snippet above says, I use MKL with 4 BLAS threads. From looking around, I see other posts on the forum saying that trtrs! is slower by a factor of two or three than other triangular solvers, but I haven’t seen this particular issue. Apologies if this is a rehash of a known issue or discussion.

EDIT:
Okay, there is definitely something confusing happening here. I briefly deleted this topic because at first I thought I had just made a mistake with my BenchmarkTools interpolation. When I did

@btime ldiv!($(Mf.L), x) setup=(x=randn(5000))

then the times looked reasonable. But the issue persists when I just do

const _x = randn(5000)
@time Mf.L \ _x # 0.05 seconds
@time Mf \ _x   # 0.005 seconds

So it’s still an order of magnitude difference in this less careful benchmark. I honestly am not sure anymore if this is just a weird benchmarking issue or if trtrs! vs potrs! is actually causing a problem.

So many edits. Sorry everyone.


Doesn’t Mf.L first have to copy the triangular factor into a newly allocated matrix? This has O(n^2) complexity similar to the solve itself, so it is quite a significant cost.

When you interpolate $(Mf.L) into @btime, you are precomputing this copy operation, removing it from the benchmark time.


Indeed I get big differences as suggested by @stevengj:

julia> @btime ldiv!($Mf.L, x) setup=(x=randn(5000));
  169.805 ms (3 allocations: 190.73 MiB)

julia> @btime Mf.L;
  167.914 ms (3 allocations: 190.73 MiB)

Inspection reveals that Mf.L calls getproperty, which does a copy and explains almost all the time and allocations. So let’s do the copy in setup:

julia> @btime ldiv!(MfL,x) setup=(MfL=Mf.L;x=randn(5000));
  5.993 ms (0 allocations: 0 bytes)

I believe you’re not intended to use .L in performance-critical code; the intended usage for most people is just to do the solve with the factorization object.


In some applications you need it. But in that case you can just retrieve the L factor once, when the factorization is constructed — it’s much cheaper than computing the Cholesky factorization in the first place.
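E.g., something along these lines (just a sketch, reusing A and Mf from the MWE above):

# Pay the O(n^2) copy once, right after the factorization is computed:
L = Mf.L                    # the eager copy of the triangular factor happens here, once

# ... later, in performance-critical code, reuse L directly:
x = randn(size(L, 1))
ldiv!(L, x)                 # just the triangular solve; no further allocation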


Thank you both! I should have thought to look in the source at the difference in getproperty for :L and :U. I was under the impression that Mf.L was giving me a slightly fancier or wrapped version of Mf.U', but it actually does make the eager copy.

It’s sort of an interesting performance “gotcha”, because spamming @time Mf.U'\x in the REPL in the same setting shows that even that is faster than @time Mf\x. It makes sense: looking at the source and examining Mf.factors, it’s clear that LAPACK stores the actual output in the U factor, and there is no uplo option for cholesky!(M::Matrix, [...]). I’m a bit curious about the eager copy, though. I would have expected a function like lowerfactor(...) (or whatever) to be the eager, copying version, and getproperty with :L to return the lazy Mf.U'. Is there a simple explanation for that choice?
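Side note: if I’m reading the dispatch right, tagging the matrix with :L before factorizing should give a lower-stored factorization, so that .L is the no-copy view. A quick sketch (not something I’ve benchmarked):

using LinearAlgebra

# Tag the matrix with :L so the lower triangle is the stored factor. Then
# Mf2.uplo == 'L', Mf2.L wraps the stored factor without a copy, and it is
# Mf2.U that would pay for copy(factors').
B   = randn(1_000, 1_000)
Mf2 = cholesky!(Symmetric(B'B, :L))
Mf2.uplo # 'L'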

In any case, I appreciate the sanity check about this (and the explanation about why the @btime interpolation was changing things). Thanks again!

Actually, I think you are right; there may be some performance left on the table. There is no field U or L but rather factors, which can store U or L depending on the field uplo. In this case it’s set to U, so you are right: Mf.U is blazingly fast. But if you ask for the other triangle, it makes a curious choice:

    if d === :U
        return UpperTriangular(sym_uplo(Cuplo) == d ? Cfactors : copy(Cfactors'))

This says that asking for U when L is stored (and, in the analogous :L branch, asking for L when U is stored) copies the adjoint and wraps it, whereas it seems like it could have just taken a lazy adjoint of a view, which wouldn’t need a copy. (I’m not an expert, so please correct me if I’m way off base.)

Though in my mind, the problem isn’t just the eagerness of the copy, but also the fact that you wanted to look at it. It’s like Schroedinger’s cat, which we all know allocates memory once you try to observe its state :).

I wonder whether you really need L and what you plan to do with it. When you do a solve, there’s no need to materialize L or U explicitly: it’s known that there’s a square factors matrix of which only one triangle is needed. So if getproperty can’t be improved, maybe your code can access factors directly and do the right thing with it.
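Something like this is what I have in mind (just a sketch, assuming uplo == 'U' as in your MWE, and untested on my end):

# Work with the stored factor directly, with no eager copies.
U = UpperTriangular(Mf.factors)   # lazy wrapper around the stored triangle
x = randn(size(U, 1))
y = U' \ x                        # same as L \ x, via the lazy adjoint
z = U \ x                         # same as L' \ x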


I hear you. In general I don’t need a specific factor. But for some applications of Hutchinson-type stochastic trace estimation, having access to a symmetric factorization can be enormously beneficial. I shouldn’t let the discussion in this (solved) thread drift too much, but I often need to estimate traces like \text{tr}(A^{-1} B), where A = L L^T. And as it turns out, estimating the trace with

\frac{1}{M}\sum_{j=1}^M \textbf{u}_j^T L^{-1} B L^{-T} \textbf{u}_j

instead of

\frac{1}{M}\sum_{j=1}^M \textbf{u}_j^T A^{-1} B \textbf{u}_j

(where the \textbf{u}_j vectors have iid Rademacher components) has substantial benefits in variance reduction. For the specific application in Gaussian processes that I work on, there is a brief discussion that the symmetrized version is at least as good here, although the theorem itself pertains to the non-symmetrized estimator.
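For concreteness, here’s a rough sketch of what I mean (the function name and body are just illustrative, not from any package):

using LinearAlgebra

# Symmetrized Hutchinson estimator for tr(A^{-1} B) with A = L L^T:
# (1/M) * sum_j u_j' L^{-1} B L^{-T} u_j, with Rademacher probes u_j.
# Writing y = L^{-T} u = U \ u (since L = U'), each term is y' B y.
function symtrace_est(Mf::Cholesky, B::AbstractMatrix, M::Int)
    n   = size(B, 1)
    U   = Mf.U                    # no copy when Mf.uplo == 'U'
    acc = 0.0
    for _ in 1:M
        u = rand([-1.0, 1.0], n)  # Rademacher probe
        y = U \ u                 # y = L^{-T} u, a single triangular solve
        acc += dot(y, B * y)
    end
    return acc / M
end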

In any case, I’ll just work with U instead and everything is fine. At this point I’m just complaining because I personally would prefer Mf.L to give me the lazy Mf.U'. But life goes on.
