Can somebody here help me understand why in this code solving with the L in A = LL^T is an order of magnitude slower than just solving using A? Here is a MWE:
using BenchmarkTools
if !(@isdefined x) # because I lazily re-run in a REPL.
const A = randn(5_000, 5_000)
const Mf = cholesky!(Symmetric(A'A))
end
# With:
# BLAS.get_num_threads() == 4
# BLAS.vendor() == :mkl
# versioninfo() == (v1.6.2, official binary, etc)
print("Full solve:") # 5.050 ms, 0 alloc.
@btime ldiv!($Mf, x) (setup=x=randn(5_000))
print("Cholesky factor solve:") # 59.8 ms, 3 alloc, 190 MiB.
@btime ldiv!($Mf.L, x) (setup=x=randn(5_000))
At larger matrix sizes, the difference is even more pronounced. For a 22k x 22k matrix, the differences was 0.1 seconds vs 1.5 seconds. Looking at the which(LinearAlgebra.ldiv!, ...) outputs, it looks like with the triangular Mf.L Julia is calling trtrs!, whereas for Mf it is calling potrs!. I am sure that potrs! is doing something smarter than just two sequential triangular solves, but it seems a little surprising that this ldiv! makes big allocations and is that much slower. Intuitively, shouldn’t the solve with a triangular factor be faster, if anything?
If it matters, I have an intel CPU that does use AVX-512, and like the snippet above says, I use MKL with 4 BLAS threads. From looking around, I see other posts on the forum that say that trtrs! is slower by a factor of two or three from other triangular solvers, but I haven’t seen this particular issue. Apologies if this is a rehash of a known issue or discussion.
EDIT:
Okay, there is definitely something confusing happening here. I deleted this topic because at first I think I had just made a mistake with my BenchmarkTools interpolation. When I did
@btime ldiv!($(Mf.L), x) (setup=x=randn(5000))
then the times looked reasonable. But the issue persists when I just do
const _x = randn(5000)
@time Mf.L\x # 0.05 seconds
@time Mf\x # 0.005 seconds
So it’s still an order of magnitude difference in this less careful benchmark. I honestly am not sure anymore if this is just a weird benchmarking issue or if trtrs! vs potrs! is actually causing a problem.
So many edits. Sorry everyone.