Can somebody here help me understand why in this code solving with the L in A = LL^T is an order of magnitude slower than just solving using A? Here is a MWE:
using BenchmarkTools
if !(@isdefined x) # because I lazily re-run in a REPL.
const A = randn(5_000, 5_000)
const Mf = cholesky!(Symmetric(A'A))
end
# With:
# BLAS.get_num_threads() == 4
# BLAS.vendor() == :mkl
# versioninfo() == (v1.6.2, official binary, etc)
print("Full solve:") # 5.050 ms, 0 alloc.
@btime ldiv!($Mf, x) (setup=x=randn(5_000))
print("Cholesky factor solve:") # 59.8 ms, 3 alloc, 190 MiB.
@btime ldiv!($Mf.L, x) (setup=x=randn(5_000))
At larger matrix sizes, the difference is even more pronounced. For a 22k x 22k matrix, the differences was 0.1 seconds vs 1.5 seconds. Looking at the which(LinearAlgebra.ldiv!, ...)
outputs, it looks like with the triangular Mf.L
Julia is calling trtrs!
, whereas for Mf
it is calling potrs!
. I am sure that potrs!
is doing something smarter than just two sequential triangular solves, but it seems a little surprising that this ldiv!
makes big allocations and is that much slower. Intuitively, shouldn’t the solve with a triangular factor be faster, if anything?
If it matters, I have an intel CPU that does use AVX-512, and like the snippet above says, I use MKL with 4 BLAS threads. From looking around, I see other posts on the forum that say that trtrs!
is slower by a factor of two or three from other triangular solvers, but I haven’t seen this particular issue. Apologies if this is a rehash of a known issue or discussion.
EDIT:
Okay, there is definitely something confusing happening here. I deleted this topic because at first I think I had just made a mistake with my BenchmarkTools interpolation. When I did
@btime ldiv!($(Mf.L), x) (setup=x=randn(5000))
then the times looked reasonable. But the issue persists when I just do
const _x = randn(5000)
@time Mf.L\x # 0.05 seconds
@time Mf\x # 0.005 seconds
So it’s still an order of magnitude difference in this less careful benchmark. I honestly am not sure anymore if this is just a weird benchmarking issue or if trtrs!
vs potrs!
is actually causing a problem.
So many edits. Sorry everyone.