There are a lot of ordering constraints, making it much harder to parallelize than gemm.
To elaborate, we can imagine solving for an upper triangular factor blockwise, by dividing the matrix into `M` block rows and columns.
Before solving for anything else, we need to factor the first diagonal block, `U_{1,1} = cholesky(S_{1,1})` – serially.
Now we can solve `U_{1,m} = U_{1,1}' \ S_{1,m}` in parallel for `m = 2,...,M`.
Then we have to calculate `U_{2,2} = cholesky(S_{2,2} - U_{1,2}'*U_{1,2})`, which you can then use for solving `U_{2,m} = U_{2,2}' \ (S_{2,m} - U_{1,2}'*U_{1,m})` in parallel for `m = 3,...,M`.
Etc. As we finish more and more diagonal blocks, the range `m = i,...,M` gets shorter, and there aren’t many blocks left we can calculate in parallel.
In short, the problem is that if the blocks are too small, we spend all our time on communication between cores, because every one of these diagonal blocks forces a synchronization point.
If the blocks are too big, we won’t be able to utilize all our cores on those triangular solves for a substantial chunk, or even all, of the factorization.
- Actually, I’d have each thread calculate `S_{m,m} - U_{1,m}'*U_{1,m}` in parallel, but the individual `cholesky` factorizations of the diagonal blocks still have to happen serially.
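To make the dependency structure concrete, here is a minimal sketch of that blocked schedule in Julia. This is illustrative only, not how OpenBLAS/MKL actually implement `potrf`; `blocked_cholesky!` and `blocksize` are names I made up for the example:

```julia
using LinearAlgebra

# Sketch of the blocked schedule above (illustrative, not the actual
# BLAS/LAPACK implementation). Overwrites the upper triangle of S with U
# such that S == U'U, working over block rows of width `blocksize`.
function blocked_cholesky!(S::AbstractMatrix, blocksize::Int = 64)
    N = size(S, 1)
    ranges = [r:min(r + blocksize - 1, N) for r in 1:blocksize:N]
    M = length(ranges)
    for i in 1:M
        ri = ranges[i]
        # U_{i,i} = cholesky(S_{i,i}): serial; every later step depends on it.
        cholesky!(Symmetric(view(S, ri, ri)))
        Uii = UpperTriangular(view(S, ri, ri))
        # U_{i,m} = U_{i,i}' \ S_{i,m}: independent triangular solves, parallel.
        Threads.@threads for m in (i + 1):M
            ldiv!(Uii', view(S, ri, ranges[m]))
        end
        # Schur updates S_{m,n} -= U_{i,m}'*U_{i,n}: the output blocks are
        # disjoint across m, so these can run in parallel too.
        Threads.@threads for m in (i + 1):M
            rm = ranges[m]
            for n in m:M
                rn = ranges[n]
                mul!(view(S, rm, rn), view(S, ri, rm)', view(S, ri, rn),
                     -one(eltype(S)), one(eltype(S)))
            end
        end
    end
    return UpperTriangular(S)
end
```

If you experiment with something like this, call `BLAS.set_num_threads(1)` first, so the BLAS calls inside each task don’t fight Julia’s threads for cores. The `blocksize` parameter is exactly the trade-off above: too small and the serial `cholesky!` calls plus synchronization dominate; too large and the parallel loops have too few blocks to keep every core busy.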
FWIW, I do see a fairly good performance improvement:
```julia
julia> using LinearAlgebra, BenchmarkTools

julia> S = randn(10_000, 12_000) |> x -> x * x'; # Wishart

julia> S2 = similar(S);

julia> BLAS.set_num_threads(1);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 1 thread
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     3.145 s (0.00% GC)
  median time:      3.148 s (0.00% GC)
  mean time:        3.148 s (0.00% GC)
  maximum time:     3.151 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> BLAS.set_num_threads(18);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 18 threads
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     377.574 ms (0.00% GC)
  median time:      379.544 ms (0.00% GC)
  mean time:        380.256 ms (0.00% GC)
  maximum time:     385.087 ms (0.00% GC)
  --------------
  samples:          14
  evals/sample:     1

julia> @benchmark copyto!($S2, $S) # copyto! is slow
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     79.912 ms (0.00% GC)
  median time:      80.027 ms (0.00% GC)
  mean time:        80.050 ms (0.00% GC)
  maximum time:     80.408 ms (0.00% GC)
  --------------
  samples:          63
  evals/sample:     1
```
However, these matrices are much larger than yours. If they’re 200x200:
```julia
julia> S = randn(200, 300) |> x -> x * x'; # Wishart

julia> S2 = similar(S);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 18 threads
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     92.795 μs (0.00% GC)
  median time:      142.325 μs (0.00% GC)
  mean time:        142.279 μs (0.00% GC)
  maximum time:     223.126 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> BLAS.set_num_threads(1);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 1 thread
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     76.469 μs (0.00% GC)
  median time:      77.018 μs (0.00% GC)
  mean time:        77.222 μs (0.00% GC)
  maximum time:     170.254 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
At this size, I also see very little improvement; if anything, the single-threaded run is faster.
You could call `Symmetric` or `Hermitian` on the matrices before calling `cholesky` to skip the symmetry check.
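For example (`A` and `S` here are fresh illustrative names, not the matrices benchmarked above):

```julia
using LinearAlgebra

A = randn(200, 300)
S = A * A'  # symmetric positive definite

cholesky(S)             # runs an ishermitian check on S before factorizing
cholesky(Symmetric(S))  # skips the check; only the upper triangle is read
cholesky(Hermitian(S))  # same idea; use this for complex matrices
```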