OpenBLAS + Flux bottlenecks?

There are a lot of ordering constraints, which makes Cholesky much harder to parallelize than gemm.
To elaborate, we can imagine solving for an upper triangular matrix blockwise, by dividing it into an M×M grid of blocks.
Before solving for anything else, we need to solve for the first diagonal block U_{1,1} = cholesky(S_{1,1}) – serially.
Now we can solve U_{1,m} = U_{1,1}' \ S_{1,m} in parallel for m = 2,...,M (the transpose because S = U'U).
Then we have to calculate U_{2,2} = cholesky(S_{2,2} - U_{1,2}'*U_{1,2}).*
We can then use it to solve U_{2,m} = U_{2,2}' \ S_{2,m} in parallel for m = 3,...,M.
And so on. As we work down the diagonal, the range m = i,...,M shrinks, so there are fewer and fewer blocks we can calculate in parallel.

In short, the problem is a trade-off in block size: if the blocks are too small, we spend all our time on communication between cores, because every diagonal block forces a synchronization point.
If the blocks are too big, we won't be able to utilize all our cores on those triangular solves for a substantial chunk of the factorization, or even all of it.

  • Actually, I’d have each thread calculate S_{m,m} - U_{1,m}'*U_{1,m} in parallel, but the individual Cholesky factorizations of the diagonal blocks still have to happen serially.
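The scheme above can be sketched in (serial) Julia; the comments mark which loops are the ones a parallel implementation could distribute across cores. This is a minimal illustration of the blocked algorithm, not how OpenBLAS actually organizes the computation, and `blocked_cholesky!`/`nb` are names I made up:

```julia
using LinearAlgebra

# Blockwise right-looking Cholesky, S = U'U, overwriting the upper
# triangle of S with U. `nb` is the block size.
function blocked_cholesky!(S::AbstractMatrix, nb::Integer)
    n = size(S, 1)
    for j in 1:nb:n
        jr = j:min(j + nb - 1, n)
        # Serial step: factor the (already downdated) diagonal block,
        # U[j,j] = cholesky(S[j,j]), in place.
        F = cholesky!(Symmetric(view(S, jr, jr)))
        Ujj = F.U
        for m in (last(jr) + 1):nb:n
            mr = m:min(m + nb - 1, n)
            # Off-diagonal blocks of this block row: U[j,m] = U[j,j]' \ S[j,m].
            # These solves are independent, so they could run in parallel
            # across m (a real implementation would use an in-place trsm).
            view(S, jr, mr) .= Ujj' \ view(S, jr, mr)
        end
        # Downdate the trailing submatrix: S[m1,m2] -= U[j,m1]' * U[j,m2].
        # These gemm-like updates are also independent across (m1, m2).
        for m1 in (last(jr) + 1):nb:n
            m1r = m1:min(m1 + nb - 1, n)
            for m2 in m1:nb:n
                m2r = m2:min(m2 + nb - 1, n)
                mul!(view(S, m1r, m2r),
                     view(S, jr, m1r)', view(S, jr, m2r), -1.0, 1.0)
            end
        end
    end
    return UpperTriangular(S)
end
```

Note that each pass through the outer loop contains exactly one serial diagonal factorization, which is the synchronization point discussed above.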

FWIW, I do see a fairly good performance improvement:

julia> using LinearAlgebra, BenchmarkTools

julia> S = randn(10_000, 12_000) |> x -> x * x'; # Wishart

julia> S2 = similar(S);

julia> BLAS.set_num_threads(1);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 1 thread
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     3.145 s (0.00% GC)
  median time:      3.148 s (0.00% GC)
  mean time:        3.148 s (0.00% GC)
  maximum time:     3.151 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> BLAS.set_num_threads(18);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 18 threads
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     377.574 ms (0.00% GC)
  median time:      379.544 ms (0.00% GC)
  mean time:        380.256 ms (0.00% GC)
  maximum time:     385.087 ms (0.00% GC)
  --------------
  samples:          14
  evals/sample:     1

julia> @benchmark copyto!($S2, $S) # copyto! is slow
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     79.912 ms (0.00% GC)
  median time:      80.027 ms (0.00% GC)
  mean time:        80.050 ms (0.00% GC)
  maximum time:     80.408 ms (0.00% GC)
  --------------
  samples:          63
  evals/sample:     1

However, these matrices are much larger than yours. If they’re 200x200:

julia> S = randn(200, 300) |> x -> x * x'; # Wishart

julia> S2 = similar(S);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 18 threads
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     92.795 μs (0.00% GC)
  median time:      142.325 μs (0.00% GC)
  mean time:        142.279 μs (0.00% GC)
  maximum time:     223.126 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> BLAS.set_num_threads(1);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 1 thread
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     76.469 μs (0.00% GC)
  median time:      77.018 μs (0.00% GC)
  mean time:        77.222 μs (0.00% GC)
  maximum time:     170.254 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

At that size I see no improvement at all; the single-threaded run is actually faster.

You could call `Symmetric` or `Hermitian` on the matrices before calling `cholesky` to skip the symmetry check.
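For example, using the same Wishart-style construction as above:

```julia
using LinearAlgebra

S = randn(200, 300) |> x -> x * x'  # Wishart, as above

# cholesky(S) on a plain Matrix first runs an ishermitian check over
# the whole matrix; wrapping in Symmetric asserts symmetry up front
# and skips that check.
C = cholesky(Symmetric(S))
```

The check is O(n^2) while the factorization is O(n^3), so the relative savings are largest for small matrices like these and shrink as n grows.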
