Why does the time taken by LAPACK.potrf! increase hugely when the dimension exceeds a certain number?

When I use LAPACK.potrf! to do a Cholesky decomposition, I run into the following issue.

using StaticArrays
using TimerOutputs
using LinearAlgebra
function test3_1(Quu)
    @timeit "LAPACK.potrf!" LAPACK.potrf!('U',Quu)
end
n = 16
Quu14 = rand(n,n);
Quu14 = (Quu14 + Quu14') + Diagonal(ones(n)*n);  # symmetric positive definite test matrix
K = zeros(n,30); d = zeros(n);
reset_timer!()
for i in 1:20
    Quu_ = deepcopy(Quu14)
    @timeit "test3_1" test3_1(Quu_)
end
show(TimerOutputs.get_defaulttimer());

When n = 16, LAPACK.potrf! takes only about 1 µs;
but if I set n = 17, the time consumed by LAPACK.potrf! jumps hugely, to about 80 µs.

Why? I noticed that 16*16 = 256, so maybe this is related to the amount of memory LAPACK uses? My computer has 16 GB of RAM and runs 64-bit Windows 10.
Maybe I did something wrong; how can I fix it?

Thank you very much.

I don’t think you did anything wrong, but it’s important to keep in mind that LAPACK is optimized for solving larger problems efficiently, using blocking that enables the use of efficient level 3 BLAS routines. The factorization routines first block at the factorization level based on an estimated optimal block size. I think that block size is larger than 16, so that’s probably not what you are observing. For problems smaller than the optimal block size, dpotrf calls dpotrf2, which is nominally a fallback based on level 2 BLAS, but it operates recursively and leverages some level 3 BLAS routines for the block updates associated with the recursion. Those level 3 BLAS routines will potentially do their own blocking, depending on what they deem optimal. Figuring out how all of that impacts run times on small matrices is not easy, but there are definitely matrix sizes at which different code paths kick in, so I would expect some jumps in run time.
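
If you want to see where those jumps happen on your own machine, here is a minimal sketch (assuming the BenchmarkTools package is installed; the size range is arbitrary) that times potrf! over a range of small sizes:

using LinearAlgebra, BenchmarkTools

# Sweep a range of sizes and time potrf! on a fresh copy each run.
for n in 12:2:40
    A = rand(n, n); A = A + A' + n * I   # symmetric positive definite
    t = @belapsed LAPACK.potrf!('U', B) setup=(B = copy($A)) evals=1
    println("n = $n: ", round(t * 1e6, digits = 2), " μs")
end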

As great as LAPACK is, it’s really not optimized for small matrices. Even a really simple-minded Cholesky routine that could be thrown together from a numerical analysis text such as

function basic_cholesky!(A)
  n = size(A,1)
  for j in 1:n
    # Diagonal entry: subtract the squared norm of the column above it.
    x = zero(eltype(A))
    for k in 1:j-1
      x += conj(A[k,j])*A[k,j]
    end
    A[j,j] = sqrt(A[j,j] - x)
    # Remaining entries of row j of the upper triangular factor.
    for i in j+1:n
      x = zero(eltype(A))
      for k in 1:j-1
        x += conj(A[k,j])*A[k,i]
      end
      A[j,i] = (A[j,i] - x)/A[j,j]
    end
  end
  UpperTriangular(A)
end

beats potrf! on my machine for n=16 with your setup and doesn’t seem to have much in the way of notable jumps in run times as n is increased. If you really want to factor lots of small matrices and you are worried about performance, you might want to consider something simpler than LAPACK. Of course if this is just an artificial test on small matrices, and you really want to factor larger dense matrices, LAPACK is definitely the way to go.
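
As a quick sanity check, something like the following (using the same kind of random SPD matrix as in your setup) confirms that the simple routine agrees with the built-in cholesky:

using LinearAlgebra
n = 16
A = rand(n, n); A = A + A' + n * I   # symmetric positive definite
U1 = basic_cholesky!(copy(A))        # simple routine, factors the copy in place
U2 = cholesky(Symmetric(A)).U        # LAPACK-backed factorization
U1 ≈ U2                              # should be true up to rounding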

That’s great, thank you very much.
By the way, I found the similar function ‘_chol!’ in Julia’s LinearAlgebra, but it calls LAPACK.potrf! directly through multiple dispatch for Matrix{Float64}. So would it be better to do some optimization inside Julia’s LinearAlgebra before calling LAPACK for small matrices?

Since Julia (like Python) does not have enforced private/public attributes, a leading underscore conventionally marks a function or variable as private. Do not use those methods directly; they may break without warning between versions.

You can use LinearAlgebra.cholesky which returns nice structures.

julia> n=16; A = Symmetric(rand(16,16) + n*I);

julia> cholesky(A)
Cholesky{Float64, Matrix{Float64}}
U factor:
[...]

Ultimately the LAPACK routine will be called though. When I benchmark, msteward’s implementation is about 20% faster on 16x16 matrices than the built-in potrf!.

For matrices that small, you may want to consider using StaticArrays, which has its own Cholesky implementation. That should speed things up considerably.
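
A minimal sketch of that approach (assuming the matrix size is fixed and known at compile time, which is what makes StaticArrays fast):

using StaticArrays, LinearAlgebra

A = @SMatrix rand(16, 16)
S = Symmetric(A * A' + 16 * I)    # small symmetric positive definite static matrix
C = cholesky(S)                   # dispatches to StaticArrays' own Cholesky
C.U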

As @skleinbo suggested, you shouldn’t use _chol!. I played around with this a little bit. On my machine, cholesky (not cholesky!) from StaticArrays.jl is about 20% faster than the naive code (revised) for n ≤ 24. For n > 24, StaticArrays.jl appears to call LAPACK with some overhead that makes it the slowest of the three. LAPACK starts to win for n > 110 or so. So what is best is fairly size dependent.

If it makes sense to use it (feel free), I did make a few refinements to my simple code so it’s a little more robust and a little faster:

using LinearAlgebra

function basic_cholesky!(H::Hermitian{T,<:AbstractMatrix{T}}) where {T}
  A = H.data
  H.uplo == 'U' ||
    error("Hermitian matrix should be in upper triangular storage.")
  @inbounds for i in axes(A, 1)
    x = zero(T)
    @simd for k = first(axes(A, 1)):(i - 1)
      x += conj(A[k, i]) * A[k, i]
    end
    s = H[i, i] - x # Use H[i,i] since it will be real.
    real(s) > 0 || throw(PosDefException(i))
    A[i, i] = sqrt(s)
    for j = (i + 1):last(axes(A, 2))
      x = zero(T)
      @simd for k = first(axes(A, 1)):(i - 1)
        x += conj(A[k, i]) * A[k, j]
      end
      A[i, j] = (A[i, j] - x) / A[i, i]
    end
  end
  UpperTriangular(A)
end

Still pretty basic and not solidly tested, but perhaps a little better. (No warranty.) It would be nice to index into H directly in most places, so it wouldn’t matter whether upper or lower triangular storage was used, but that was a bit slower.
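
A quick usage sketch, assuming an upper-stored Hermitian wrapper around a positive definite matrix:

using LinearAlgebra
n = 16
M = rand(n, n); M = M + M' + n * I      # symmetric positive definite
M0 = copy(M)                            # keep an untouched copy for comparison
U = basic_cholesky!(Hermitian(M, :U))   # factors in place, returns UpperTriangular
U ≈ cholesky(Hermitian(M0, :U)).U       # agrees with LAPACK up to rounding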
