Curiosity killed the cat. I know. But I was curious: how would a naïve implementation of the LU factorization fare against the built-in one, supported by level-1, -2, and -3 BLAS? The result surprised me: I was able to beat the built-in lu!(). Not by much, but still. (EDIT: Note well, the beaten version was the generic LU factorization, not the BLAS-supported one.)
The complete code is here. (Note: both versions of the factorization are without pivoting.)
That’s because lu(a, Val(false)) does not call into BLAS and instead uses the fallback generic_lufact!. Nevertheless, your naive code benchmarks faster than the LinearAlgebra generic fallback, so it would be interesting to see why and whether the stdlib version can be improved.
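One way to see which path a given call takes is to inspect the dispatch (a quick check of mine, using the old Val-based pivot argument from this thread):

```julia
using LinearAlgebra, InteractiveUtils

a = rand(100, 100)
@which lu!(a, Val(true))    # method backed by LAPACK.getrf! for BLAS floats
@which lu!(a, Val(false))   # method that falls back to generic_lufact!
```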
Tip: don’t benchmark the in-place versions directly; benchmark on a copy. Otherwise you will perform the next lu on the result of the last lu, with the possibility of hitting slow subnormals in the benchmark loop. The cost of creating a copy is negligible compared to the cost of the LU factorization.
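Concretely, the pattern looks like this (a sketch with BenchmarkTools; the evals=1 is my addition, to guarantee that every evaluation sees a fresh copy):

```julia
using BenchmarkTools, LinearAlgebra

a = rand(500, 500)

# Wrong: each run would factorize the output of the previous run,
# possibly drifting into slow subnormal numbers.
# @btime lu!($a, Val(false))

# Right: the setup provides a fresh copy of a for every sample.
@btime lu!(b, Val(false)) setup=(b = copy($a)) evals=1
```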
With the suggestion of @juthohaegeman (concerning the setup of the benchmark test), and in order to test the BLAS-supported solver (running @btime lu!(b, Val(true)) setup=(b=copy($a)), i.e. with pivoting enabled), I get the following graph:
The built-in generic_lufact! leaves some performance on the table in its update loop. The performance of the naïve implementation is matched (for larger numbers of equations) if this loop is replaced with:
```julia
# Update the rest
for j = k+1:n
    nAkj = -A[k,j]
    for i = k+1:m
        A[i,j] += A[i,k]*nAkj
    end
end
```
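For reference, a minimal sketch of the complete no-pivoting factorization around that loop (my reconstruction, not the code linked above) could look like this:

```julia
# In-place LU without pivoting: overwrites A with the unit lower
# triangle L (diagonal ones implicit) and the upper triangle U.
function naive_lu!(A::AbstractMatrix)
    m, n = size(A)
    @inbounds for k = 1:min(m, n)
        Akk = A[k,k]             # no pivoting: assumes A[k,k] is nonzero
        for i = k+1:m
            A[i,k] /= Akk        # multipliers, i.e. the k-th column of L
        end
        # Update the rest (the loop shown above)
        for j = k+1:n
            nAkj = -A[k,j]
            for i = k+1:m
                A[i,j] += A[i,k]*nAkj
            end
        end
    end
    return A
end
```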
When pivoting is turned off, generic_lufact! still allocates the pivot-position vector and goes through the motions of working with it. This costs some performance for small matrix sizes. I suppose that is the price of doing business with the LU factorization object, which expects the pivot vector during construction.
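A rough way to confirm the allocation (my check, same Val-based API):

```julia
using LinearAlgebra

a = rand(50, 50)
b = copy(a)
lu!(b, Val(false))             # warm up, so compilation is excluded
b = copy(a)
@allocated lu!(b, Val(false))  # nonzero even without pivoting:
                               # the pivot vector and LU wrapper are still allocated
```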
I guess there is a lesson here: for larger matrices, even if one is sure that no pivoting is necessary, it should not be turned off, because the call is then routed to generic_lufact!, with an attendant order-of-magnitude slowdown. For small matrices it may pay to avoid creating the factorization object and to use the naïve factorization instead.
It should probably be noted in the docstring that the routine without pivoting is not BLAS-backed, as this is a nontrivial performance gotcha. OpenBLAS is not very fast for small matrices; it would be interesting to see what MKL does here.
The other kind of gotcha in loops is also really important: we know that the write to A[i,j] does not alias the read of A[k,j], but the compiler doesn’t know that. At least that’s what I think is going on.
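For contrast, without the hoisted temporary the update loop has roughly this shape (paraphrasing the stdlib version), and the compiler must re-read A[k,j] on every inner iteration:

```julia
for j = k+1:n
    for i = k+1:m
        # A[k,j] is reloaded on each pass: the compiler cannot prove that
        # the store to A[i,j] never overwrites it.
        A[i,j] -= A[i,k]*A[k,j]
    end
end
```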
I have to admit I don’t know yet what to propose. One thing is the documentation, but there is also the possibility of code that selects the method based on the size of the problem.
The pivot=false option is basically only useful for pedagogy (comparing to hand calculations or toy code for tiny matrices), and should never be used in real problems because it can make the calculation numerically unstable. Improving its performance would serve no practical purpose.
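A classic illustration of that instability (my example, not from this thread): a tiny pivot element destroys the accuracy of the no-pivot factorization.

```julia
using LinearAlgebra

A = [1e-20 1.0;
     1.0   1.0]
b = A * [1.0, 1.0]                  # the exact solution is [1, 1]

x_nopivot = lu(A, Val(false)) \ b   # ≈ [0.0, 1.0]: catastrophic loss of accuracy
x_pivot   = lu(A, Val(true))  \ b   # ≈ [1.0, 1.0]
```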
I agree that the documentation should be clearer on the meaning of the pivot argument.
That is a fair point. But we shouldn’t miss the important part of the story: the performance of the BLAS-supported LU factorization is erratic for smaller matrix sizes, and in fact it is outperformed by the generic Julia version (both WITH PIVOTING).
With just a small change to the code (the “better” implementation of the generic LU factorization), the performance can be improved further.
The break-even point between lu! and generic_lufact! is around 200 equations. NB: for 10 equations or fewer, a static-array implementation may provide further speed improvements over the generic version of the factorization. (I don’t know whether a static array can be passed to the BLAS-supported solver.)
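For the static-array route, a sketch (assuming StaticArrays.jl, which ships its own LU for small fixed-size matrices) might be:

```julia
using LinearAlgebra, StaticArrays

A = @SMatrix rand(8, 8)   # the size is part of the type, so loops can unroll
b = @SVector rand(8)

F = lu(A)                 # StaticArrays' own factorization, no BLAS/LAPACK call
x = A \ b                 # likewise solved with the static-array method
```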
I think it might be of interest to compare with the MKL solver. If anyone has access to it, would you please run the code in the Testing LU gist on GitHub and post the generated graph?
EDIT: Additional results for complex matrices. Same computer as above.
Also be careful that benchmarking on small matrices is tricky (cache and branch-prediction effects), although that probably doesn’t explain why BLAS is so slow.