Hello all,
For the longest time, we haven’t been able to leverage the BLAS and LAPACK in Apple’s Accelerate, largely because it was LP64-only (32-bit integer indices, whereas Julia uses ILP64 BLAS on 64-bit systems) and carried a really old version of LAPACK. This didn’t matter much on Intel Macs, because one could use OpenBLAS, which is quite good, or MKL.
Of course, with Apple Silicon (which is now Tier-1), everything changed. Accelerate can offer much higher performance than OpenBLAS (at least as of right now). Listening to us (@staticfloat mostly), Apple introduced ILP64 as well as a modern LAPACK in macOS 13.3. @staticfloat updated AppleAccelerate.jl so that it could become a libblastrampoline backend. Upon doing that, we found an issue in pivoted Cholesky, which is fixed in macOS 13.4. Interestingly, using LBT’s overlay mechanism, we could actually override the buggy Accelerate version with one in LAPACK_jll.
As a result, AppleAccelerate.jl 0.4.0 was finally tagged and released. The performance difference is significant for certain matmuls on Apple Silicon:
julia> using LinearAlgebra
julia> peakflops(4096) # OpenBLAS
3.6024175318268243e11
julia> using AppleAccelerate
julia> peakflops(4096)
5.832806459434183e11
It also works on Intel Macs, giving marginally better performance in some cases. Naturally, all of this needs a lot more testing and experimenting, so please try it out.
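For anyone who wants to try it, here is a minimal sketch of how one might check which backend LBT is forwarding to and get a rough matmul number. It assumes Julia 1.7 or newer (for `BLAS.get_config()`), macOS 13.3 or newer, and that AppleAccelerate.jl is already installed. The matrix size `n` is an arbitrary choice, and the GFLOPS figure is only a rough estimate, not a careful benchmark.

```julia
using LinearAlgebra

# Backends that LBT forwards BLAS/LAPACK calls to right now
# (typically OpenBLAS on a stock Julia build).
println(BLAS.get_config())

# Loading the package is what registers Accelerate with LBT.
# Guarded so nothing is loaded on other platforms; assumes AppleAccelerate.jl is installed.
@static if Sys.isapple()
    using AppleAccelerate
    println(BLAS.get_config())   # Accelerate should now be listed
end

n = 2048                         # arbitrary size chosen for illustration
A = rand(n, n); B = rand(n, n)
C = A * B                        # warm-up run so compilation is not timed
t = @elapsed A * B
println("matmul: ", round(2n^3 / t / 1e9; digits = 1), " GFLOPS")
```

Running the timing part in a fresh session without `using AppleAccelerate` gives the OpenBLAS number to compare against.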
-viral