I’m running some processor-heavy stochastic simulations that are increasingly limited by BLAS matrix dot multiplications on 9×9 matrices as I gradually add optimisations for everything else.
It seems that MKL (and maybe other BLAS libraries) can be significantly faster on Intel CPUs than the OpenBLAS that ships with Julia, especially with smaller matrices.
But I have no experience using them with Julia. So my questions are:
Is MKL actually faster for this kind of problem?
How easy is it to swap BLAS implementations?
Is there a simple guide somewhere?
Is using MKL.jl the recommended approach?
What are the downsides?
Did you happen to try starting julia with the -O3 option? I’m not sure if it is relevant here but I have seen in other discussions that -O3 can affect how aggressively SIMD evaluation is used.
I don’t know if it applies generally, but I find that these very short @btime values are unstable, e.g. for the 9×9 dot calls above I get apparently random draws from roughly 15–40 ns on one system for any decent algorithm (including a simple @simd for loop). So I would @btime a batch of them.
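To make that concrete, here is roughly how I would batch the measurement, using BenchmarkTools and plain `Matrix` multiplication (the batch count and sizes are just illustrative):

```julia
using BenchmarkTools, LinearAlgebra

# Time a batch of 9×9 multiplications so a single ~20 ns call
# isn't lost in measurement noise.
As = [rand(9, 9) for _ in 1:1000]
Bs = [rand(9, 9) for _ in 1:1000]
Cs = [zeros(9, 9) for _ in 1:1000]

function matmul_batch!(Cs, As, Bs)
    for i in eachindex(Cs)
        mul!(Cs[i], As[i], Bs[i])   # in-place gemm for each 9×9 pair
    end
    return Cs
end

@btime matmul_batch!($Cs, $As, $Bs)   # one sample = 1000 multiplications
```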
A thread with this title should not go by without a vote of encouragement to @Elrod, whose PaddedMatrices package (with help from its many friends and relatives), while yet young, is shaping up to be in pole position for future Julia BLAS races. If I look at the small matrix-matrix multiplication problem (much harder than what is discussed above), it beats OpenBLAS and StaticArrays handily and is as good as MKL. Unlike StaticArrays, which chokes to death on large arrays, I find that PaddedMatrices matches the external BLAS libraries for large gemm. (I won’t claim generality since I’ve only tried it on Intel AVX2 systems - Chris shows even better results on AVX512.)
> I find that PaddedMatrices matches the external BLAS libraries for large gemm.
I wish I could take credit, but it should be exactly as fast because, for now, if `M*K*N >= 104^3` it just uses Julia’s BLAS library (it calls `LinearAlgebra.BLAS.gemm!`). That was roughly the size at which MKL caught up on my system.
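Not the actual PaddedMatrices source, but the fallback described above amounts to something like this sketch, with `small_gemm!` standing in for the library’s own small-matrix kernel:

```julia
using LinearAlgebra

# Sketch: hand big problems to the linked BLAS, keep small ones in the
# custom kernel. The 104^3 threshold is the one mentioned above.
function my_mul!(C, A, B)
    M, K = size(A)
    N = size(B, 2)
    if M * K * N >= 104^3
        LinearAlgebra.BLAS.gemm!('N', 'N', one(eltype(C)), A, B, zero(eltype(C)), C)
    else
        small_gemm!(C, A, B)   # hypothetical placeholder for the small-matrix path
    end
    return C
end
```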
After fixing broadcasting and cleaning up/refactoring the code, adding tests, and then registering it, I’ll play around again with improving performance at larger sizes via packing, multithreading, and maybe prefetching.
Prefetching is hard, because early tests suggested that what works best varies a lot from one system to another. The library does have access to L1 cache sizes, though, so as a first test I plan on prefetching as soon as more L1 space becomes available.
The idea would be to steadily increase the size above which Julia’s linked BLAS is faster.
I’ve been finding that my other optimizations are hard to replicate efficiently with static arrays. Optimizations to reduce main memory access, like shifting a matrix left one column and adding a new column from the much larger source array, are really easy with a `Matrix` but fiddly with an `SMatrix`. My first attempts are all a lot slower than just using a `Matrix` and BLAS.
Maybe I need to get the hang of SArrays and think about the code generation a bit more.
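For reference, the column-shift update I mean is trivial to do in place with a plain `Matrix` (the names `A`, `source` and `j` are just for illustration):

```julia
# Shift the working matrix left by one column and pull in a fresh column
# from the much larger source array, without allocating.
function shift_and_append!(A::Matrix, source::AbstractMatrix, j::Integer)
    n = size(A, 2)
    for k in 1:n-1
        @views A[:, k] .= A[:, k+1]      # move each column one place left
    end
    @views A[:, n] .= source[:, j]       # append column j of the source
    return A
end
```

With an `SMatrix` the equivalent has to build a whole new immutable value each step, which is where it gets fiddly.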
Edit: being able to handle a whole range of array sizes competitively is another key reason to use PaddedMatrices. I won’t always be choosing the matrix sizes; other ecologists will, based on how big they need to be to represent a problem, and they might be 3×3 or 20×20.
To round this discussion out: I’m going to install MKL.jl. In my context it seems easier to optimise mutable arrays than various kinds of static arrays (I didn’t realise PaddedArrays were also tuples underneath). I can’t get within 2x of `copyto!` and OpenBLAS dot multiply using Static or Padded Arrays, and I’ve spent longer on this than on the original version.
It could be a function of how much I know about one method vs the other, but I’m thinking it’s not always easy to make something faster by throwing static arrays at it.
PaddedMatrices also offers a dynamically sized type (which just wraps a regular Julia array).
If you have any example of PaddedMatrices getting subpar performance, please let me know and I’ll try to fix it.
PaddedMatrices with DynamicPaddedArray gave a 4x slowdown over MKL and a standard array, mostly because Float32 dot multiplication is much slower. This could be caused by cache interactions with the larger (padded) array size; it’s not so easy to say exactly what is taking the time just using the profiler.
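A quick check along these lines (just a sketch) would separate the element-type effect from the padding effect, by timing plain `Matrix` gemm for both types first:

```julia
using BenchmarkTools, LinearAlgebra

# Compare unpadded 9×9 multiplication for Float32 vs Float64 to see how much
# of the slowdown is the element type rather than the padded storage.
for T in (Float32, Float64)
    A, B = rand(T, 9, 9), rand(T, 9, 9)
    C = zeros(T, 9, 9)
    print(T, ": ")
    @btime mul!($C, $A, $B)
end
```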
The whole simulation takes 20% less time with MKL than with OpenBLAS, and dot multiplication is a little over half of the total run time. So MKL is a lot faster than OpenBLAS for this task, and super easy to install.
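For anyone landing here later, the install really is short. A minimal sketch, assuming a recent MKL.jl (on Julia 1.7+ it swaps the BLAS backend through libblastrampoline when loaded):

```julia
using Pkg
Pkg.add("MKL")

using MKL                          # loading MKL.jl switches the active BLAS to MKL
using LinearAlgebra
LinearAlgebra.BLAS.get_config()    # should now list an MKL library instead of OpenBLAS
```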