Alternate BLAS libraries?

I’m running some processor heavy stochastic simulations that are increasingly limited by BLAS matrix dot multiplications on 9*9 matrices, as I gradually add optimisations for everything else.

It seems that MKL (and maybe other BLAS libraries) can be significantly faster on intel CPUs than the OpenBlas that ships with julia, especially with smaller matrices.

But I have no experience using them with Julia. So my questions are:

Is MKL actually faster for this kind of problem?
How easy is it to swap Blas implementation?
Is there a simple guide somewhere?
Is using MKL.jl the recommended approach?
What are the downsides?

1 Like

Avoiding BLAS is fastest in this size range. Use StaticArrays instead.

That doesn’t seem to hold for a 9*9 dot product, although it might in the simulation context.

julia> a1 = rand(9, 9)
9×9 Array{Float64,2}:

julia> a2 = rand(9, 9)
9×9 Array{Float64,2}:

julia> s1 = SArray{Tuple{9,9}}(a1)
9×9 SArray{Tuple{9,9},Float64,2,81}:

julia> s2 = SArray{Tuple{9,9}}(a2)
9×9 SArray{Tuple{9,9},Float64,2,81}:

julia> @btime dot($a1, $a2)
  18.505 ns (0 allocations: 0 bytes)
21.148770113250393

julia> @btime dot($s1, $s2)
  45.989 ns (0 allocations: 0 bytes)
21.148770113250404

Oh wow, interesting. I’m not sure why it’s slower at that size. I always remembered that if it’s <~100 static arrays would do better.

Yeah I thought the same thing, I’m not sure why its so much slower.

This is interesting:

julia> @btime dot($a1, $a2)
  22.762 ns (0 allocations: 0 bytes)
18.782035065611815

julia> @btime dot($s1, $s2)
  70.213 ns (0 allocations: 0 bytes)
18.782035065611826

julia> @btime dot($(vec(a1)), $(vec(a2)))
  23.527 ns (0 allocations: 0 bytes)
18.782035065611815

julia> @btime dot($(vec(s1)), $(vec(s2)))
  15.570 ns (0 allocations: 0 bytes)
18.782035065611815

It seems like dot for SArrays is not optimized very well.

1 Like

Yeah it could really be optimized a lot more. The performance crossover is between 5*5 and 6*6,
so at ~30 instead of ~100:

2 * 2 Array:
  8.122 ns (0 allocations: 0 bytes)
2 * 2 SArray:
  1.284 ns (0 allocations: 0 bytes)
3 * 3 Array:
  9.596 ns (0 allocations: 0 bytes)
3 * 3 SArray:
  2.439 ns (0 allocations: 0 bytes)
4 * 4 Array:
  9.134 ns (0 allocations: 0 bytes)
4 * 4 SArray:
  5.206 ns (0 allocations: 0 bytes)
5 * 5 Array:
  14.416 ns (0 allocations: 0 bytes)
5 * 5 SArray:
  9.755 ns (0 allocations: 0 bytes)
6 * 6 Array:
  14.514 ns (0 allocations: 0 bytes)
6 * 6 SArray:
  17.923 ns (0 allocations: 0 bytes)
7 * 7 Array:
  10.875 ns (0 allocations: 0 bytes)
7 * 7 SArray:
  25.465 ns (0 allocations: 0 bytes)
8 * 8 Array:
  12.833 ns (0 allocations: 0 bytes)
8 * 8 SArray:
  33.996 ns (0 allocations: 0 bytes)
9 * 9 Array:
  16.112 ns (0 allocations: 0 bytes)
9 * 9 SArray:
  46.050 ns (0 allocations: 0 bytes)
10 * 10 Array:
  18.774 ns (0 allocations: 0 bytes)
10 * 10 SArray:
  63.568 ns (0 allocations: 0 bytes)
1 Like

I’ve opened an issue on Github about this. It seems that for now though, calling dot(vec(s1), vec(s2)) works fine as a workaround.

2 Likes

But you have vec inside the $() so its not being timed?

vec just reinterprets the underlying memory, so it usually doesn’t come with any overhead:

julia> @btime dot(vec($s1), vec($s2))
  15.572 ns (0 allocations: 0 bytes)
18.782035065611815

julia> @btime dot($(vec(s1)), $(vec(s2)))
  15.921 ns (0 allocations: 0 bytes)
18.782035065611815

Oh right. Looks like thats the fastest option then.

Looks like it is fixed now, so I would try it again with StaticArrays master.

4 Likes

Did you happen to try starting julia with the -O3 option? I’m not sure if it is relevant here but I have seen in other discussions that -O3 can affect how aggressively SIMD evaluation is used.

I don’t know if it applies generally, but I find that these very short @btime values are unstable, e.g. for the 9x9 dot calls above I get apparently random draws from about [15,40]ns on one system for any decent algorithm (including a simple @simd for loop). So I would @btime a batch of them.

A thread with this title should not go by without a vote of encouragement to @Elrod, whose PaddedMatrices package (with help from its many friends and relatives), while yet young, is
shaping up to be in pole position for future Julia BLAS races. If I look at the small matrix-matrix multiplication problem (much harder than what is discussed above), it beats OpenBLAS and StaticArrays handily and is as good as MKL. Unlike StaticArrays, which chokes to death on large arrays, I find that PaddedMatrices matches the external BLAS libraries for large gemm. (I won’t claim generality since I’ve only tried it on Intel AVX2 systems - Chris shows even better results on AVX512.)

9 Likes

I find that PaddedMatrices matches the external BLAS libraries for large gemm .

I wish I could take credit, but it should be exactly as fast because (for now), if M*K*N >= 104^3, it just uses Julia’s BLAS library (it calls LinearAlgebra.BLAS.gemm!). That was roughly the size at which MKL caught up on my system.

After fixing broadcasting and cleaning up/ refactoring the code, adding tests, and then registering it, I’ll play around again with improving performance at larger sizes via packing, multithreading, and maybe prefetching.
Prefetching is hard, because early tests suggested what works best varies a lot from I’ve system to another. Although the library has access to L1 cache sizes, so as a first test, I plan on fetching as soon as more L1 space becomes available.

The idea would be to steadily increase the size after which Julia’s linked BLAS is faster

4 Likes

Not since 0.7 where the SLP optimization pass was included into -O2.

1 Like

Padded matrices may be the best option.

I’ve been finding that my other optimizations are hard to replicate efficiently with static arrays. Optimizations to reduce main memory access, like shifting a matrix left one column and adding a new column from the much larger source array are really easy with a Matrix, but fiddly with an SMatrix. My first attempts are all a lot slower than just using a matrix and BLAS.

Maybe I need to get the hang of SArrays and think about the code generation a bit more.

Edit: being able to handle a whole range of arrays competetively is another key reason to use PaddedMatrices. I wont always be chosing the matrix sizes, other ecologists will be based on how big they need to be to represent a problem, and they might be 3*3 or 20*20.

To round this discussion out: I’m going to install MKL.jl. In my context it seems easier to optimise mutable arrays than various kinds of static arrays (I didn’t realise PaddedArrays were also tuples underneath). I can’t get within 2x of copyto() and OpenBLAS dot multiply using Static or Padded Arrays, and I’ve spent longer on this than the original version.

It could be a function of how much I know about one method vs the other, but I’m thinking it’s not always easy to make something faster by throwing static arrays at it.

PaddedMatrices also offers a dynamically sized type (which just wraps a regular Julia array).
If you have any example of PaddedMatrices getting subpar performance, please let me know and I’ll try to fix it.

1 Like

Need some docs :wink:

PaddedMatrices with DynamicPadedArray gave a 4x slowdown over MKL and a standard array. Mostly because Float32 dot multiplication is much slower. This could be caused by cache interactions with the larger array size, its not so easy to say exactly what is taking time just using the profiler.

The whole simulation takes 20% less time with MKL than OpenBLAS, and dot multiplication is a little over half of the time. So MKL is a lot faster than OpenBLAS for this task, and super easy to install.

1 Like