Alternate BLAS libraries?

Raf · July 13, 2019, 9:23am

I’m running some processor heavy stochastic simulations that are increasingly limited by BLAS matrix dot multiplications on 9*9 matrices, as I gradually add optimisations for everything else.

It seems that MKL (and maybe other BLAS libraries) can be significantly faster on intel CPUs than the OpenBlas that ships with julia, especially with smaller matrices.

But I have no experience using them with Julia. So my questions are:

Is MKL actually faster for this kind of problem?
How easy is it to swap Blas implementation?
Is there a simple guide somewhere?
Is using MKL.jl the recommended approach?
What are the downsides?

ChrisRackauckas · July 13, 2019, 10:35am

Avoiding BLAS is fastest in this size range. Use StaticArrays instead.

Raf · July 13, 2019, 11:01am

That doesn’t seem to hold for a 9*9 dot product, although it might in the simulation context.

julia> a1 = rand(9, 9)
9×9 Array{Float64,2}:

julia> a2 = rand(9, 9)
9×9 Array{Float64,2}:

julia> s1 = SArray{Tuple{9,9}}(a1)
9×9 SArray{Tuple{9,9},Float64,2,81}:

julia> s2 = SArray{Tuple{9,9}}(a2)
9×9 SArray{Tuple{9,9},Float64,2,81}:

julia> @btime dot($a1, $a2)
  18.505 ns (0 allocations: 0 bytes)
21.148770113250393

julia> @btime dot($s1, $s2)
  45.989 ns (0 allocations: 0 bytes)
21.148770113250404

ChrisRackauckas · July 13, 2019, 11:09am

Oh wow, interesting. I’m not sure why it’s slower at that size. I always remembered that if it’s <~100 static arrays would do better.

Raf · July 13, 2019, 11:12am

Yeah I thought the same thing, I’m not sure why its so much slower.

simeonschaub · July 13, 2019, 11:13am

This is interesting:

julia> @btime dot($a1, $a2)
  22.762 ns (0 allocations: 0 bytes)
18.782035065611815

julia> @btime dot($s1, $s2)
  70.213 ns (0 allocations: 0 bytes)
18.782035065611826

julia> @btime dot($(vec(a1)), $(vec(a2)))
  23.527 ns (0 allocations: 0 bytes)
18.782035065611815

julia> @btime dot($(vec(s1)), $(vec(s2)))
  15.570 ns (0 allocations: 0 bytes)
18.782035065611815

It seems like dot for SArrays is not optimized very well.

Raf · July 13, 2019, 11:23am

Yeah it could really be optimized a lot more. The performance crossover is between 5*5 and 6*6,
so at ~30 instead of ~100:

2 * 2 Array:
  8.122 ns (0 allocations: 0 bytes)
2 * 2 SArray:
  1.284 ns (0 allocations: 0 bytes)
3 * 3 Array:
  9.596 ns (0 allocations: 0 bytes)
3 * 3 SArray:
  2.439 ns (0 allocations: 0 bytes)
4 * 4 Array:
  9.134 ns (0 allocations: 0 bytes)
4 * 4 SArray:
  5.206 ns (0 allocations: 0 bytes)
5 * 5 Array:
  14.416 ns (0 allocations: 0 bytes)
5 * 5 SArray:
  9.755 ns (0 allocations: 0 bytes)
6 * 6 Array:
  14.514 ns (0 allocations: 0 bytes)
6 * 6 SArray:
  17.923 ns (0 allocations: 0 bytes)
7 * 7 Array:
  10.875 ns (0 allocations: 0 bytes)
7 * 7 SArray:
  25.465 ns (0 allocations: 0 bytes)
8 * 8 Array:
  12.833 ns (0 allocations: 0 bytes)
8 * 8 SArray:
  33.996 ns (0 allocations: 0 bytes)
9 * 9 Array:
  16.112 ns (0 allocations: 0 bytes)
9 * 9 SArray:
  46.050 ns (0 allocations: 0 bytes)
10 * 10 Array:
  18.774 ns (0 allocations: 0 bytes)
10 * 10 SArray:
  63.568 ns (0 allocations: 0 bytes)

simeonschaub · July 13, 2019, 11:27am

I’ve opened an issue on Github about this. It seems that for now though, calling dot(vec(s1), vec(s2)) works fine as a workaround.

Raf · July 13, 2019, 11:28am

But you have vec inside the $() so its not being timed?

simeonschaub · July 13, 2019, 11:31am

vec just reinterprets the underlying memory, so it usually doesn’t come with any overhead:

julia> @btime dot(vec($s1), vec($s2))
  15.572 ns (0 allocations: 0 bytes)
18.782035065611815

julia> @btime dot($(vec(s1)), $(vec(s2)))
  15.921 ns (0 allocations: 0 bytes)
18.782035065611815

Raf · July 13, 2019, 11:34am

Oh right. Looks like thats the fastest option then.

stevengj · July 13, 2019, 4:03pm

Looks like it is fixed now, so I would try it again with StaticArrays master.

dmbates · July 13, 2019, 6:33pm

Did you happen to try starting julia with the -O3 option? I’m not sure if it is relevant here but I have seen in other discussions that -O3 can affect how aggressively SIMD evaluation is used.

Ralph_Smith · July 14, 2019, 2:48am

I don’t know if it applies generally, but I find that these very short @btime values are unstable, e.g. for the 9x9 dot calls above I get apparently random draws from about [15,40]ns on one system for any decent algorithm (including a simple @simd for loop). So I would @btime a batch of them.

A thread with this title should not go by without a vote of encouragement to @Elrod, whose PaddedMatrices package (with help from its many friends and relatives), while yet young, is
shaping up to be in pole position for future Julia BLAS races. If I look at the small matrix-matrix multiplication problem (much harder than what is discussed above), it beats OpenBLAS and StaticArrays handily and is as good as MKL. Unlike StaticArrays, which chokes to death on large arrays, I find that PaddedMatrices matches the external BLAS libraries for large gemm. (I won’t claim generality since I’ve only tried it on Intel AVX2 systems - Chris shows even better results on AVX512.)

Elrod · July 14, 2019, 3:37am

I find that PaddedMatrices matches the external BLAS libraries for large gemm .

I wish I could take credit, but it should be exactly as fast because (for now), if M*K*N >= 104^3, it just uses Julia’s BLAS library (it calls LinearAlgebra.BLAS.gemm!). That was roughly the size at which MKL caught up on my system.

After fixing broadcasting and cleaning up/ refactoring the code, adding tests, and then registering it, I’ll play around again with improving performance at larger sizes via packing, multithreading, and maybe prefetching.
Prefetching is hard, because early tests suggested what works best varies a lot from I’ve system to another. Although the library has access to L1 cache sizes, so as a first test, I plan on fetching as soon as more L1 space becomes available.

The idea would be to steadily increase the size after which Julia’s linked BLAS is faster

kristoffer.carlsson · July 14, 2019, 6:05am

Not since 0.7 where the SLP optimization pass was included into -O2.

Raf · July 14, 2019, 7:55am

Padded matrices may be the best option.

I’ve been finding that my other optimizations are hard to replicate efficiently with static arrays. Optimizations to reduce main memory access, like shifting a matrix left one column and adding a new column from the much larger source array are really easy with a Matrix, but fiddly with an SMatrix. My first attempts are all a lot slower than just using a matrix and BLAS.

Maybe I need to get the hang of SArrays and think about the code generation a bit more.

Edit: being able to handle a whole range of arrays competetively is another key reason to use PaddedMatrices. I wont always be chosing the matrix sizes, other ecologists will be based on how big they need to be to represent a problem, and they might be 3*3 or 20*20.

Raf · July 14, 2019, 1:54pm

To round this discussion out: I’m going to install MKL.jl. In my context it seems easier to optimise mutable arrays than various kinds of static arrays (I didn’t realise PaddedArrays were also tuples underneath). I can’t get within 2x of copyto() and OpenBLAS dot multiply using Static or Padded Arrays, and I’ve spent longer on this than the original version.

It could be a function of how much I know about one method vs the other, but I’m thinking it’s not always easy to make something faster by throwing static arrays at it.

Elrod · July 14, 2019, 4:11pm

PaddedMatrices also offers a dynamically sized type (which just wraps a regular Julia array).
If you have any example of PaddedMatrices getting subpar performance, please let me know and I’ll try to fix it.

Raf · July 15, 2019, 12:39am

Need some docs

PaddedMatrices with DynamicPadedArray gave a 4x slowdown over MKL and a standard array. Mostly because Float32 dot multiplication is much slower. This could be caused by cache interactions with the larger array size, its not so easy to say exactly what is taking time just using the profiler.

The whole simulation takes 20% less time with MKL than OpenBLAS, and dot multiplication is a little over half of the time. So MKL is a lot faster than OpenBLAS for this task, and super easy to install.

Topic		Replies	Views
We can write an optimized BLAS library in pure Julia (please skip OP and jump to post 4) Numerics	17	13667	October 30, 2019
Performance details of StaticArray Performance	5	3135	June 7, 2018
[ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays Package Announcements performance , blas	32	5899	July 5, 2020
@inbounds: is the compiler now so smart that this is no longer necessary? Performance	33	3013	July 16, 2018
OpenBLAS is faster than Intel MKL on AMD Hardware (Ryzen) Performance blas , lapack	40	36747	June 19, 2020

Alternate BLAS libraries?

Related topics