[ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays

PaddedMatrices is an experimental and not well tested. Issues/PRs/comments are all welcome.

The package is registered, so you can install via ] add PaddedMatrices.

More functionality is coming in the future, but the two main things it does now:

  1. Julia BLAS.
  2. Optionally sized arrays, dispatching on the Julia BLAS by default.

Initializing Arrays

You can create random FixedSizeArrays via the @FixedSize macro. This macro currently only accepts rand, and randn for generating Float64 elements.

Creating random arrays
julia> using PaddedMatrices

julia> A = @FixedSize rand(8,8)
8×8 FixedSizeArray{Tuple{8,8},Float64,2,Tuple{1,8},71} with indices 1:1:8×1:1:8:
 0.683092  0.0519738    0.123364  0.894025    0.13376   0.148439  0.604513  0.689693
 0.318745  0.751111     0.214771  0.221637    0.782232  0.329274  0.210776  0.151637
 0.691374  0.0600692    0.317934  0.25582     0.881472  0.734963  0.603998  0.855192
 0.674639  0.717255     0.60677   0.343281    0.119153  0.340493  0.272141  0.202147
 0.992593  0.0692194    0.256404  0.922214    0.129515  0.473576  0.863022  0.749924
 0.115591  0.764342     0.484487  0.76594     0.970529  0.474724  0.5216    0.346497
 0.214091  0.931094     0.622827  0.00131706  0.100016  0.103876  0.27451   0.912466
 0.82846   0.000283631  0.784817  0.716793    0.711505  0.336038  0.723893  0.824162

julia> x = @FixedSize rand(3)
3-element FixedSizeArray{Tuple{3},Float64,1,Tuple{1},15} with indices 1:1:3:
 0.09671375198196175
 0.5290890298633636
 0.2891468539564671

julia> Y = @FixedSize randn(2,3,4)
2×3×4 FixedSizeArray{Tuple{2,3,4},Float64,3,Tuple{1,2,6},31} with indices 1:1:2×1:1:3×1:1:4:
[:, :, 1] =
 0.162202  -0.122953  0.268173
 1.92397    1.85739   0.738878

[:, :, 2] =
 -0.791952  -0.181461  -0.515349
  0.160009  -0.827614   1.59702

[:, :, 3] =
 1.17112   -0.902177  -0.949055
 0.863118  -1.24432    0.94861

[:, :, 4] =
 -0.914365  -0.681319  0.428368
 -1.93309   -0.552553  0.859877

These RNGs should be relatively fast. You can set the seed with PaddedMatrices.VectorizedRNG.seed!, which is independent of Random.seed!. Setting the seed is not thread-safe (it effects all threads).

Benchmarking small, fixed size matrix multiplication

Small, fixed-size, array benchmarks, comparing with SMatrix and MMatrix from StaticArrays.jl.
PtrArray is the fastest at all sizes, but SMatrix is faster than the FixedSizeArray for 2x2, 3x3 and 8x8 matrices on my computer, while MMatrix is only faster for 3x3 matrices (but not faster than PaddedArray). Starting at 6x6, MMatrix is slower than the jmul! method applied to regular-old Array{Float64,2}s.


When creating FixedSizeArrays, they are padded by default. This is to align their columns in memory, which makes indexing the top of their columns faster, improving performance slightly.
The PtrArray type has lower overhead, for some reason. The FixedSizeArray is a mutable struct that holds an NTuple (the memory), and an aligned pointer to that memory. PtrArray is just a struct holding the pointer; any use must be protected with GC.@preserve memory.

Beyond 8x8, multiplying regular (dynamically sized) Matrix{Float64} using PaddedMatrices.jmul! was faster than SMatrix and MMatrix multiplication. Of course, the StaticArrays documentation recommends them primarily for 10x10 arrays and smaller. Their compile times regress significantly beyond that, even more than the runtimes.

Worth pointing out that a low-hanging fruit for StaticArrays performance is to start using muladd. The SMatrix type also has the advantage of getting the best performance out of a more convenient, non-mutating, API. The others were all benchmarked with in-place variants.

Results probably look different depending on instruction set.

Large, dynamically-sized array multiplication (single-threaded)

These benchmarks all used the base Matrix{T} type, comparing OpenBLAS 0.3.9, Intel MKL through ccall, Gaius.jl and PaddedMatrices.jmul!.
Float64 benchmarks:


OpenBLAS had the best performance for large arrays, while MKL had the best performance between 40x40 and 200x200.
Once the arrays no longer fit in the L2 cache (beyond the 200x200), PaddedMatrices’ performance dropped, and it never really recovered. This suggests the library still needs work on how it handles memory bandwidth.

It implements this algorithm (image belongs to the BLIS team):

You can see the default tiling parameters for your architecture via

julia> mc, kc, nc = PaddedMatrices.matmul_params(Float64, Float64, Float64)
(200, 300, 600)

Arguments are the element types of C, A, and B, respectively, for PaddedMatrices.jmul!(C, A, B) (meaning C = A * B).
The three return values are the size of outer-loop increments for the M, K, and N loops, respectively, where C is M x N, A is M x K, and B is K x N.
That is, it will work with mc x kc tiles of A and kc x nc tiles of B at a time, to calculate mc x nc tiles of C.
These sizes are meant to both (a) let the A tiles fit in L2 cache and the B tiles in L3, following the above diagram, and (b) be integer multiples of the dimension of the micro-kernels.

If you would like to play around with different values, you can call

PaddedMatrices.jmul!(C, A, B, Val(mc), Val(kc), Val(nc))

But be careful – you’ll likely get corrupted answers or a crash if you make the tiles of A and B much larger than what PaddedMatrices.matmul_params returns.
You can get the size of the micro kernel via:

julia> (PaddedMatrices.mᵣ * PaddedMatrices.VectorizationBase.pick_vector_width(Float64), PaddedMatrices.nᵣ)
(40, 5)

This means (on this architecture) that you’d want to keep mc a multiple of 40, and nc a multiple of 5.

Gaius.jl (which doesn’t yet use the same algorithm) uses a “cache oblivious” recursive algorithm, but unfortunately cache (and TLB)-aware tiling algorithms are needed to keep the memory-boogeyman at bay.

Using Float32, OpenBLAS and MKL are neck and neck for large arays, while MKL still dominates at the more moderate sizes:

Interestingly, PaddedMatrices gets a little over twice the GFLOPS with single precision (instead of just twice).

Those BLAS libraries don’t support integer multiplication, so that provides an area where where we have a lot to gain. Benchmarking vs LinearAlgebra’s generic matmul for Int64 (note the mislabled y-axis, it should be GIOPS, because these aren’t floating point operations):


We get a lot less operations per second out of Int64 than Float64, primarilly because SIMD Int64 multiplication is very slow. On AVX512 architectures, like the computer where I ran these benchmarks, it is implemented using the vpmullq instruction, which has at best 1/3 the throughput as fma.
Without AVX512, but with AVX2, it’s multiplied using Int32 multiplication, shifts, and additions.
Without AVX2, you don’t have SIMD integer operations.

Int32 does better:


But is still slower than Float64. Note that Float64 can exactly represent every integer from -(2^53) to 2^53, which is better than Int32, which can only represent -(2^31) to 2^31-1.

So maybe integer matrix multiplication isn’t that useful. Still, it’s nice to see the performance you can gain vs a more naive implementations.

Performance relative to OpenBLAS/MKL will likely be a little worse on AVX2, because PaddedMatrices is currently using prefetch instructions to help with cache latency with AVX512, but the way I added calls to them resulted in using an excessive amount with AVX2 to the point that performance was worse than not using any.

Large, dynamically-sized array multiplication (multi-threaded)

There is a function PaddedMatrices.jmult!, but performance is bad at the moment.
I had it spawn a new task for every tile of A and B. This is too few tasks to get good utilization out of a many-core CPU for all but very large matrices, but for very-large matrices the, number of tasks then quickly becomes excessive.

First of all, note that you can parallelize the M and N loops, corresponding to blocks of A and B, respectively. Parallelizing the K loop would be awkward (e.g. race conditions on updating C?), so I’ll ignore that possibility.

We could imagine a simple “end game” algorithm for extremely large matrices:

For very large matrices A, with enough rows that you can divide reasonable blocks up among cores and not have an awkward remainder.
If the overhead on spawning tasks were smaller, the best approach for very large matrices may be just a slight modification of the current algorithm: multiply the size of the tiles of B by Threads.nthreads(). These are supposed to live in the L3 cache, which is shared between cores. Then spawn separate tasks, which take tiles of A (in the L2 cache, local to specific cores) and multiply them with the giant tile of B.
This would have the advantage of enabling very big tiles of B, amortizing the cost of packing the A-tiles.

To generalize this for matrices too small to qualify for 1., we could find out how many blocks of B we need to get good utilization of the CPU cores, then divide up the L3 space for that many blocks of B. Parallelize these B blocks, and the A blocks within them.
This algoirthm transitions nicely to the extreme case above.

If matrices are too small, we could start shrinking B and A block sizes for the sake of increased parallelism.

FixedSizeArrays use LoopVectorization for broadcasting

julia> using PaddedMatrices, StaticArrays, BenchmarkTools

julia> Afs = @FixedSize randn(7,9); Asm = SMatrix{7,9}(Array(Afs));

julia> bfs = @FixedSize rand(7); bsv = SVector{7}(bfs);

julia> @btime @. $Afs + log($bfs)
  74.276 ns (1 allocation: 672 bytes)
7×9 FixedSizeArray{Tuple{7,9},Float64,2,Tuple{1,8},79} with indices 1:1:7×1:1:9:
  0.76966     1.85181    -0.319665   1.61656    0.217049   0.382989  -0.746271   0.660974   0.393029
 -3.42086    -4.45456    -4.39892   -3.04383   -2.43171   -2.8492    -5.49892   -3.58348   -3.77883
 -2.11739    -2.75008    -0.982609  -1.16154   -1.06353   -2.63578   -2.49486   -1.59256   -2.19798
  1.69544    -0.754236   -0.523444  -1.05262   -2.22792   -1.26843   -0.590602   1.47988   -1.0712
 -1.21998     0.0110744  -1.47765   -0.348365  -0.381716   0.174942  -0.18653   -2.88394   -0.318937
 -0.0642002   0.0172898  -0.807196   1.16906   -0.345746   1.38633    0.964985   0.449694   0.234085
 -0.800639   -1.26078     0.485136  -1.02432   -0.131706   0.520508  -1.50308   -1.72517   -1.96775

julia> @btime @. $Asm + log($bsv)
  293.517 ns (0 allocations: 0 bytes)
7×9 SArray{Tuple{7,9},Float64,2,63} with indices SOneTo(7)×SOneTo(9):
  0.76966     1.85181    -0.319665   1.61656    0.217049   0.382989  -0.746271   0.660974   0.393029
 -3.42086    -4.45456    -4.39892   -3.04383   -2.43171   -2.8492    -5.49892   -3.58348   -3.77883
 -2.11739    -2.75008    -0.982609  -1.16154   -1.06353   -2.63578   -2.49486   -1.59256   -2.19798
  1.69544    -0.754236   -0.523444  -1.05262   -2.22792   -1.26843   -0.590602   1.47988   -1.0712
 -1.21998     0.0110744  -1.47765   -0.348365  -0.381716   0.174942  -0.18653   -2.88394   -0.318937
 -0.0642002   0.0172898  -0.807196   1.16906   -0.345746   1.38633    0.964985   0.449694   0.234085
 -0.800639   -1.26078     0.485136  -1.02432   -0.131706   0.520508  -1.50308   -1.72517   -1.96775

We pay a price in allocations, but that’s comparatively small.

Optional Sizing

julia> using PaddedMatrices

julia> A = StrideArray{Float64}(undef, (100,PaddedMatrices.Static(3)))

julia> axes(A,1) |> typeof
Base.OneTo{Int64}

julia> axes(A,2) |> typeof
VectorizationBase.StaticUnitRange{1,3}

LoopVectorization can use read sizes from a StaticUnitRange to know the dimensions while optimizing a loop, and of course the Julia’s compiler itself/LLVM will be aware.

Again, this library is experimental and a work in progess, but I figured some folks may find it interesting, or – better yet – useful.

47 Likes

Awesome benchmarks!

Do you know if using @llvm.matrix.multiply makes sense for small fixed-size matrices?

Regarding the multi-threaded algorithm you described, would it mean that the tasks will be joined/synchronized before and after the packing of \tilde B_p? (I’m guessing packing means copying to a contiguous array?) I don’t have a good intuition there, but, I’m guessing that an alternative approach to use separate \tilde B_p for each task/thread has less synchronization overheads?

3 Likes

I haven’t tried it yet, but it is definitely worth trying.
It was introduced in llvm10, and I’m still on llvm9 on my default Julia.
But I did build Julia with (an early commit of) llvm11 recently to try out Brutus…

I’ll test it and benchmark in the morning.

It needs to synchronize on each iteration of the 1:kc:K loop, because those iterations have to be performed serially. This is the loop that packs \tilde{B}_p (which does mean copy to a contiguous array).
So yes, that would mean we’d have to synchronize somehow here.
I’ve never used locks before. Would that be the way to do it?
Alternatively, you can have just a small number of tasks (possibly 1) with their own \tilde{B}_p, and have them spawn new tasks for each \tilde{A}_p. That would be easy to implement, but have the overhead of lots of spawned tasks over the course of the matmul.
Synchronization is probably the better (faster) route?
Ideally, it should play nice with partr.

The problem with giving each thread it’s own B-packed is that these will be smaller. We need to repack the entirety of A for each \tilde{B}_p (or maybe cache \tilde{A}_p as we build it? That`d need some synchronization, and increase total memory requirements – at which point we should start asking, “why not Strassen?”), which is introducing it’s own overhead.

So there’s a trade-off in synchronization and repacking. Worth trying both (also with unequally sized A and B) to see how they compare.

3 Likes

This is pretty amazing! We (a colleague and I) just revived the old example from Julia 0.4 for doing parallel LU. Right now, the hpl_shared.jl in this repo is where the multi-threaded version lives.

I’m looking for the fastest 64x64 or 128x128 matmul in Julia. Currently, the sequential run of hpl_shared.jl is about 10-15% slower than openblas for the 4096x4096 matmul I have in there. It is basically breaking up the LU into small problems and dispatching to BLAS/LAPACK kernels.

This code gives some parallelism with threading, but more work needs to be done to expose all the parallelism possible.

With where this is at, I feel that that matching high performance implementations is within reach.

-viral

6 Likes

@Elrod, You are amazing!
I remember when we talked about SIMD and the trick about padding Matrices so each row will number of elements which is a multiplication of the SIMD width which brought PaddedMatrices. You made miracles with that trick. Bravo!

Could you tell at what matrix size OpenBLAS / MKL starts using Multi Threading?
I also think it would be nice to compare to MKL’s JIT feature for small matrices.

With Julia synthetic sugar you can create broadcasting of Matrix Multiplication in a style like Packed GEMM and Batch GEMM.

By the way, I think Images could benefit a lot from this package.
The approach of padding is widely used in this area and it solves the issue with taking care of aligned load and the tail of each row.

1 Like

@tkf

So far, these intrinsics have been a great way to crash Julia.

julia> x = rand(120);

julia> using SIMDPirates

julia> @code_llvm debuginfo=:none SIMDPirates.vcolumnwiseload(pointer(x), 4, Val(3), Val(3))
define <9 x double> @julia_vcolumnwiseload_1445(i64 %0, i64 %1) {
top:
  %2 = trunc i64 %1 to i32
  %ptr.i = inttoptr i64 %0 to <9 x double>*
  %res.i = call <9 x double> @llvm.matrix.columnwise.load.v9f64.p0v9f64(<9 x double>* %ptr.i, i32 %2, i32 3, i32 3)
  ret <9 x double> %res.i
}

julia> @code_native debuginfo=:none SIMDPirates.vcolumnwiseload(pointer(x), 4, Val(3), Val(3))

signal (11): Segmentation fault
in expression starting at REPL[4]:1
unknown function (ip: 0x7f05fabef219)
unknown function (ip: 0x556aeb05ae4f)
Allocations: 27836302 (Pool: 27823532; Big: 12770); GC: 25
fish: “/home/chriselrod/Documents/lang…” terminated by signal SIGSEGV (Address boundary error)

Trying 4x4:

julia> x = rand(120);

julia> using SIMDPirates

julia> @code_native debuginfo=:none SIMDPirates.vcolumnwiseload(pointer(x), 4, Val(4), Val(4))
LLVM ERROR: Do not know how to split the result of this operator!

Trying matrix multiplication:

julia> using SIMDPirates

julia> vA9 = ntuple(Val(9)) do _ Core.VecElement(rand()) end;

julia> vB9 = ntuple(Val(9)) do _ Core.VecElement(rand()) end;

julia> @code_llvm debuginfo=:none SIMDPirates.vmatmul(vA9, vB9, Val(3), Val(3), Val(3))
define <9 x double> @julia_vmatmul_1626(<9 x double> %0, <9 x double> %1) {
top:
  %res.i = call <9 x double> @llvm.matrix.multiply.v9f64.v9f64.v9f64(<9 x double> %0, <9 x double> %1, i32 3, i32 3, i32 3)
  ret <9 x double> %res.i
}

julia> @code_native debuginfo=:none SIMDPirates.vmatmul(vA9, vB9, Val(3), Val(3), Val(3))

signal (11): Segmentation fault
in expression starting at REPL[5]:1
unknown function (ip: 0x7f1844a60f2d)
unknown function (ip: 0x558f2fe159cf)
Allocations: 31635619 (Pool: 31621726; Big: 13893); GC: 30
fish: “/home/chriselrod/Documents/lang…” terminated by signal SIGSEGV (Address boundary error)

Trying 4x4:

julia> using SIMDPirates

julia> vA16 = ntuple(Val(16)) do _ Core.VecElement(rand()) end;

julia> vB16 = ntuple(Val(16)) do _ Core.VecElement(rand()) end;

julia> @code_llvm debuginfo=:none SIMDPirates.vmatmul(vA16, vB16, Val(4), Val(4), Val(4))
define <16 x double> @julia_vmatmul_1513(<16 x double> %0, <16 x double> %1) {
top:
  %res.i = call <16 x double> @llvm.matrix.multiply.v16f64.v16f64.v16f64(<16 x double> %0, <16 x double> %1, i32 4, i32 4, i32 4)
  ret <16 x double> %res.i
}

julia> @code_native debuginfo=:none SIMDPirates.vmatmul(vA16, vB16, Val(4), Val(4), Val(4))
LLVM ERROR: Do not know how to split the result of this operator!

There are plenty of examples here, but all I’ve gotten so far is crashes.

Thanks! @YingboMa also expressed interest in LU. I’ll add a 5-argument jmul! method soon, but it sounds like your approaches are different.

If you only need 128x128 and 64x64, you could use LoopVectorization for the kernels directly. The kernel implementations are simply three loops.
Versus OpenBLAS:

julia> using PaddedMatrices: loopmulprefetch!, loopmul!

julia> using LinearAlgebra, BenchmarkTools; BLAS.set_num_threads(1)

julia> loopmul! === loopmulprefetch!
false

julia> M = K = N = 128;

julia> A = rand(M,K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);

julia> C2 = similar(C1); C3 = similar(C1);

julia> @benchmark loopmul!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     37.931 μs (0.00% GC)
  median time:      38.113 μs (0.00% GC)
  mean time:        38.169 μs (0.00% GC)
  maximum time:     107.158 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark loopmulprefetch!($C2, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     39.655 μs (0.00% GC)
  median time:      39.823 μs (0.00% GC)
  mean time:        39.874 μs (0.00% GC)
  maximum time:     84.034 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C3, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     39.916 μs (0.00% GC)
  median time:      40.078 μs (0.00% GC)
  mean time:        40.140 μs (0.00% GC)
  maximum time:     116.362 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> C1 ≈ C2 ≈ C3
true

Static sizing improves performance a little:

julia> using LoopVectorization

julia> function AmulB128!(C, A, B)
           @avx for n in 1:128, m in 1:128
               Cmn = zero(eltype(C))
               for k in 1:128
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB128! (generic function with 1 method)

julia> @benchmark AmulB128!($C4, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     36.378 μs (0.00% GC)
  median time:      36.739 μs (0.00% GC)
  mean time:        36.779 μs (0.00% GC)
  maximum time:     111.469 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

I have benchmarks (testing the dynamic implementation) on all square sizes from 2 through 256 here, and the micro kernel performs quite well at most multiples of 8.
Without AVX512 but with AVX2, it should perform well on multiples of 4.

MKL, however, is faster.

Also, you may want to adjust the size of these kernels based on element type and architecture.
For AVX512 and double precision, for example, LoopVectorization will use a 40x5 microkernel.
That means 128x128 multiplication involves 3-row iterations of the kernel, and then an iteration that handles the remaining 8.
It’ll also need 25 column-iterations, and then handle the remaining 3.
A 120x120 kernel, however, would need 3-row iterations and 24 column-iterations, without any remainders.
Although the performance difference seems small, at best, you can see the tallest peaks in the chart are at multiples of 40.

julia> function AmulB120!(C, A, B)
           @avx for n in 1:120, m in 1:120
               Cmn = zero(eltype(C))
               for k in 1:120
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB120! (generic function with 1 method)

julia> M = K = N = 120;

julia> A = rand(M,K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB120!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.194 μs (0.00% GC)
  median time:      29.386 μs (0.00% GC)
  mean time:        29.437 μs (0.00% GC)
  maximum time:     96.440 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9 * 128^3 / 36.378e-6
115.29781736214196

julia> 2e-9 * 120^3 / 29.194e-6
118.38048914160446

Less than 3% difference, which could be noise and probably not worth making major-changes over.

For AVX(2), it’ll use a 12x4 micro kernel. I haven’t run benchmarks at a range of sizes on that platform, but that’d suggest 120x120 might get better performance there as well. Altohough if there is a difference, probably by similarly slim-margins.
Meaning this probably shouldn’t overrule other architectural reasons to prefer a power-of-2.

@RoyiAvital
Thanks. Back then, I didn’t even know what alignment meant.
The paddedmatrices here are always aligned to a multiple-of-simd-vector-width boundary, so that each of their columns are aligned.
Masked loads/stores mean that it isn’t needed for vectorization, but the faster memory accesses still improve performance. They also have the benefit of not needing to worry about subnormal numbers in the padded memory.

The benchmarks I posted were single threaded. OpenBLAS is set to start using them at 50. I couldn’t immediately find documentation on that flag, but maybe it means M*K*N >= 50^3?

MKL is much faster when multithreading. It ramped up performance very quickly, and strangely actually achieved higher performance / core than single threaded – roughly their sub-L2 GFLOPS on every core, IIRC.

The approach of padding is widely used in this area and it solves the issue with taking care of aligned load and the tail of each row.

I have my own projects/plans for this package. It will be a while before I would be able to focus on projects like image filtering myself, but in the mean time I would by happy to answer questions, fix bugs, assist others in making PRs, etc.

LoopVectorization has developer docs (largely thanks to Tim Holy), while libraries like PaddedMatrices are much simpler.

Do you know when MKL starts using Multi Threaded Algorithm?
By the way, have you looked on MKL-DNN (Now One-DNN)?

Their code is open sourced and it is made by Intel.
So it might be their strategy for Multi Threading will be a good starting point.

Thanks, I’ll take a look next time I play around with threading.

Any API suggestions?

Maybe I’d start with a Macro for Loop?
Namely the user will write a loop, annotate it with a macro and voila :-).

That way the API isn’t that important and you can just stick to what Intel has done.

Note the spikes in the very first plot at 8,16,24 etc. bytes.
IMHO that is an effect of mathching to cache line sizes, as I said in another thread.
I believe the posh name for this is striding.

Also I not HPL.jl ismentioned, which makes an HPC person’s heart sing.
HPL is heavily debated as a relevant benchmark but
a) it is the one which is used for machine to machine comparison
b) leaving its merits for real world code predictions aside it acts as a fine method to heat up and exercise all parts of a machine

1 Like

That shouldn’t be a problem. I was hoping to see a bigger performance improvement:

julia> using PaddedMatrices

julia> A7 = @FixedSize randn(7,7);

julia> pointer(A7) - pointer_from_objref(A7)
0x0000000000000010

julia> reinterpret(Int, pointer(A7)) % 64
0

julia> A7
7×7 FixedSizeArray{Tuple{7,7},Float64,2,Tuple{1,8},63} with indices 1:1:7×1:1:7:
 -1.90547     0.721107  -0.260631  -0.117046  -2.08072   -0.537309    1.13826
 -0.738925    0.483948   1.4337     0.666807  -0.989619  -0.767602    0.913178
 -1.50427    -0.504106  -0.959064   0.778618   0.515979   0.942165   -2.61014
 -0.490966   -2.12725    1.54583   -0.297469   1.28771    0.0597469   0.209404
  0.372383   -0.641325  -1.63048   -0.167892  -0.952446   0.0125307   0.882734
 -0.568784    2.28455    1.11837   -0.669973   0.863629  -0.195866   -0.128418
  0.0370707  -0.288556   0.382812  -1.79589    1.03886    1.21452    -0.00506172

julia> unsafe_load(pointer(A7), 1)
-1.905474142429087

julia> unsafe_load(pointer(A7), 9)
0.7211069374508638

julia> unsafe_load(pointer(A7), 17)
-0.2606314548914012

The pointer is aligned to a 64-byte boundary, as is the first element of every column (even though the matrix is 7x7).

Not a problem at all ! I was pointing out that the physical layout of a CPU can have effects like this., and you can see them if you look hard.

I thought you just meant memory alignment, i.e., memory being aligned to cache line boundaries. Do you mean something else by “matching to cache line sizes”?

LOL. Thanks for trying it out and sharing the result. Let’s hope it’d be fixed at some point…

I was mentioning \tilde B_p because, IIUC, sharing \tilde B_p introduces a non-parallelized part of the code (or one more extra synchronization if it makes sense to parallelize copying), is that right? I wondered if it’d be nice to somehow avoid it. I may be saying something very stupid as this is only O(KN) but, since loading cold memory is much slower than floating arithmetics, and using more threads may not help with it, I thought I’d ask about it.

I usually fetch or wait on tasks for synchronization. I guess it may be possible to come up with a better queue than Julia is using that is specific to the program you have. I don’t know what it’s like though.

Do you mean nested parallelism or something else? (IIUC nested parallelism doesn’t matter for your case because your function is at the “leaf” of the call tree. OTOH, something like threaded mapreduce needs to care about it since it may be calling something like threaded GEMM.)

1 Like

I made tweaks, that unfortunately appear to have caused a performance regression for larger arrays.

I did however add specialized methods for arrays with static strides of 2 and 4, and added PtrArray to the sized benchmarks. (EDIT: These don’t actually seem to be better than just what LLVM does with unrolled code, so I’ve turned these back off on my local master).
As far as I understand, I should have more or less identical performance to PaddedArray. As you can see from the graph, my understanding is limited.

For 10,000 x 10,000 arrays, I’m not currently beating the GFLOPS I get for an 8x8 PtrArray.
But the larger kernel being used for those arrays should be much more efficient.
I think I need to add real prefetch support to LoopVectorization, so I can have it add them at better places. That’s the current reason I don’t have prefetching with AVX2 yet.

I also experimented a little with rearranging the elements of A while packing, into an order so that we load elements contiguously, hoping that triggers the hardware prefetcher and eliminates the need for software prefetching. Performance wasn’t as good.

I think we should parallelize the packing. Each thread that shares a \tilde{B}_p could pack a chunk, and then they must synchronize before resuming the multiplication.
If we don’t parallelize packing, we’d still need a synchronization step for the other threads to wait until packing is complete before reading from it.

The synchronization wouldn’t be needed if each thread had their own \tilde{B}_p, without packing. But synchronization should be cheaper than packing \tilde{A}_p, which is why I suspect this to be the better approach.

Packing \tilde{B}_p is O(KN), while packing \tilde{A}_p is O(MKN). While A itself is only MxK elements, we repack for for each \tilde{B}_p – hence, the desire to make \tilde{B}_p larger.

I should really spend some time look through some of the opensource BLAS libraries and see if I can figure out how they do it. I’m probably doing some things wrong already (I’m slower), and especially MKL have exceptional multithreaded performance.

Hmm. How will this work with subdividing the problem?

Referring back to packing \tilde{B}_p, I’d like several threads to be able to pack it, then proceed to start packing \tilde{A}_p before checking to confirm everyone has finished their chunk of \tilde{B}_p, and continuing on with calculating \tilde{A}_p \times \tilde{B}_p.

I could YOLO it and just assume that of-course they’ll be done by the time \tilde{A}_p is done, but that’d bound to cause problems eventually.

If instead one thread runs things by @spawn and then fetch/wait, and then @spawns for the next task, those new @spawns may not end up on the same physical core?
Because \tilde{B}_p is supposed to live in (shared-between-cores) L3 instead of (core-local) L2, maybe that isn’t much of problem. We just won’t be able to pack \tilde{A}_p before checking that \tilde{B}_p is done.

Alternatively, maybe I could implement it with a few refs or atomic variables, and busy-wait with while loops.

Yeah, you’re right. The only real concern there is that we’d want it to be able to nest inside other parallel code. The other parallel code may also call parallel instances of jmult! that shouldn’t step on each other. But it doesn’t sound like that should be a real problem. The preallocated memory for packing is already set up to be thread local, for example.

1 Like

Nim’s laser package might be the easiest reading:

4 Likes

Yeah, sounds like a great approach than hearing my guesses :+1:

Is \tilde A_p an array that concatenates \tilde A_i? So the plan is to create \tilde A_p as well as \tilde B_p for each iteration of the 3rd loop, while a sub-array \tilde A_i is loaded to L1 for each iteration of the 2nd loop? If that’s the case, why not spawn the tasks for creating \tilde A_p and \tilde B_p at once and then wait for them (2 * nthreads() tasks, I guess)?

There is no public API for even pinning task to a thread. I guess you can write your own “task pool” to give some hints to partr:

julia> function demo()
           chans = map(1:Threads.nthreads()) do _
               Channel(1, spawn=true) do ch
                   for f in ch
                       f === nothing || f()
                   end
               end
           end
           push!(chans[1], () -> @show Threads.threadid())
           push!(chans[1], () -> @show Threads.threadid())
           return chans
       end
demo (generic function with 1 method)

julia> chans = demo()
       sleep(1)
Threads.threadid() = 1
Threads.threadid() = 1

julia> foreach(close, chans)

Though my impression is that Channel has higher overhead than @spawn/fetch.

1 Like

Thanks, worth taking a look at. @YingboMa has been walking me through the OpenBLAS source. On my desktop, it does very well over the size range 200-10,000.
MKL eventually overtakes it (and achieves mysteriously impressive speeds below 190x190, insensitive to factors such as memory layout), but OpenBLAS’s performance over that range in single threaded performance captured our interest/focus.

To clarify, the BLIS algorithm, which I’ve been using, is:

# C is M x N
# A is M x K
# B is K x N
for n in 1:n_c:N
    for k in 1:k_c:K
        # pack B_p
        for m in 1:m_c:M
            # pack A_p
            # call macrokernel
        end
    end
end

B_p is in L3, and A_p in L_2. The macrokernel iterates over slices of B_p that fit in the L1 cache.
It’s because B_p is preferably very large, taking up more than 1/num_cores % of the shared L3 cache, that it makes sense to let it be large, and share it among threads.

Because A_p is in the L2 cache (and thus local to one core), these should probably be unique per thread.

However, this is not the algorithm OpenBLAS uses. It seems to user finer grained packing, closer to the microkernel. Aside from getting the best single threaded performance (at least on this computer, over a large size range), this is also more compatible with your suggestions. I’ll have to look more closely, but it looks like they pack A_p into L2, and then pack B_p into L1 within the macrokernel (and don’t pack B outside of it).
Although, I haven’t actually looked at OpenBLAS’s threading code yet. Also worth pointing out that MKL’s threading is far better than OpenBLAS’s (while OpenBLAS’s would also be extremely difficult to match).

My primary interests here are

  1. Small-matrix performance
  2. Not suffering at large sizes, so that someone looking for small matrix performance doesn’t have to be concerned with a major trade-off.
  3. Enabling high-performance Julia Linear Algebra libraries well suited for smaller problems, such as RecursiveFactorization.jl
  4. Proof of concept that these libraries can perform well.

In terms of low hanging fruit, large-matrix linear algebra provides limited picking.

YingoboMa has been working on MaBLAS.jl.
Benchmarks over the range 2:256:


Rather than being based on LoopVectorization (which it uses for clean up loops), MaBLAS.jl uses custom code to generate kernels. It has an API making it easy to specify different kernels or BLIS-style blocking factors and fine grained control over packing in the API.
I added three instances of MaBLAS to the mix, each with different kernels, as well as Gaius, OpenBLAS, MKL, and PaddedMatrices.
MaBLAS will probably switch to just using the clean up loops below a certain size, as the LoopVectorization-kernels do very well at handling dynamic sizes.

MKL is ahead from 40-190, while PaddedMatrices stays in second through this range, and first through 256. MaBLAS’s kernels generate better assembly than LoopVectorization does. LoopVectorization’s kernels load pointers and/or offsets from memory on every iteration, and then use these to load the elements being operated on, while MaBLAS’s kernels do exactly what you want: no reloading of pointers, no incrementing of any integers aside from the loop induction variable.
For the life of me, I haven’t been able to get LLVM to do this correctly, but it doesn’t seem to be too big a deal.
The performance is also notably higher than it was in the opening post, e.g. PaddedMatrices now staying above 90 GFLOPS from 200-256, instead of dipping into the low 80s.

Larger arrays:


It’s hard to separate most of them over most of this range, while they hover around the 100 GFLOPS mark. OpenBLAS comes out ahead, while the 32x6 and 40x5 MaBLAS routines aren’t far behind.

I think this is a great showing from Julia BLAS!

9 Likes

This is a great progress.
I wonder how much of the wonder of @avx should be imported into @simd (Which in my opinion should be more conservative).

Regarding optimized kernels for Small Matrices, have a look at BLASFEO.
For single threaded they are much better than MKL up to sizes of ~400.

  1. I’d have to see benchmarks to believe that.
  2. They don’t support my architecture, so I’d be forced to use Haswell, which limit my CPU to a theoretical peak of 70 GFLOPS, below the above results.

EDIT:
I’d like to add:
I tuned those libraries on the computer I ran the benchmarks on.
How well will they generalize to other computers? Hopefully reasonably well, but I wouldn’t necessarily expect them to so closely match performance over that range with OpenBLAS and MKL on different architecures. Similarly, everyone’s benchmarks claim they’re the fastest. Polly claims they can optimize gemm to be close to OpenBLAS, but in my benchmarks polly took an eternity to compile, maybe achieving a third the speed of OpenBLAS, not even matching icc or ifort applied to triple loops and gemm-detection disabled (and also having more or less instantaneous compile times), so I quit bothering with it and removed Polly from my LoopVectorization benchmarks. Similarly, Eigen claims to have competitive performance, but doesn’t do well at all either. Unlike Polly, Eigen is pleasant to compile, so it gets to stay in the LoopVectorization benchmarks.
I’d have to run benchmarks locally to “believe” them. Same should apply to you or anyone reading these. Sure, I got good performance on my computer after tuning them for it, but will they get good performance on yours? I’d be skeptical until these have matured much more.

1 Like