PaddedMatrices is experimental and not well tested. Issues/PRs/comments are all welcome.

The package is registered, so you can install it via `] add PaddedMatrices`.

More functionality is planned, but the two main things it does now are:

- Julia BLAS.
- Optionally sized arrays, dispatching on the Julia BLAS by default.

#### Initializing Arrays

You can create random `FixedSizeArray`s via the `@FixedSize` macro. This macro currently only accepts `rand` and `randn` for generating `Float64` elements.

## Creating random arrays

```
julia> using PaddedMatrices
julia> A = @FixedSize rand(8,8)
8×8 FixedSizeArray{Tuple{8,8},Float64,2,Tuple{1,8},71} with indices 1:1:8×1:1:8:
0.683092 0.0519738 0.123364 0.894025 0.13376 0.148439 0.604513 0.689693
0.318745 0.751111 0.214771 0.221637 0.782232 0.329274 0.210776 0.151637
0.691374 0.0600692 0.317934 0.25582 0.881472 0.734963 0.603998 0.855192
0.674639 0.717255 0.60677 0.343281 0.119153 0.340493 0.272141 0.202147
0.992593 0.0692194 0.256404 0.922214 0.129515 0.473576 0.863022 0.749924
0.115591 0.764342 0.484487 0.76594 0.970529 0.474724 0.5216 0.346497
0.214091 0.931094 0.622827 0.00131706 0.100016 0.103876 0.27451 0.912466
0.82846 0.000283631 0.784817 0.716793 0.711505 0.336038 0.723893 0.824162
julia> x = @FixedSize rand(3)
3-element FixedSizeArray{Tuple{3},Float64,1,Tuple{1},15} with indices 1:1:3:
0.09671375198196175
0.5290890298633636
0.2891468539564671
julia> Y = @FixedSize randn(2,3,4)
2×3×4 FixedSizeArray{Tuple{2,3,4},Float64,3,Tuple{1,2,6},31} with indices 1:1:2×1:1:3×1:1:4:
[:, :, 1] =
0.162202 -0.122953 0.268173
1.92397 1.85739 0.738878
[:, :, 2] =
-0.791952 -0.181461 -0.515349
0.160009 -0.827614 1.59702
[:, :, 3] =
1.17112 -0.902177 -0.949055
0.863118 -1.24432 0.94861
[:, :, 4] =
-0.914365 -0.681319 0.428368
-1.93309 -0.552553 0.859877
```

These RNGs should be relatively fast. You can set the seed with `PaddedMatrices.VectorizedRNG.seed!`, which is independent of `Random.seed!`. Setting the seed is not thread-safe (it affects all threads).
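For example (assuming `seed!` takes an integer seed, the way `Random.seed!` does):

```
julia> using PaddedMatrices

julia> PaddedMatrices.VectorizedRNG.seed!(1) # reseeds the RNG on all threads

julia> x = @FixedSize rand(3); # draws are now reproducible for a given seed
```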

#### Benchmarking small, fixed size matrix multiplication

Small, fixed-size array benchmarks, comparing with `SMatrix` and `MMatrix` from StaticArrays.jl.

`PtrArray` is the fastest at all sizes, but `SMatrix` is faster than the `FixedSizeArray` for `2x2`, `3x3`, and `8x8` matrices on my computer, while `MMatrix` is only faster for `3x3` matrices (but not faster than `PaddedArray`). Starting at `6x6`, `MMatrix` is slower than the `jmul!` method applied to regular-old `Array{Float64,2}`s.
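For reference, calling `jmul!` on plain `Array`s looks like this (a usage sketch; timings vary by machine):

```
julia> using PaddedMatrices, BenchmarkTools

julia> A = rand(8, 8); B = rand(8, 8); C = similar(A);

julia> @btime PaddedMatrices.jmul!($C, $A, $B); # in-place C = A * B
```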

When creating `FixedSizeArray`s, they are padded by default. This is to align their columns in memory, which makes indexing the top of their columns faster, improving performance slightly. The `PtrArray` type has lower overhead, for some reason. The `FixedSizeArray` is a `mutable struct` that holds an `NTuple` (the memory) and an aligned pointer to that memory. `PtrArray` is just a `struct` holding the pointer; any use must be protected with `GC.@preserve memory`.
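The `GC.@preserve` pattern looks roughly like this (a sketch of the general idiom only, not a specific `PtrArray` constructor):

```
A = @FixedSize rand(8, 8)   # mutable struct: an NTuple plus an aligned pointer
GC.@preserve A begin        # keep A's memory rooted while raw pointers are in use
    # ... work with a pointer-backed PtrArray view of A here ...
end
```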
Beyond 8x8, multiplying regular (dynamically sized) `Matrix{Float64}`s using `PaddedMatrices.jmul!` was faster than `SMatrix` and `MMatrix` multiplication. Of course, the StaticArrays documentation recommends them primarily for `10x10` arrays and smaller; their compile times regress significantly beyond that, even more than the runtimes.

It's worth pointing out that a low-hanging fruit for `StaticArray`s performance is to start using `muladd`. The `SMatrix` type also has the advantage of getting the best performance out of a more convenient, non-mutating API. The others were all benchmarked with in-place variants.
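For context, `muladd(a, b, x)` computes `a*b + x` while letting the compiler fuse it into a single `fma` instruction:

```
julia> muladd(2.0, 3.0, 1.0) # a*b + x, eligible for hardware fma
7.0
```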

Results probably look different depending on instruction set.

#### Large, dynamically-sized array multiplication (single-threaded)

These benchmarks all used the base `Matrix{T}` type, comparing OpenBLAS 0.3.9, Intel MKL (through `ccall`), Gaius.jl, and `PaddedMatrices.jmul!`.

`Float64` benchmarks:

OpenBLAS had the best performance for large arrays, while MKL had the best performance between `40x40` and `200x200`. Once the arrays no longer fit in the L2 cache (beyond `200x200`), PaddedMatrices' performance dropped, and it never really recovered. This suggests the library still needs work on how it handles memory bandwidth.

It implements this algorithm (image belongs to the BLIS team):

You can see the default tiling parameters for your architecture via

```
julia> mc, kc, nc = PaddedMatrices.matmul_params(Float64, Float64, Float64)
(200, 300, 600)
```

Arguments are the element types of `C`, `A`, and `B`, respectively, for `PaddedMatrices.jmul!(C, A, B)` (meaning `C = A * B`).

The three return values are the sizes of the outer-loop increments for the `M`, `K`, and `N` loops, respectively, where `C` is `M x N`, `A` is `M x K`, and `B` is `K x N`.

That is, it will work with `mc x kc` tiles of `A` and `kc x nc` tiles of `B` at a time, to calculate `mc x nc` tiles of `C`.
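Schematically, the loop nest looks like the sketch below. This is not PaddedMatrices' actual implementation, which also packs the tiles into contiguous buffers and replaces the innermost loops with a SIMD micro-kernel:

```
# Naive sketch of the tiling structure described above.
function tiled_mul!(C, A, B, mc, kc, nc)
    M, K = size(A); N = size(B, 2)
    fill!(C, zero(eltype(C)))
    for n in 1:nc:N, k in 1:kc:K          # kc × nc tile of B (lives in L3)
        for m in 1:mc:M                   # mc × kc tile of A (lives in L2)
            # micro-kernel stand-in: updates the mc × nc tile of C
            for j in n:min(n + nc - 1, N), p in k:min(k + kc - 1, K), i in m:min(m + mc - 1, M)
                @inbounds C[i, j] += A[i, p] * B[p, j]
            end
        end
    end
    return C
end
```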

These sizes are meant to both (a) let the `A` tiles fit in the L2 cache and the `B` tiles in the L3 cache, following the above diagram, and (b) be integer multiples of the dimensions of the micro-kernel.

If you would like to play around with different values, you can call

```
PaddedMatrices.jmul!(C, A, B, Val(mc), Val(kc), Val(nc))
```

But be careful: you'll likely get corrupted answers or a crash if you make the tiles of `A` and `B` much larger than what `PaddedMatrices.matmul_params` returns.

You can get the size of the micro kernel via:

```
julia> (PaddedMatrices.mᵣ * PaddedMatrices.VectorizationBase.pick_vector_width(Float64), PaddedMatrices.nᵣ)
(40, 5)
```

This means (on this architecture) that you'd want to keep `mc` a multiple of 40 and `nc` a multiple of 5.
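On the same architecture, you can check that the default parameters are indeed such multiples:

```
julia> mc, kc, nc = PaddedMatrices.matmul_params(Float64, Float64, Float64);

julia> W = PaddedMatrices.VectorizationBase.pick_vector_width(Float64);

julia> (mc % (PaddedMatrices.mᵣ * W), nc % PaddedMatrices.nᵣ) # both should be 0
(0, 0)
```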

`Gaius.jl` (which doesn't yet use the same algorithm) uses a "cache-oblivious" recursive algorithm, but unfortunately cache- and TLB-aware tiling algorithms are needed to keep the memory boogeyman at bay.

Using `Float32`, OpenBLAS and MKL are neck and neck for large arrays, while MKL still dominates at the more moderate sizes:

Interestingly, PaddedMatrices gets a little over twice the GFLOPS with single precision (rather than exactly twice).

Those BLAS libraries don't support integer multiplication, so that is an area where we have a lot to gain. Benchmarking vs LinearAlgebra's generic matmul for `Int64` (note the mislabeled `y`-axis; it should read `GIOPS`, because these aren't floating point operations):
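These benchmarks exercise the same `jmul!` entry point with integer element types, e.g. (a sketch; the random inputs are illustrative):

```
julia> using PaddedMatrices

julia> A = rand(Int64, 100, 100); B = rand(Int64, 100, 100); C = similar(A);

julia> PaddedMatrices.jmul!(C, A, B); # integer matmul, C = A * B
```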

We get a lot fewer operations per second out of `Int64` than `Float64`, primarily because SIMD `Int64` multiplication is very slow. On `AVX512` architectures, like the computer where I ran these benchmarks, it is implemented using the `vpmullq` instruction, which has at best 1/3 the throughput of `fma`. Without AVX512, but with AVX2, it is multiplied using `Int32` multiplication, shifts, and additions. Without AVX2, you don't have SIMD integer operations.

`Int32` does better:

But it is still slower than `Float64`. Note that `Float64` can exactly represent every integer from `-(2^53)` to `2^53`, which is better than `Int32`, which can only represent `-(2^31)` to `2^31-1`. So maybe integer matrix multiplication isn't that useful. Still, it's nice to see the performance you can gain over a more naive implementation.

Performance relative to OpenBLAS/MKL will likely be a little worse on AVX2, because `PaddedMatrices` currently uses prefetch instructions to help with cache latency on AVX512, but the way I added calls to them resulted in an excessive number of prefetches on AVX2, to the point that performance was worse than not using any.

#### Large, dynamically-sized array multiplication (multi-threaded)

There is a function `PaddedMatrices.jmult!`, but its performance is bad at the moment. I had it spawn a new task for every tile of `A` and `B`. This is too few tasks to get good utilization out of a many-core CPU for all but very large matrices, yet for very large matrices the number of tasks quickly becomes excessive.

First of all, note that you can parallelize the `M` and `N` loops, corresponding to blocks of `A` and `B`, respectively. Parallelizing the `K` loop would be awkward (e.g., race conditions on updating `C`), so I'll ignore that possibility.

We could imagine a simple "end game" algorithm for extremely large matrices:

1. For very large matrices `A`, with enough rows that you can divide reasonable blocks up among cores without an awkward remainder: if the overhead on spawning tasks were smaller, the best approach may be just a slight modification of the current algorithm. Multiply the size of the tiles of `B` by `Threads.nthreads()`; these are supposed to live in the L3 cache, which is shared between cores. Then spawn separate tasks, each taking tiles of `A` (in the L2 cache, local to specific cores) and multiplying them with the giant tile of `B`. This would have the advantage of enabling very big tiles of `B`, amortizing the cost of packing the `A` tiles. (A sketch of this idea follows the list.)

2. To generalize this to matrices too small to qualify for `1.`, we could find out how many blocks of `B` we need to get good utilization of the CPU cores, then divide up the L3 space among that many blocks of `B`, parallelizing these `B` blocks and the `A` blocks within them. This algorithm transitions nicely to the extreme case above.

3. If the matrices are smaller still, we could start shrinking the `B` and `A` block sizes for the sake of increased parallelism.
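Here is a minimal sketch of the idea in `1.`, written with views and LinearAlgebra's 5-argument `mul!` as a stand-in for the packed micro-kernel; it is not the `jmult!` implementation:

```
# Sketch of the "giant B tile, per-task A tiles" idea; not PaddedMatrices' code.
using LinearAlgebra, Base.Threads

function threaded_tile_mul!(C, A, B, mc, kc, nc)
    M, N = size(C); K = size(A, 2)
    ncs = nc * nthreads()                       # one big B tile, shared via L3
    fill!(C, zero(eltype(C)))
    for n in 1:ncs:N, k in 1:kc:K
        Bt = @view B[k:min(k + kc - 1, K), n:min(n + ncs - 1, N)]
        @sync for m in 1:mc:M                   # each task owns an A tile (L2-local)
            At = @view A[m:min(m + mc - 1, M), k:min(k + kc - 1, K)]
            Ct = @view C[m:min(m + mc - 1, M), n:min(n + ncs - 1, N)]
            @spawn mul!(Ct, At, Bt, true, true) # Ct .+= At * Bt
        end
    end
    return C
end
```

Tasks within each `@sync` write to disjoint row-blocks of `C`, so they don't race; successive `k` iterations accumulate sequentially.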

#### FixedSizeArrays use LoopVectorization for broadcasting

```
julia> using PaddedMatrices, StaticArrays, BenchmarkTools
julia> Afs = @FixedSize randn(7,9); Asm = SMatrix{7,9}(Array(Afs));
julia> bfs = @FixedSize rand(7); bsv = SVector{7}(bfs);
julia> @btime @. $Afs + log($bfs)
74.276 ns (1 allocation: 672 bytes)
7×9 FixedSizeArray{Tuple{7,9},Float64,2,Tuple{1,8},79} with indices 1:1:7×1:1:9:
0.76966 1.85181 -0.319665 1.61656 0.217049 0.382989 -0.746271 0.660974 0.393029
-3.42086 -4.45456 -4.39892 -3.04383 -2.43171 -2.8492 -5.49892 -3.58348 -3.77883
-2.11739 -2.75008 -0.982609 -1.16154 -1.06353 -2.63578 -2.49486 -1.59256 -2.19798
1.69544 -0.754236 -0.523444 -1.05262 -2.22792 -1.26843 -0.590602 1.47988 -1.0712
-1.21998 0.0110744 -1.47765 -0.348365 -0.381716 0.174942 -0.18653 -2.88394 -0.318937
-0.0642002 0.0172898 -0.807196 1.16906 -0.345746 1.38633 0.964985 0.449694 0.234085
-0.800639 -1.26078 0.485136 -1.02432 -0.131706 0.520508 -1.50308 -1.72517 -1.96775
julia> @btime @. $Asm + log($bsv)
293.517 ns (0 allocations: 0 bytes)
7×9 SArray{Tuple{7,9},Float64,2,63} with indices SOneTo(7)×SOneTo(9):
0.76966 1.85181 -0.319665 1.61656 0.217049 0.382989 -0.746271 0.660974 0.393029
-3.42086 -4.45456 -4.39892 -3.04383 -2.43171 -2.8492 -5.49892 -3.58348 -3.77883
-2.11739 -2.75008 -0.982609 -1.16154 -1.06353 -2.63578 -2.49486 -1.59256 -2.19798
1.69544 -0.754236 -0.523444 -1.05262 -2.22792 -1.26843 -0.590602 1.47988 -1.0712
-1.21998 0.0110744 -1.47765 -0.348365 -0.381716 0.174942 -0.18653 -2.88394 -0.318937
-0.0642002 0.0172898 -0.807196 1.16906 -0.345746 1.38633 0.964985 0.449694 0.234085
-0.800639 -1.26078 0.485136 -1.02432 -0.131706 0.520508 -1.50308 -1.72517 -1.96775
```

We pay a price in allocations, but that's comparatively small.

#### Optional Sizing

```
julia> using PaddedMatrices
julia> A = StrideArray{Float64}(undef, (100, PaddedMatrices.Static(3)));
julia> axes(A,1) |> typeof
Base.OneTo{Int64}
julia> axes(A,2) |> typeof
VectorizationBase.StaticUnitRange{1,3}
```

LoopVectorization can read the sizes from a `StaticUnitRange` to know the dimensions while optimizing a loop, and of course Julia's own compiler/LLVM will be aware of them as well.
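For example (a sketch using LoopVectorization's `@avx` macro), a reduction over `A` can then be specialized on the statically sized second dimension:

```
using PaddedMatrices, LoopVectorization

function sum_demo(A)
    s = zero(eltype(A))
    @avx for j in axes(A, 2), i in axes(A, 1) # trip count of j is known statically
        s += A[i, j]
    end
    return s
end

A = StrideArray{Float64}(undef, (100, PaddedMatrices.Static(3)));
fill!(A, 1.0);
sum_demo(A) # 300.0
```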

Again, this library is experimental and a work in progress, but I figured some folks may find it interesting, or, better yet, useful.