PaddedMatrices is an experimental and not well tested. Issues/PRs/comments are all welcome.
The package is registered, so you can install via ] add PaddedMatrices
.
More functionality is coming in the future, but the two main things it does now:
- Julia BLAS.
- Optionally sized arrays, dispatching on the Julia BLAS by default.
Initializing Arrays
You can create random FixedSizeArray
s via the @FixedSize
macro. This macro currently only accepts rand
, and randn
for generating Float64
elements.
Creating random arrays
julia> using PaddedMatrices
julia> A = @FixedSize rand(8,8)
8×8 FixedSizeArray{Tuple{8,8},Float64,2,Tuple{1,8},71} with indices 1:1:8×1:1:8:
0.683092 0.0519738 0.123364 0.894025 0.13376 0.148439 0.604513 0.689693
0.318745 0.751111 0.214771 0.221637 0.782232 0.329274 0.210776 0.151637
0.691374 0.0600692 0.317934 0.25582 0.881472 0.734963 0.603998 0.855192
0.674639 0.717255 0.60677 0.343281 0.119153 0.340493 0.272141 0.202147
0.992593 0.0692194 0.256404 0.922214 0.129515 0.473576 0.863022 0.749924
0.115591 0.764342 0.484487 0.76594 0.970529 0.474724 0.5216 0.346497
0.214091 0.931094 0.622827 0.00131706 0.100016 0.103876 0.27451 0.912466
0.82846 0.000283631 0.784817 0.716793 0.711505 0.336038 0.723893 0.824162
julia> x = @FixedSize rand(3)
3-element FixedSizeArray{Tuple{3},Float64,1,Tuple{1},15} with indices 1:1:3:
0.09671375198196175
0.5290890298633636
0.2891468539564671
julia> Y = @FixedSize randn(2,3,4)
2×3×4 FixedSizeArray{Tuple{2,3,4},Float64,3,Tuple{1,2,6},31} with indices 1:1:2×1:1:3×1:1:4:
[:, :, 1] =
0.162202 -0.122953 0.268173
1.92397 1.85739 0.738878
[:, :, 2] =
-0.791952 -0.181461 -0.515349
0.160009 -0.827614 1.59702
[:, :, 3] =
1.17112 -0.902177 -0.949055
0.863118 -1.24432 0.94861
[:, :, 4] =
-0.914365 -0.681319 0.428368
-1.93309 -0.552553 0.859877
These RNGs should be relatively fast. You can set the seed with PaddedMatrices.VectorizedRNG.seed!
, which is independent of Random.seed!
. Setting the seed is not thread-safe (it effects all threads).
Benchmarking small, fixed size matrix multiplication
Small, fixed-size, array benchmarks, comparing with SMatrix
and MMatrix
from StaticArrays.jl.
PtrArray
is the fastest at all sizes, but SMatrix
is faster than the FixedSizeArray
for 2x2
, 3x3
and 8x8
matrices on my computer, while MMatrix
is only faster for 3x3
matrices (but not faster than PaddedArray
). Starting at 6x6
, MMatrix
is slower than the jmul!
method applied to regular-old Array{Float64,2}
s.
When creating
FixedSizeArray
s, they are padded by default. This is to align their columns in memory, which makes indexing the top of their columns faster, improving performance slightly.The
PtrArray
type has lower overhead, for some reason. The FixedSizeArray
is a mutable struct
that holds an NTuple
(the memory), and an aligned pointer to that memory. PtrArray
is just a struct
holding the pointer; any use must be protected with GC.@preserve memory
.
Beyond 8x8, multiplying regular (dynamically sized) Matrix{Float64}
using PaddedMatrices.jmul!
was faster than SMatrix
and MMatrix
multiplication. Of course, the StaticArrays documentation recommends them primarily for 10x10
arrays and smaller. Their compile times regress significantly beyond that, even more than the runtimes.
Worth pointing out that a low-hanging fruit for StaticArray
s performance is to start using muladd
. The SMatrix
type also has the advantage of getting the best performance out of a more convenient, non-mutating, API. The others were all benchmarked with in-place variants.
Results probably look different depending on instruction set.
Large, dynamically-sized array multiplication (single-threaded)
These benchmarks all used the base Matrix{T}
type, comparing OpenBLAS 0.3.9, Intel MKL through ccall
, Gaius.jl and PaddedMatrices.jmul!
.
Float64
benchmarks:
OpenBLAS had the best performance for large arrays, while MKL had the best performance between
40x40
and 200x200
.Once the arrays no longer fit in the L2 cache (beyond the 200x200), PaddedMatrices’ performance dropped, and it never really recovered. This suggests the library still needs work on how it handles memory bandwidth.
It implements this algorithm (image belongs to the BLIS team):
You can see the default tiling parameters for your architecture via
julia> mc, kc, nc = PaddedMatrices.matmul_params(Float64, Float64, Float64)
(200, 300, 600)
Arguments are the element types of C
, A
, and B
, respectively, for PaddedMatrices.jmul!(C, A, B)
(meaning C = A * B
).
The three return values are the size of outer-loop increments for the M
, K
, and N
loops, respectively, where C
is M x N
, A
is M x K
, and B
is K x N
.
That is, it will work with mc x kc
tiles of A
and kc x nc
tiles of B
at a time, to calculate mc x nc
tiles of C
.
These sizes are meant to both (a) let the A
tiles fit in L2 cache and the B
tiles in L3, following the above diagram, and (b) be integer multiples of the dimension of the micro-kernels.
If you would like to play around with different values, you can call
PaddedMatrices.jmul!(C, A, B, Val(mc), Val(kc), Val(nc))
But be careful – you’ll likely get corrupted answers or a crash if you make the tiles of A
and B
much larger than what PaddedMatrices.matmul_params
returns.
You can get the size of the micro kernel via:
julia> (PaddedMatrices.mᵣ * PaddedMatrices.VectorizationBase.pick_vector_width(Float64), PaddedMatrices.nᵣ)
(40, 5)
This means (on this architecture) that you’d want to keep mc
a multiple of 40, and nc
a multiple of 5.
Gaius.jl
(which doesn’t yet use the same algorithm) uses a “cache oblivious” recursive algorithm, but unfortunately cache (and TLB)-aware tiling algorithms are needed to keep the memory-boogeyman at bay.
Using Float32
, OpenBLAS and MKL are neck and neck for large arays, while MKL still dominates at the more moderate sizes:
Interestingly, PaddedMatrices gets a little over twice the GFLOPS with single precision (instead of just twice).
Those BLAS libraries don’t support integer multiplication, so that provides an area where where we have a lot to gain. Benchmarking vs LinearAlgebra’s generic matmul for Int64
(note the mislabled y
-axis, it should be GIOPS
, because these aren’t floating point operations):
We get a lot less operations per second out of
Int64
than Float64
, primarilly because SIMD Int64
multiplication is very slow. On AVX512
architectures, like the computer where I ran these benchmarks, it is implemented using the vpmullq
instruction, which has at best 1/3 the throughput as fma
.Without AVX512, but with AVX2, it’s multiplied using
Int32
multiplication, shifts, and additions.Without AVX2, you don’t have SIMD integer operations.
Int32
does better:
But is still slower than
Float64
. Note that Float64
can exactly represent every integer from -(2^53)
to 2^53
, which is better than Int32
, which can only represent -(2^31)
to 2^31-1
.
So maybe integer matrix multiplication isn’t that useful. Still, it’s nice to see the performance you can gain vs a more naive implementations.
Performance relative to OpenBLAS/MKL will likely be a little worse on AVX2, because PaddedMatrices
is currently using prefetch instructions to help with cache latency with AVX512, but the way I added calls to them resulted in using an excessive amount with AVX2 to the point that performance was worse than not using any.
Large, dynamically-sized array multiplication (multi-threaded)
There is a function PaddedMatrices.jmult!
, but performance is bad at the moment.
I had it spawn a new task for every tile of A
and B
. This is too few tasks to get good utilization out of a many-core CPU for all but very large matrices, but for very-large matrices the, number of tasks then quickly becomes excessive.
First of all, note that you can parallelize the M
and N
loops, corresponding to blocks of A
and B
, respectively. Parallelizing the K
loop would be awkward (e.g. race conditions on updating C
?), so I’ll ignore that possibility.
We could imagine a simple “end game” algorithm for extremely large matrices:
For very large matrices A
, with enough rows that you can divide reasonable blocks up among cores and not have an awkward remainder.
If the overhead on spawning tasks were smaller, the best approach for very large matrices may be just a slight modification of the current algorithm: multiply the size of the tiles of B
by Threads.nthreads()
. These are supposed to live in the L3 cache, which is shared between cores. Then spawn separate tasks, which take tiles of A
(in the L2 cache, local to specific cores) and multiply them with the giant tile of B
.
This would have the advantage of enabling very big tiles of B
, amortizing the cost of packing the A
-tiles.
To generalize this for matrices too small to qualify for 1.
, we could find out how many blocks of B
we need to get good utilization of the CPU cores, then divide up the L3 space for that many blocks of B
. Parallelize these B
blocks, and the A
blocks within them.
This algoirthm transitions nicely to the extreme case above.
If matrices are too small, we could start shrinking B
and A
block sizes for the sake of increased parallelism.
FixedSizeArrays use LoopVectorization for broadcasting
julia> using PaddedMatrices, StaticArrays, BenchmarkTools
julia> Afs = @FixedSize randn(7,9); Asm = SMatrix{7,9}(Array(Afs));
julia> bfs = @FixedSize rand(7); bsv = SVector{7}(bfs);
julia> @btime @. $Afs + log($bfs)
74.276 ns (1 allocation: 672 bytes)
7×9 FixedSizeArray{Tuple{7,9},Float64,2,Tuple{1,8},79} with indices 1:1:7×1:1:9:
0.76966 1.85181 -0.319665 1.61656 0.217049 0.382989 -0.746271 0.660974 0.393029
-3.42086 -4.45456 -4.39892 -3.04383 -2.43171 -2.8492 -5.49892 -3.58348 -3.77883
-2.11739 -2.75008 -0.982609 -1.16154 -1.06353 -2.63578 -2.49486 -1.59256 -2.19798
1.69544 -0.754236 -0.523444 -1.05262 -2.22792 -1.26843 -0.590602 1.47988 -1.0712
-1.21998 0.0110744 -1.47765 -0.348365 -0.381716 0.174942 -0.18653 -2.88394 -0.318937
-0.0642002 0.0172898 -0.807196 1.16906 -0.345746 1.38633 0.964985 0.449694 0.234085
-0.800639 -1.26078 0.485136 -1.02432 -0.131706 0.520508 -1.50308 -1.72517 -1.96775
julia> @btime @. $Asm + log($bsv)
293.517 ns (0 allocations: 0 bytes)
7×9 SArray{Tuple{7,9},Float64,2,63} with indices SOneTo(7)×SOneTo(9):
0.76966 1.85181 -0.319665 1.61656 0.217049 0.382989 -0.746271 0.660974 0.393029
-3.42086 -4.45456 -4.39892 -3.04383 -2.43171 -2.8492 -5.49892 -3.58348 -3.77883
-2.11739 -2.75008 -0.982609 -1.16154 -1.06353 -2.63578 -2.49486 -1.59256 -2.19798
1.69544 -0.754236 -0.523444 -1.05262 -2.22792 -1.26843 -0.590602 1.47988 -1.0712
-1.21998 0.0110744 -1.47765 -0.348365 -0.381716 0.174942 -0.18653 -2.88394 -0.318937
-0.0642002 0.0172898 -0.807196 1.16906 -0.345746 1.38633 0.964985 0.449694 0.234085
-0.800639 -1.26078 0.485136 -1.02432 -0.131706 0.520508 -1.50308 -1.72517 -1.96775
We pay a price in allocations, but that’s comparatively small.
Optional Sizing
julia> using PaddedMatrices
julia> A = StrideArray{Float64}(undef, (100,PaddedMatrices.Static(3)))
julia> axes(A,1) |> typeof
Base.OneTo{Int64}
julia> axes(A,2) |> typeof
VectorizationBase.StaticUnitRange{1,3}
LoopVectorization can use read sizes from a StaticUnitRange
to know the dimensions while optimizing a loop, and of course the Julia’s compiler itself/LLVM will be aware.
Again, this library is experimental and a work in progess, but I figured some folks may find it interesting, or – better yet – useful.