Serious updates here!
I’ve abandoned the recursive memory layout, and am now using:
jBLAS.jl
SIMDArrays.jl
They require Julia 1.0 or 0.7.
If anyone wants to try them, you may have to ]dev SIMDArrays, and then, after activating SIMDArrays, dev or add jBLAS from GitHub. I’m not sure it will point to the correct place as is.
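Something like this in the Pkg REPL should work (I’m assuming both repos resolve from my GitHub; adjust the URLs if they’ve moved):

pkg> dev https://github.com/chriselrod/SIMDArrays.jl
pkg> activate ~/.julia/dev/SIMDArrays
pkg> dev https://github.com/chriselrod/jBLAS.jl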
Following a suggestion by @RoyiAvital in Strange performance of a loop - #36 by RoyiAvital, I am padding the number of elements in the first axis to be a multiple of the SIMD vector length.
I use CpuId.jl for info on SIMD vector length, cache sizes, etc. However, I haven’t actually tested it on other architectures yet to make sure performance is as it should be.
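For reference, here is a minimal sketch of the padding rule; padded_rows and simd_len are illustrative helpers, not part of either package:

using CpuId

simd_len(::Type{T}) where {T} = simdbytes() ÷ sizeof(T) # SIMD register width in elements of T
padded_rows(M, ::Type{T}) where {T} = cld(M, simd_len(T)) * simd_len(T) # round M up to a multiple of the vector length

padded_rows(5, Float64) # 8 on an AVX-512 machine, so each column of a 5-row matrix gets 3 elements of padding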
Before moving on to benchmarks vs StaticArrays’ MMatrices and Intel MKL, I want to point out that the slowdown in Julia one third slower than ccall-ing `@code_native` assembly compiled with gcc? is seemingly solved by declaring the matrices const and not interpolating them into the benchmarked expression. When I do that, the Julia functions become as fast as the equivalent ccall. So, on that note, a few helper functions:
using SIMDArrays, StaticArrays, BenchmarkTools, LinearAlgebra

function copyloop!(x, y) # Broadcasts are slow to compile at the moment; will fix when I implement broadcasting!
    @boundscheck size(x) == size(y) || throw(BoundsError())
    @inbounds for i ∈ eachindex(x, y)
        x[i] = y[i]
    end
end

function Base.convert(::Type{MMatrix}, A::SIMDArrays.SizedSIMDMatrix{M,N,T}) where {M,N,T}
    out = MMatrix{M,N,T}(undef)
    copyloop!(out, A)
    out
end

# e.g. create_constants(5,5,5) defines the const SIMDArrays D_5_5, A_5_5, X_5_5
# and their MMatrix copies mD_5_5, mA_5_5, mX_5_5.
function create_constants(M, N, P)
    dsym, asym, xsym = Symbol(:D_, M, :_, P), Symbol(:A_, M, :_, N), Symbol(:X_, N, :_, P)
    @eval const $dsym, $asym, $xsym = SIMDArrays.randsimd($M,$P), SIMDArrays.randsimd($M,$N), SIMDArrays.randsimd($N,$P)
    @eval const $(Symbol(:m,dsym)), $(Symbol(:m,asym)), $(Symbol(:m,xsym)) = convert(MMatrix, $dsym), convert(MMatrix, $asym), convert(MMatrix, $xsym)
    nothing
end
Notationally, let D = A * X, where D is M × P, A is M × N, and X is N × P.
Now on to the benchmarks (on a computer with AVX-512):
julia> create_constants(5,5,5)
julia> @btime mul!(mD_5_5, mA_5_5, mX_5_5) #multiplying StaticArrays MMatrices
20.477 ns (0 allocations: 0 bytes)
5×5 MArray{Tuple{5,5},Float64,2,25}:
1.66744 1.35667 1.34789 0.93313 1.13065
1.74005 1.59392 1.21913 0.906912 0.814635
1.41563 1.76972 1.90589 0.88914 1.40555
1.79611 1.4072 1.09164 0.97643 1.03881
1.59758 1.87103 2.00119 0.914708 1.40032
julia> @btime mA_5_5 * mX_5_5 #multiplying StaticArrays MMatrices creating stack allocated StaticMatrix
23.261 ns (0 allocations: 0 bytes)
5×5 SArray{Tuple{5,5},Float64,2,25}:
1.66744 1.35667 1.34789 0.93313 1.13065
1.74005 1.59392 1.21913 0.906912 0.814635
1.41563 1.76972 1.90589 0.88914 1.40555
1.79611 1.4072 1.09164 0.97643 1.03881
1.59758 1.87103 2.00119 0.914708 1.40032
julia> @btime mul!(D_5_5, A_5_5, X_5_5); D_5_5 #multiplying SIMDArrays
6.582 ns (0 allocations: 0 bytes)
5×5 SIMDArrays.SizedSIMDArray{Tuple{5,5},Float64,2,8,40}:
1.66744 1.35667 1.34789 0.93313 1.13065
1.74005 1.59392 1.21913 0.906912 0.814635
1.41563 1.76972 1.90589 0.88914 1.40555
1.79611 1.4072 1.09164 0.97643 1.03881
1.59758 1.87103 2.00119 0.914708 1.40032
julia> D_5_5.data
(1.66744138262064, 1.740048608459989, 1.415625303408632, 1.7961078018359855, 1.5975750881580377, 0.0, 0.0, 0.0, 1.356668103422378, 1.5939210629707454, 1.769720982419278, 1.4072025619380528, 1.8710277860227025, 0.0, 0.0, 0.0, 1.3478852456667783, 1.2191309243468347, 1.9058890396186112, 1.091636798493075, 2.001191775015244, 0.0, 0.0, 0.0, 0.9331304568989827, 0.9069123309829086, 0.8891401784342474, 0.9764296003231432, 0.9147076423296147, 0.0, 0.0, 0.0, 1.130651372750223, 0.814634811020856, 1.4055450158840463, 1.0388080685580772, 1.4003242549924564, 0.0, 0.0, 0.0)
In the last line, I print the tuple of elements inside a 5 × 5 SIMDMatrix, so you can see that each column is padded with three zeros (the 8 and 40 in the type parameters above are the padded column stride and the total tuple length). I need to initialize the padded memory; otherwise subnormals would make multiplication very slow. I didn’t try set_zero_subnormals(true), because I didn’t want to change the global state.
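To see why the padding has to be initialized, here is a quick way to observe the subnormal penalty (illustrative only; these lines aren’t from the benchmark machine):

a = fill(1e-320, 1000) # 1e-320 is a subnormal Float64
b = fill(1.0, 1000)
@btime $a .* $b; # products involving subnormals take a slow microcoded path on most x86 CPUs
@btime $b .* $b; # the same broadcast on normal values is much faster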
Thanks to this padding, performance remains stable as we increase M. MMatrices do a little better once we reach 8 × 5, probably because it is easier to vectorize (a column of 8 Float64s exactly fills one 512-bit register):
julia> create_constants(8,5,5)
WARNING: redefining constant X_5_5
WARNING: redefining constant mX_5_5
julia> @btime mul!(mD_8_5, mA_8_5, mX_5_5); #multiplying StaticArrays MMatrices
17.613 ns (0 allocations: 0 bytes)
julia> @btime mul!(D_8_5, A_8_5, X_5_5); #multiplying SIMDArrays
6.587 ns (0 allocations: 0 bytes)
Make the matrices a little larger, and StaticArrays begins to struggle:
julia> create_constants(5,8,8)
julia> @btime mul!(mD_5_8, mA_5_8, mX_8_8); #multiplying StaticArrays MMatrices
59.367 ns (0 allocations: 0 bytes)
julia> @btime mul!(D_5_8, A_5_8, X_8_8); #multiplying SIMDArrays
14.363 ns (0 allocations: 0 bytes)
Now at the boundary where MMatrices fall back to a BLAS call:
julia> BLAS.set_num_threads(1); BLAS.vendor()
:mkl
julia> create_constants(14,13,14)
julia> @btime mul!(mD_14_14, mA_14_13, mX_13_14); #multiplying StaticArrays MMatrices
441.764 ns (0 allocations: 0 bytes)
julia> @btime mul!(D_14_14, A_14_13, X_13_14); #multiplying SIMDArrays
55.935 ns (0 allocations: 0 bytes)
julia> create_constants(14,14,14)
WARNING: redefining constant D_14_14
WARNING: redefining constant mD_14_14
julia> @btime mul!(mD_14_14, mA_14_14, mX_14_14); #multiplying with MKL
227.986 ns (0 allocations: 0 bytes)
julia> @btime mul!(D_14_14, A_14_14, X_14_14); #multiplying SIMDArrays
59.304 ns (0 allocations: 0 bytes)
Now, trying a few different sizes:
julia> create_constants(24,24,24);
julia> @btime mul!(mD_24_24, mA_24_24, mX_24_24);#multiplying with MKL
404.270 ns (0 allocations: 0 bytes)
julia> @btime mul!(D_24_24, A_24_24, X_24_24); #multiplying SIMDArrays
247.178 ns (0 allocations: 0 bytes)
julia> create_constants(49,49,49);
julia> @btime mul!(mD_49_49, mA_49_49, mX_49_49);#multiplying with MKL
2.858 μs (0 allocations: 0 bytes)
julia> @btime mul!(D_49_49, A_49_49, X_49_49); #multiplying SIMDArrays
2.348 μs (0 allocations: 0 bytes)
julia> create_constants(128,128,128);
julia> @btime mul!(mD_128_128, mA_128_128, mX_128_128);#multiplying with MKL
36.124 μs (0 allocations: 0 bytes)
julia> @btime mul!(D_128_128, A_128_128, X_128_128); #multiplying SIMDArrays
35.616 μs (0 allocations: 0 bytes)
julia> create_constants(200,200,200);
julia> @btime mul!(mD_200_200, mA_200_200, mX_200_200);#multiplying with MKL
161.158 μs (0 allocations: 0 bytes)
julia> @btime mul!(D_200_200, A_200_200, X_200_200); #multiplying SIMDArrays
147.202 μs (0 allocations: 0 bytes)
To put these times in perspective, this is what I get from OpenBLAS:
julia> using LinearAlgebra, BenchmarkTools
julia> const D, A, X = randn(200,200), randn(200,200), randn(200,200);
julia> BLAS.set_num_threads(1); BLAS.vendor()
:openblas64
julia> @benchmark mul!(D, A, X)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     335.363 μs (0.00% GC)
  median time:      339.197 μs (0.00% GC)
  mean time:        345.639 μs (0.00% GC)
  maximum time:     488.796 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
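In throughput terms, using the minimum times for the 200 × 200 case (an M×N by N×P multiply takes 2*M*N*P flops):

2 * 200^3 / 147.202e-6 / 1e9 # SIMDArrays: ≈ 108.7 GFLOPS
2 * 200^3 / 161.158e-6 / 1e9 # MKL:        ≈  99.3 GFLOPS
2 * 200^3 / 335.363e-6 / 1e9 # OpenBLAS:   ≈  47.7 GFLOPS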
Beyond this size, this computer starts running out of L2 cache, and I haven’t tested that nearly as much. It’s likely I’ll need to implement prefetching.
julia> using jBLAS: blocking_structure, CACHE_SIZE, REGISTER_SIZE
julia> CACHE_SIZE .÷ sizeof(Float64) # (L1, L2, L3) sizes in Float64s
(4096, 131072, 1802240)
julia> 3 * 204 * 200 # Float64s in three 204 × 200 matrices: fits in the 131072-Float64 L2
122400
It uses the cache sizes to determine the blocking structure. For example, letting M, N, P = 749, 777, 832, we get:
julia> simd_len = REGISTER_SIZE ÷ sizeof(Float64)
8
julia> blocking_structure(cld(749,simd_len)*simd_len,777,832,Float64) # cld(x,vl)*vl rounds up to nearest vl
(((16, 129, 14), (256, 129, 252), (752, 777, 796)), 3)
The returned tuple describes the blocking structure. The first triple gives the shape of the compute kernel, 16 × 129 × 14; this is small enough to fit in the L1 cache. The next triple gives the size of the blocks for the L2 cache, which it iterates over while replacing only one chunk at a time, i.e., in (1,1), (2,1), (3,1), (3,2), (1,2), (2,2)... order rather than (1,1), (2,1), (3,1), (1,2), (2,2), (3,2)..., to reduce how much memory has to be swapped between the L1 and L2 caches.
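Reading each triple as M × N × P, you can sanity-check the fit: a block needs room for its pieces of A (M × N), X (N × P), and D (M × P) at once, and the counts land just under the cache sizes reported above:

16*129 + 129*14 + 16*14     # kernel footprint: 4094 Float64s, just under the 4096-Float64 L1
256*129 + 129*252 + 256*252 # L2 block footprint: 130044 Float64s, just under the 131072-Float64 L2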
Anyway, I think we can get a lot of benefits from a Julia BLAS library. As I hopefully demonstrated here, it offers the possibility of being dramatically faster than the current fast choices (StaticArrays, OpenBLAS) at small and, so far, medium sizes. I think it could also adapt to new architectures quickly, so long as they’re supported by CpuId.jl. And it could ease things like:
- Supporting different types. How fast could complex matrix multiplication be when using StructsOfArrays.jl? Note that things aren’t fully generic: a matrix whose elements are organized real1, imaginary1, real2, imaginary2... still isn’t an efficient layout for vectorization, but tricks like StructsOfArrays solve that (see the sketch after this list).
- We can easily implement a diversity of kernels, like (A - B) * (X + Y), which could let us make an efficient attempt at the Strassen algorithm.
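As a sketch of the first point (a hypothetical SplitComplexMatrix type, not StructsOfArrays.jl’s actual interface): storing the real and imaginary parts as two separate real matrices turns one complex multiplication into four real ones, each of which vectorizes cleanly.

struct SplitComplexMatrix{T<:Real} # hypothetical split re/im layout, for illustration
    re::Matrix{T}
    im::Matrix{T}
end

# (Ar + i*Ai)*(Br + i*Bi) = (Ar*Br - Ai*Bi) + i*(Ar*Bi + Ai*Br);
# every product on the right is a plain real matmul, so a fast real kernel does all the work.
function Base.:*(A::SplitComplexMatrix{T}, B::SplitComplexMatrix{T}) where {T}
    SplitComplexMatrix(A.re*B.re .- A.im*B.im, A.re*B.im .+ A.im*B.re)
end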
While the matrices are all currently typed based on size, all these matrix operations are quick to compile. I haven’t benchmarked the functions that figure out the blocking structure, but that sort of thing would need to be optimized if it were all done at runtime.