We can write an optimized BLAS library in pure Julia (please skip OP and jump to post 4)

Serious updates here!

I’ve now abandoned the recursive memory layout and am instead using:

jBLAS.jl
SIMDArrays.jl

They require Julia 1.0 or 0.7.

If anyone wants to try them, you may have to `]dev SIMDArrays`, and then, after activating SIMDArrays, `dev` or `add` jBLAS from GitHub. I’m not sure whether it will point to the correct place as is.
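
Something along these lines should work from the Pkg REPL (the repository URLs below are placeholders, so substitute the actual locations; `~/.julia/dev` is just Pkg’s default `dev` directory):

(v1.0) pkg> dev https://github.com/<username>/SIMDArrays.jl

(v1.0) pkg> activate ~/.julia/dev/SIMDArrays

(SIMDArrays) pkg> dev https://github.com/<username>/jBLAS.jl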

Following a suggestion by @RoyiAvital in Strange performance of a loop - #36 by RoyiAvital, I am padding the number of elements in the first axis to be a multiple of the SIMD vector length.
I use CpuId.jl for info on the SIMD vector width, cache sizes, etc. However, I haven’t actually tested it on other architectures yet to make sure performance is what it should be.
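
For example, the padded row count can be computed from the SIMD width CpuId.jl reports, roughly like this sketch (`pad_rows` is my own illustrative helper; `simdbytes` and `cachesize` are CpuId.jl queries):

using CpuId

const SIMD_LEN = simdbytes() ÷ sizeof(Float64)  # e.g. 8 with AVX-512

# Round the first-axis length up to the nearest multiple of the SIMD vector length.
pad_rows(M) = cld(M, SIMD_LEN) * SIMD_LEN

pad_rows(5)  # 8, so a 5 x 5 Float64 matrix gets 3 rows of padding per column
cachesize()  # (L1, L2, L3) data cache sizes in bytes
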
Before moving on to benchmarks against StaticArrays’ MMatrices and Intel MKL, I want to point out that this: Julia one third slower than ccall-ing `@code_native` assembly compiled with gcc? seems to be solved by declaring the matrices const and not interpolating them into the benchmarked expression. When I do that, the Julia functions become as fast as the equivalent ccall. So, on that note, a few helper functions:

using SIMDArrays, StaticArrays, BenchmarkTools, LinearAlgebra
function copyloop!(x, y) # Broadcasts slow to compile at the moment; will fix when I implement broadcasting!
    @boundscheck size(x) == size(y) || throw(BoundsError())
    @inbounds for i ∈ eachindex(x,y)
        x[i] = y[i]
    end
end
function Base.convert(::Type{MMatrix}, A::SIMDArrays.SizedSIMDMatrix{M,N,T}) where {M,N,T}
    out = MMatrix{M,N,T}(undef)
    copyloop!(out, A)
    out
end
function create_constants(M,N,P)
    dsym, asym, xsym = Symbol(:D_, M,:_,P), Symbol(:A_, M,:_,N), Symbol(:X_, N,:_,P)
    @eval const $dsym, $asym, $xsym = SIMDArrays.randsimd($M,$P), SIMDArrays.randsimd($M,$N), SIMDArrays.randsimd($N,$P)
    @eval const $(Symbol(:m,dsym)), $(Symbol(:m,asym)), $(Symbol(:m,xsym)) = convert(MMatrix, $dsym), convert(MMatrix, $asym), convert(MMatrix, $xsym)
    nothing
end

Notationally, let D = A * X, where D is M x P, A is M x N, and X is N x P.
Now on to the benchmarks (on a computer with AVX-512):

julia> create_constants(5,5,5)

julia> @btime mul!(mD_5_5, mA_5_5, mX_5_5) #multiplying StaticArrays MMatrices
  20.477 ns (0 allocations: 0 bytes)
5×5 MArray{Tuple{5,5},Float64,2,25}:
 1.66744  1.35667  1.34789  0.93313   1.13065 
 1.74005  1.59392  1.21913  0.906912  0.814635
 1.41563  1.76972  1.90589  0.88914   1.40555 
 1.79611  1.4072   1.09164  0.97643   1.03881 
 1.59758  1.87103  2.00119  0.914708  1.40032 

julia> @btime mA_5_5 * mX_5_5 #multiplying StaticArrays MMatrices creating stack allocated StaticMatrix
  23.261 ns (0 allocations: 0 bytes)
5×5 SArray{Tuple{5,5},Float64,2,25}:
 1.66744  1.35667  1.34789  0.93313   1.13065 
 1.74005  1.59392  1.21913  0.906912  0.814635
 1.41563  1.76972  1.90589  0.88914   1.40555 
 1.79611  1.4072   1.09164  0.97643   1.03881 
 1.59758  1.87103  2.00119  0.914708  1.40032 

julia> @btime mul!(D_5_5, A_5_5, X_5_5); D_5_5 #multiplying SIMDArrays
  6.582 ns (0 allocations: 0 bytes)
5×5 SIMDArrays.SizedSIMDArray{Tuple{5,5},Float64,2,8,40}:
 1.66744  1.35667  1.34789  0.93313   1.13065 
 1.74005  1.59392  1.21913  0.906912  0.814635
 1.41563  1.76972  1.90589  0.88914   1.40555 
 1.79611  1.4072   1.09164  0.97643   1.03881 
 1.59758  1.87103  2.00119  0.914708  1.40032 

julia> D_5_5.data
(1.66744138262064, 1.740048608459989, 1.415625303408632, 1.7961078018359855, 1.5975750881580377, 0.0, 0.0, 0.0, 1.356668103422378, 1.5939210629707454, 1.769720982419278, 1.4072025619380528, 1.8710277860227025, 0.0, 0.0, 0.0, 1.3478852456667783, 1.2191309243468347, 1.9058890396186112, 1.091636798493075, 2.001191775015244, 0.0, 0.0, 0.0, 0.9331304568989827, 0.9069123309829086, 0.8891401784342474, 0.9764296003231432, 0.9147076423296147, 0.0, 0.0, 0.0, 1.130651372750223, 0.814634811020856, 1.4055450158840463, 1.0388080685580772, 1.4003242549924564, 0.0, 0.0, 0.0)

In the last line, I print the tuple of elements backing the 5 x 5 SIMD matrix, so you can see that each column is padded with three zeros. I need to initialize the padded memory, because otherwise subnormals would make multiplication very slow. I didn’t try set_zero_subnormals(true), because I didn’t want to change global state.
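
To make the layout concrete, indexing into that padded tuple works like this (`padded_getindex` is just an illustrative helper, not part of the package):

# Column-major with a stride of 8 (the padded row count), so element (i, j)
# sits at linear position i + (j - 1) * 8 in the tuple printed above.
padded_getindex(data, i, j, padded_rows = 8) = data[i + (j - 1) * padded_rows]

padded_getindex(D_5_5.data, 2, 3)  # 1.2191..., matching D_5_5[2, 3]
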
Thanks to this padding, performance remains stable as we increase M. MMatrices themselves do a little better once we reach 8 x 5, probably because that shape is easier to vectorize:

julia> create_constants(8,5,5)
WARNING: redefining constant X_5_5
WARNING: redefining constant mX_5_5

julia> @btime mul!(mD_8_5, mA_8_5, mX_5_5); #multiplying StaticArrays MMatrices
  17.613 ns (0 allocations: 0 bytes)

julia> @btime mul!(D_8_5, A_8_5, X_5_5); #multiplying SIMDArrays
  6.587 ns (0 allocations: 0 bytes)

Make the matrices a little larger, and StaticArrays begins to struggle:

julia> create_constants(5,8,8)

julia> @btime mul!(mD_5_8, mA_5_8, mX_8_8); #multiplying StaticArrays MMatrices
  59.367 ns (0 allocations: 0 bytes)

julia> @btime mul!(D_5_8, A_5_8, X_8_8); #multiplying SIMDArrays
  14.363 ns (0 allocations: 0 bytes)

Now at the boundary of where MMatrices fall back to a BLAS call:

julia> BLAS.set_num_threads(1); BLAS.vendor()
:mkl

julia> create_constants(14,13,14)

julia> @btime mul!(mD_14_14, mA_14_13, mX_13_14); #multiplying StaticArrays MMatrices
  441.764 ns (0 allocations: 0 bytes)

julia> @btime mul!(D_14_14, A_14_13, X_13_14); #multiplying SIMDArrays
  55.935 ns (0 allocations: 0 bytes)

julia> create_constants(14,14,14)
WARNING: redefining constant D_14_14
WARNING: redefining constant mD_14_14

julia> @btime mul!(mD_14_14, mA_14_14, mX_14_14); #multiplying with MKL
  227.986 ns (0 allocations: 0 bytes)

julia> @btime mul!(D_14_14, A_14_14, X_14_14); #multiplying SIMDArrays
  59.304 ns (0 allocations: 0 bytes)

Now, trying a few different sizes:

julia> create_constants(24,24,24);

julia> @btime mul!(mD_24_24, mA_24_24, mX_24_24);#multiplying with MKL
  404.270 ns (0 allocations: 0 bytes)

julia> @btime mul!(D_24_24, A_24_24, X_24_24); #multiplying SIMDArrays
  247.178 ns (0 allocations: 0 bytes)

julia> create_constants(49,49,49);

julia> @btime mul!(mD_49_49, mA_49_49, mX_49_49);#multiplying with MKL
  2.858 μs (0 allocations: 0 bytes)

julia> @btime mul!(D_49_49, A_49_49, X_49_49); #multiplying SIMDArrays
  2.348 μs (0 allocations: 0 bytes)

julia> create_constants(128,128,128);

julia> @btime mul!(mD_128_128, mA_128_128, mX_128_128);#multiplying with MKL
  36.124 μs (0 allocations: 0 bytes)

julia> @btime mul!(D_128_128, A_128_128, X_128_128); #multiplying SIMDArrays
  35.616 μs (0 allocations: 0 bytes)

julia> create_constants(200,200,200);

julia> @btime mul!(mD_200_200, mA_200_200, mX_200_200);#multiplying with MKL
  161.158 μs (0 allocations: 0 bytes)

julia> @btime mul!(D_200_200, A_200_200, X_200_200); #multiplying SIMDArrays
  147.202 μs (0 allocations: 0 bytes)

To put these times in perspective, this is what I get from OpenBLAS:

julia> using LinearAlgebra, BenchmarkTools

julia> const D, A, X = randn(200,200), randn(200,200), randn(200,200);

julia> BLAS.set_num_threads(1); BLAS.vendor()
:openblas64

julia> @benchmark mul!(D, A, X)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     335.363 μs (0.00% GC)
  median time:      339.197 μs (0.00% GC)
  mean time:        345.639 μs (0.00% GC)
  maximum time:     488.796 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Beyond this size, this computer starts running out of L2 cache, and I haven’t tested that nearly as much. It’s likely I’ll need to implement prefetching.

julia> using jBLAS: blocking_structure, CACHE_SIZE, REGISTER_SIZE

julia> CACHE_SIZE .÷ sizeof(Float64)
(4096, 131072, 1802240)

julia> 3 * 204 * 200
122400

jBLAS uses these cache sizes to determine the blocking structure; the 3 * 204 * 200 = 122,400 above is checking that the three matrices from the 200 x 200 benchmark still fit within the 131,072-double L2 cache. For example, letting M,N,P = 749,777,832 we get:

julia> simd_len = REGISTER_SIZE ÷ sizeof(Float64)
8

julia> blocking_structure(cld(749,simd_len)*simd_len,777,832,Float64) # cld(x,vl)*vl rounds up to nearest vl
(((16, 129, 14), (256, 129, 252), (752, 777, 796)), 3)

This describes the blocking structure. The first triple gives the shape of the compute kernel, 16 x 129 x 14, which is small enough to fit in the L1 cache. The second triple gives the size of the blocks for the L2 cache, which it iterates over while replacing only one chunk at a time, i.e. in (1,1),(2,1),(3,1),(3,2),(1,2),(2,2)... order rather than (1,1),(2,1),(3,1),(1,2),(2,2),(3,2)..., to help reduce how much memory has to be swapped between the L1 and L2 caches.
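
As a rough sanity check on those numbers (assuming the intent is that the A panel, the X panel, and the D tile at each level fit in the corresponding cache, counted in Float64 elements):

L1, L2, L3 = CACHE_SIZE .÷ sizeof(Float64)  # (4096, 131072, 1802240)

# L1 kernel, 16 x 129 x 14: A panel + X panel + D tile
16*129 + 129*14 + 16*14      # 4094 <= 4096, just fits in L1

# L2 block, 256 x 129 x 252
256*129 + 129*252 + 256*252  # 130044 <= 131072, just fits in L2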

Anyway, I think we can get a lot of benefit from a Julia BLAS library.
As I hopefully demonstrated here, it offers the possibility of being dramatically faster than the current fast choices (StaticArrays, OpenBLAS) at small and, so far as I’ve tested, medium sizes.
I think it also has the potential to adapt to new architectures quickly, so long as they’re supported by CpuId.jl. It would also make it easier to do things like:

  1. Supporting different types. How fast could we get complex matrix multiplication to be when using StructsOfArrays.jl? Things aren’t fully generic here: a matrix whose elements are stored real1, imaginary1, real2, imaginary2... still isn’t an efficient layout for vectorization, but tricks like StructsOfArrays solve that (see the sketch after this list).
  2. Easily implementing a diversity of kernels, like (A - B) * (X + Y), which could let us take an efficient shot at the Strassen algorithm.
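
On point 1, here is a minimal sketch of the kind of split (structure-of-arrays) complex layout I mean; the SplitComplexMatrix type is purely illustrative, not something from StructsOfArrays or SIMDArrays:

# Store real and imaginary parts as separate matrices, so each of the four products
# below can go through the vectorized real-matrix kernel.
struct SplitComplexMatrix{T}
    re::Matrix{T}
    im::Matrix{T}
end

function Base.:*(A::SplitComplexMatrix, B::SplitComplexMatrix)
    # (Ar + i*Ai)(Br + i*Bi) = (Ar*Br - Ai*Bi) + i*(Ar*Bi + Ai*Br)
    SplitComplexMatrix(A.re * B.re .- A.im * B.im,
                       A.re * B.im .+ A.im * B.re)
end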

While the matrices are all currently typed based on their size, all of these matrix operations are quick to compile. I haven’t benchmarked the functions that figure out the blocking structure, but that sort of thing would need to be optimized if it were all done at runtime instead.
