@inbounds: is the compiler now so smart that this is no longer necessary?

Did you disabled multi threading inside the BLAS routines? I noticed that at least in 0.6 there is a significant overhead if BLAS tries to multi thread.

No, I haven’t tried disabling multithreading.

My first example was with disabled multithreading, yet it was still much faster. When I re-enabled it, it got much faster still at larger sizes.
I’m on 0.7, and OpenBLAS now seems smarter about being single threaded for smaller matrix sizes. Eg, as far as I could tell it ran single threaded here, regardless of what BLAS.set_num_threads was set to.

julia> BLAS.set_num_threads(1)

julia> gradN = rand(60, 40);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  11.014 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  8.425 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  9.289 μs (0 allocations: 0 bytes)

julia> BLAS.set_num_threads(10)

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  10.994 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  8.420 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  9.277 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  37.251 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  52.913 μs (0 allocations: 0 bytes)

Times were the same, and I never saw CPU usage exceed 100%.

Even with the generic installs, OpenBLAS is supposed to detect your processor and pick the appropriate kernel, so it shouldn’t matter whether you built it from source or not.

BTW, @PetrKryslUCSD, if you’re primarily concerned about small sizes and willing to use StaticArrays:

julia> using StaticArrays

julia> gradN = @MMatrix rand(3, 2);

julia> Ke = @MMatrix fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  5.714 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  8.696 ns (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  8.697 ns (0 allocations: 0 bytes)

MMatrix also defaults to BLAS calls at larger sizes, but many of the other StaticArrays functions (eg, @MMatrix randn(m, n) ) aren’t optimized for big matrices, so lots of things would randomly be painfully slow.

1 Like

This includes BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke), which eventually becomes the fastest method. (In my application gradN is not likely to be bigger than 64x3.)

Ah, okay. I think the skinniness of your matrix is part of the problem.
I would’ve called a 14x14 matrix “small”, and that has more elements than the 64x3. (Although X * X’ would be much larger for the latter).

3 may also be an awkward width. Eg, on my computer, BLAS takes about the same long for 64x3 as it does for 64x4, while the loop versions are of course about 33% faster for the former:

julia> gradN = rand(64, 3);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  6.355 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  2.933 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  2.800 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  3.731 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  4.930 μs (0 allocations: 0 bytes)

julia> gradN = rand(64, 4);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
  6.355 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  3.100 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  2.981 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  4.935 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
  6.628 μs (0 allocations: 0 bytes)

Very interesting data. Thanks! And thanks for your suggestions. A lot of food for thought…

Are you using Julia + Intel MKL?
If you use OpenBLAS you’re leaving a lot of performance on the table.

You should try using Intel MKL.
I wonder if Julia uses Intel MKL with low Overhead (Intel MKL has a calling method which has less overhead and gets great performance for small matrices).

No MKL. I am on 0.7 at the moment.

Here, I am comparing to MKL:

julia> using LinearAlgebra

julia> BLAS.vendor()
:mkl

julia> versioninfo()
Julia Version 0.7.0-beta.200
Commit 9277d3af2d (2018-07-07 19:47 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

Currently, SIMD.jl is broken on master: https://github.com/JuliaLang/julia/issues/27694
So pardon this work around:

struct Vec{N,T} <: AbstractVector{T}
    # data::NTuple{N,Core.VecElement{T}}
    data::NTuple{N,T}
    Vec(data::Vararg{T,N}) where {N,T} = new{N,T}(data)
end
@inline Base.getindex(x::Vec, i) = x.data[i]
@inline Base.length(::Vec{N}) where N = N
@inline Base.size(::Vec{N}) where N = (N,)
@inline Base.eltype(::Vec{N,T}) where {N,T} = T
@inline function vload(::Type{Vec{N,T}}, x::AbstractArray{T}, i) where {N,T}
    unsafe_load(Base.unsafe_convert(Ptr{Vec{N,T}}, pointer(x)) + sizeof(T)*i)
end
@inline function vstore!(x::AbstractArray{T}, v::Vec{N,T}, i) where {N,T}
    unsafe_store!(Base.unsafe_convert(Ptr{Vec{N,T}}, pointer(x)) + sizeof(T)*i, v)
end
function create_quote()
    q = quote
        @inbounds @fastmath begin
            $(Expr(:meta, :inline))
            Vec()
        end
    end
    q, q.args[2].args[3].args[3].args[4].args
end
@generated function Base.fma(x::Vec{N,T}, y::T, z::Vec{N,T}) where {N,T}
    q, qa = create_quote()
    for n ∈ 1:N
        push!(qa, :(x[$n] * y + z[$n]))
    end
    q
end

LLVM is good enough at autovectorizing that we don’t really need to be explicit anyway.

Here is a new version that uses the above functions; I probably should have just called it Syrk:

@generated function add_mggt_ut_only!(Ke, gradN, mult, ::Val{mdim}) where mdim
    L = 8
    quote
        @assert size(Ke, 1) == size(Ke, 2)
        Kedim = size(Ke, 1)
        # nne, mdim = size(gradN)
        V = Vec{$L,Float64}
        SIMD_loop_max = Kedim ÷ $L * $L -1
        @inbounds for c = 1:Kedim# ÷ $L * $L
            Base.Cartesian.@nexprs $mdim m -> gradNc_m = gradN[c, m] * mult
            for r = 0:$L:min(c-1,SIMD_loop_max)
                Ktemp = vload(V, Ke, r + (c-1)*Kedim)
                Base.Cartesian.@nexprs $mdim m -> (Ktemp = fma(vload(V, gradN, r+(m-1)*Kedim), gradNc_m, Ktemp))
                vstore!(Ke, Ktemp, r + (c-1)*Kedim)
            end
        end
        remainder = Kedim % $L
        if remainder > 0
            bound = Kedim-remainder+1
            @inbounds for c = bound:Kedim
                Base.Cartesian.@nexprs $mdim m -> gradNc_m = gradN[c, m] * mult
                for r = bound:c
                    Ktemp = Ke[r,c]
                    Base.Cartesian.@nexprs $mdim m -> (Ktemp = fma(gradN[r,m], gradNc_m, Ktemp))
                    Ke[r,c] = Ktemp
                end
            end
        end
        return true
    end
end

It’s kind of naive, and I wrote it assuming that mdim is very small, For example, 3.
It also gives jagged results.
The upper triangle is complete, but there’s lots of stuff extending into the lower triangle. The numbers there aren’t wrong, I just didn’t bother continuing in the lower triangle any further beyond completing a vector.

It’s the version taking a Val argument, for the number of columns in gradN:

julia> using LinearAlgebra, BenchmarkTools

julia> mult = 1.4
1.4

julia> gradN = rand(64, 3);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  4.397 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
  688.748 ns (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  1.425 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  713.936 ns (0 allocations: 0 bytes)

julia> gradN = rand(63, 3);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  4.262 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
  780.188 ns (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  1.555 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  907.872 ns (0 allocations: 0 bytes)

julia> gradN = rand(65, 3);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  4.521 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
  853.724 ns (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  1.679 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  1.094 μs (0 allocations: 0 bytes)

EDIT:
Comparing to OpenBLAS:

julia> mult = 1.4
1.4

julia> gradN = rand(64, 3);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  4.396 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
  741.113 ns (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  2.922 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  2.778 μs (0 allocations: 0 bytes)

julia> gradN = rand(63, 3);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  4.255 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
  745.946 ns (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  2.993 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  2.909 μs (0 allocations: 0 bytes)

julia> gradN = rand(65, 3);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
  4.522 μs (0 allocations: 0 bytes)

julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
  705.270 ns (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
  3.076 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
  2.996 μs (0 allocations: 0 bytes)
1 Like

Very cool!

Does Julia use the Low Overhead Mode of Intel MKL for small matrices?

No, apparently not:

The link does say that for C/C++, all you have to do is add -DMKL_DIRECT_CALL or -DMKL_DIRECT_CALL_SEQ. So, maybe we could try adding one of those to MKL_LDFLAGS and building from source later to test it?


Not sure if that’s correct / that’d work.

That would be great of you could pull it out and we’ll be able to see what are the gains made.

Your code is clearly the fastest in my tests.
image
For your reference: this is how the curves were generated

    twi = @belapsed add_mggt_ut_only!($Ke, $gradN, 1.0)
    tsn = @belapsed 1.0 * ($gradN*Transpose($gradN))
    tmn = @belapsed 1.0 * mul!($Ke, $gradN, Transpose($gradN))
    two = @belapsed add_mggt_ut_only_wo!($Ke, $gradN, 1.0)
    tsd = @belapsed 1.0 .* ($gradN*Transpose($gradN))
    tmd = @belapsed 1.0 .* mul!($Ke, $gradN, Transpose($gradN))
    tgd = @belapsed BLAS.gemm!('N', 'T', 1.0, $gradN, $gradN, 0.0, $Ke)
    tw2 = @belapsed add_mggt_ut_only_g!($Ke, $gradN, 1.0, Val(2))

Performance of the generated code is excellent, I only wish it was easier to read :frowning:

1 Like

What is striking on the graph that includes the generated-loops function of @Elrod is that it yields excellent performance both for very small matrices and for large matrices. This is to be contrasted with gemm!, which apparently has a significant overhead that eventually gets amortized, but is nevertheless significant for small matrix sizes. (Curiously the plain matrix-matrix multiplication A*B, and the mul! function both seem to have an overhead on top of gemm!. I don’t know yet where. @andreasnoack perhaps might be able to help?)

So given the above, I am led to believe that a pure-Julia blas-like implementations might be realistic for things like matrix-matrix multiplication. For larger matrices the packing to facilitate cache access are the smart thing to do, but the packing should probably be turned off for small matrices. BLAS cannot do that: hence the disappointing performance for small matrices. That’s my thinking, and I wonder if the experts would care to chime in?

One more update: the matrices can be StaticArrays (MMatrix). Unfortunately, only for small matrices. If one could decide dynamically (with type stability) whether it should be an MMatrix or a regular matrix, the matrix multiplication could be performed even faster for small matrices:

The bottom curve is for the matrix multiplication carried out for MMatrices (up to size 10). EDIT: one of the matrices (gradN) is an MMatrix always. The square matrix Ke is an MMatrix only up to size 10x10.

I compiled it with the flag -DMKL_DIRECT_CALL added, but did not see any difference. So I am guessing that isn’t how you’re supposed to do that. Can’t say I’m very familiar with the build system…

@PetrKryslUCSD
Yeah, I need to work on the readability of the code. :frowning:
I should spend some time eventually to come up with a bunch of abstractions, and then either try to implement them as macros, or at least as functions that modify a quote in a clear way, to try and make it easier to read, write, and maintain.

In that particular case, L is the unrolled length. If Kedim is large, you would want that to be 4x your vector length. For a computer with avx, that’d be 44 = 16, and with avx512, that’d be 84 = 32.
I chose 8, because that seemed better for Kedim values of around 64.

Of course, if we knew Kedim at compile time, we could just choose appropriately – or, ideally, simplify the expression a bit so that the compiler automatically chooses at compile time. (Remember on one of my computers, the compiler chose 32 for the @inbounds version, one of the reasons it was slow at small sizes).

Whenever you see @nexprs or Base.Cartesian.@nexprs, you can think of that as an unrolled for loop.

julia> using Base.Cartesian

julia> @macroexpand @nexprs 4 i -> w[i] = x[i] * y[1+5*(i-1)] + z[i]
quote
    w[1] = x[1] * y[1] + z[1]
    w[2] = x[2] * y[6] + z[2]
    w[3] = x[3] * y[11] + z[3]
    w[4] = x[4] * y[16] + z[4]
end

Basic explanation of what this is doing / what I was thinking:

        @inbounds for c = 1:Kedim# ÷ $L * $L
            Base.Cartesian.@nexprs $mdim m -> gradNc_m = gradN[c, m] * mult
            for r = 0:$L:min(c-1,SIMD_loop_max)
                Ktemp = vload(V, Ke, r + (c-1)*Kedim)
                Base.Cartesian.@nexprs $mdim m -> (Ktemp = fma(vload(V, gradN, r+(m-1)*Kedim), gradNc_m, Ktemp))
                vstore!(Ke, Ktemp, r + (c-1)*Kedim)
            end
        end

An element of Ke equals the dot product of two rows of gradN. While avx512 and GPUs can “gather” various noncontigous elements into a vector for SIMD-style operations, normal avx cannot (and I think it is slow for avx512). What is fast is loading a chunk of numbers one after the other, or broadcasting a single number across a register (ie, loading one number into every element).

a = [1.0, 2.0, 3.0, 4.0]

So for avx, where we can load 4 doubles at a time, we could either load 1.,2.,3.,4., or we could fill a register with 4 copies of any of these numbers, like 3.,3.,3.,3.. Similar story for storage – we can efficiently store a vector contiguously, but splitting it up is slow. So, let’s take a look at Ke

Ke += [gradN[1,:]'*gradN[1,:] gradN[1,:]'*gradN[2,:] ...;
        gradN[2,:]'*gradN[1,:] gradN[2,:]'*gradN[2,:] ... ;
        gradN[3,:]'*gradN[1,:] gradN[3,:]'*gradN[2,:] ... ;
        gradN[4,:]'*gradN[1,:] gradN[4,:]'*gradN[2,:] ... ;]

So, to calculate the first 4 elements of column 1 of Ke, what we want is:

Ke[1:4,1] .+= gradN[1:4,1] .* gradN[1,1] .+ gradN[1:4,2] .* gradN[1,2] .+ gradN[1:4,3] .* gradN[1,3] .+...

Keeping that in mind, the loops, in order from outer to inner:

  1. Fills in 1 column of Ke at a time (but we stop as soon as we both finish the upper triangle, and complete a vector). Here, we pull out the mdim elements gradN[c,1], gradN[c,2], …,gradN[c,mdim]. (The compiler broadcasts automatically). I also moved the multiplication by mult here.
Base.Cartesian.@nexprs $mdim m -> gradNc_m = gradN[c, m] * mult
  1. Loads one set of vectors (a total of L elements) at a time from column c. With L=8 and avx2, that would be gradN[1:4,c] and gradN[5:8,c]. I don’t have to be explicit about the actual length, it just should be some multiple of the SIMD length. Multiple operations can overlap on the same core, so doing 16 will not take 4x longer than doing 4. But if Kedim == 29, L==4 takes care of all but the last (29%4 == 1), while it leaves a whole lot of non-vectorized elements if L == 16.
    How much we actually want to unroll it depends on how big we think it is. Realistically, because we’re only doing the upper triangle, it might make sense to split the outer c loop into pieces, with different values of L for each one. Especially if you decide to make Kedim known at compile time (eg, by using MMatrices), in which case any extra loops will get compiled away when unnecessary.
  2. For the inner unrolled loop, we just use fma instructions across the columns
Ktemp = vload(V, Ke, r + (c-1)*Kedim) # loads Ke[(1:L) + L*iter,c]
Base.Cartesian.@nexprs $mdim m -> (Ktemp = fma(vload(V, gradN, r+(m-1)*Kedim), gradNc_m, Ktemp))
vstore!(Ke, Ktemp, r + (c-1)*Kedim) # stores Ke[(1:L) + L*iter,c]
# The inner `@nexprs` part is doing this:
Ke[1:4,1] .+= gradN[1:4,1] .* gradN[1,1] .+ gradN[1:4,2] .* gradN[1,2] .+ gradN[1:4,3] .* gradN[1,3] .+... 

Yes, I think we can write a BLAS in Julia.
I’ve been playing around with it for a little while. You can check out some work here:

Major issues:

  1. It only works for matrices that are a multiple of the kernel size right now. I haven’t filled in the missing edges yet.
  2. Only works for MMatrices from StaticArrays. Once I finish the rest, I’ll have a non-generated version that works for regular old arrays (and has probably just a tiny bit more overhead, because of calculating some things at runtime).
  3. I don’t have played with kernels involving matrix transposes yet.
  4. I haven’t implemented a recursive algorithm yet. I think if I load submatrices recursively, I might be able to scale better. Current scaling is a lot worse than O(n^3) on my Intel processor with avx512, but only a tiny bit worse with Ryzen. The avx512 has 4x the throughput as Ryzen per core, so throughput / memory speed is a lot worse with Intel.
  5. No threading!

That said, it looks good! Versus OpenBLAS with avx-512, modest sized matrices:

julia> using OhMyREPL, jBLAS, BenchmarkTools, LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> A128_128 = mrandn(128,128); # WAY faster than @MMatrix randn(M, N) for M,N > TINY_NUMBER

julia> B128_126 = mrandn(128,126);

julia> C128_126 = mrandn(128,126);

julia> jmul!(C128_126, A128_128, B128_126) # Confirming the answer is correct
128×126 StaticArrays.MArray{Tuple{128,126},Float64,2,16128}:
  -7.01447    -2.98439    -5.74117    -9.92775   -1.63882   -17.4866    …    2.50444   -15.7145     -2.52175    -9.72732     8.23362    -0.0480611
 -13.3085      8.11033    11.9918     -8.50466  -16.7253     10.7143       -18.5143     -2.83639    -7.44774    20.3162     21.9342     11.5031   
  -7.28199    -4.51168    -4.77706     1.99367   11.8847     -2.24961        6.93923    19.1472     -2.97989    11.0127     -9.27231   -12.1461   
   2.60148    -8.7429    -13.6254     -4.60391   -4.28139     9.45661       -8.99027    -8.84954    17.4719      5.47737    -0.729445   -0.889929 
   7.92216    16.7981    -25.3232      7.79537   -5.79957    15.1526         1.85937    -5.81069     6.26211     6.36528   -10.2083    -13.9899   
   0.979052   -3.23154    22.9183      1.58329   17.2893     -4.77779   …   -1.71329    -0.345357   10.1351     11.725      17.9994      4.13137  
 -12.4654     -0.792605   -6.28939    11.8703     5.06421   -13.3586        17.5039      4.64551     0.710493  -13.2521      5.19574     4.90359  
   8.91004    13.639       3.06454     6.00014  -12.2289     20.0813        -2.97452    -0.23808     4.63956    -2.57015     1.46739   -23.3573   
 -16.5464     -2.53216    -1.79439     8.16519   -1.01636    -7.23788        5.60839    11.4347      3.56887     2.96409    -9.67572    -7.79263  
  16.0071     43.5603      4.85504    23.9241   -18.9683     10.9092        20.7221    -22.2934     -6.03801     9.84488    -8.57055   -22.9826   
  -5.39735    -6.35583     5.50225    -7.39742   -4.2174     -3.1449    …    9.08321    17.6362    -24.2872    -12.1334     -9.25208    18.5631   
  -0.732309   -1.07189     1.65641    14.5927    -6.79831    15.3076        21.3605     -1.77275     7.6672    -16.556     -31.9151     -6.92279  
  12.066     -17.7769     -4.3234     26.788     16.565      -6.90403      -12.7472     20.2385     -8.46922     5.89168     1.0592      7.93109  
  11.4157     -5.31638    -0.804631   -5.265     -7.54981    12.7636         5.82927    -2.67509    13.0376     11.4717     -7.15938     4.45602  
 -12.1544      7.99333    13.4734    -18.5341    -3.25063     4.44092       16.3537      8.65397     6.89471    -5.06701     0.826184   -2.90417  
   ⋮                                                          ⋮         ⋱    ⋮                                                           ⋮        
   3.99748   -11.4429      2.66474   -24.4344    -2.51918    -1.79083        1.51211    15.0584    -23.3553      0.185678   13.1895     21.9612   
   6.95632    18.9656    -28.5299     12.5857    17.2015      8.48111       -7.30871    -0.775281   10.3668      2.97811     0.366781  -15.2664   
   0.95553   -11.2707      1.44373    10.3198   -15.4154     -4.585     …   17.7993     23.9214      3.86748     8.52783    17.1222      6.19255  
   7.9361      4.20605     9.85275   -11.5198     2.57681    -5.64149       -4.27823    -8.12214    -8.17309    -6.81521    -8.3202     -8.26316  
  -6.06597     3.94461    -9.33761     3.13537    8.77722     1.36968       13.7093     11.4851      1.92219    -3.03784   -16.6991     -4.27387  
  -5.22723   -11.1999     -3.65735    -0.39027   14.0611     -5.49591        8.30515     2.11561    -0.085943    2.98077    -9.54455    -1.59257  
 -17.5957     14.5879     -3.11054    -3.93381   -3.86949    15.4347        17.1275     16.5219     22.5019      5.06741   -12.1815      5.4818   
  14.2494      3.93928     0.326678   -1.06051    5.44846    23.2466    …   -9.01874    -3.71429    15.6083     16.0184     -4.47404    -3.55087  
  -1.22372     7.79477    21.0122     20.1065    12.6066     -4.94802        5.17584    10.2214    -16.1313      7.29539     4.35832     6.49918  
   8.16857     2.32976    -8.22107     8.23127  -10.5885     -1.98484        0.685584   -8.93469    13.0756     -5.6306      2.49534    -8.93337  
   9.36717   -10.5582    -12.3869      2.48226    0.942078  -14.5936        -4.79657   -15.6375     -1.2756      0.011122   22.9251     18.1269   
   0.171453   15.7406     -3.94504     6.73699  -27.3994     19.8036        19.7026     -5.57216    21.47       10.456      -7.53643     6.64545  
  -1.82776    -3.34233    -8.33704   -11.4722    -1.40182     0.569771  …    4.55985    13.976      -4.01863    -4.04004    -6.5058      1.75928  
 -13.3649     -0.879979    8.87948   -16.163      0.670733    0.160372      10.6509     14.8881      8.81195    -9.0987      1.53641    25.0194   
  -1.51149    12.2888     13.5032      4.0348   -10.9137     -2.1944        11.5437    -13.8045     15.9694     -5.84135   -14.469      -6.024    

julia> mul!(C128_126, A128_128, B128_126)
128×126 StaticArrays.MArray{Tuple{128,126},Float64,2,16128}:
  -7.01447    -2.98439    -5.74117    -9.92775   -1.63882   -17.4866    …    2.50444   -15.7145     -2.52175    -9.72732     8.23362    -0.0480611
 -13.3085      8.11033    11.9918     -8.50466  -16.7253     10.7143       -18.5143     -2.83639    -7.44774    20.3162     21.9342     11.5031   
  -7.28199    -4.51168    -4.77706     1.99367   11.8847     -2.24961        6.93923    19.1472     -2.97989    11.0127     -9.27231   -12.1461   
   2.60148    -8.7429    -13.6254     -4.60391   -4.28139     9.45661       -8.99027    -8.84954    17.4719      5.47737    -0.729445   -0.889929 
   7.92216    16.7981    -25.3232      7.79537   -5.79957    15.1526         1.85937    -5.81069     6.26211     6.36528   -10.2083    -13.9899   
   0.979052   -3.23154    22.9183      1.58329   17.2893     -4.77779   …   -1.71329    -0.345357   10.1351     11.725      17.9994      4.13137  
 -12.4654     -0.792605   -6.28939    11.8703     5.06421   -13.3586        17.5039      4.64551     0.710493  -13.2521      5.19574     4.90359  
   8.91004    13.639       3.06454     6.00014  -12.2289     20.0813        -2.97452    -0.23808     4.63956    -2.57015     1.46739   -23.3573   
 -16.5464     -2.53216    -1.79439     8.16519   -1.01636    -7.23788        5.60839    11.4347      3.56887     2.96409    -9.67572    -7.79263  
  16.0071     43.5603      4.85504    23.9241   -18.9683     10.9092        20.7221    -22.2934     -6.03801     9.84488    -8.57055   -22.9826   
  -5.39735    -6.35583     5.50225    -7.39742   -4.2174     -3.1449    …    9.08321    17.6362    -24.2872    -12.1334     -9.25208    18.5631   
  -0.732309   -1.07189     1.65641    14.5927    -6.79831    15.3076        21.3605     -1.77275     7.6672    -16.556     -31.9151     -6.92279  
  12.066     -17.7769     -4.3234     26.788     16.565      -6.90403      -12.7472     20.2385     -8.46922     5.89168     1.0592      7.93109  
  11.4157     -5.31638    -0.804631   -5.265     -7.54981    12.7636         5.82927    -2.67509    13.0376     11.4717     -7.15938     4.45602  
 -12.1544      7.99333    13.4734    -18.5341    -3.25063     4.44092       16.3537      8.65397     6.89471    -5.06701     0.826184   -2.90417  
   ⋮                                                          ⋮         ⋱    ⋮                                                           ⋮        
   3.99748   -11.4429      2.66474   -24.4344    -2.51918    -1.79083        1.51211    15.0584    -23.3553      0.185678   13.1895     21.9612   
   6.95632    18.9656    -28.5299     12.5857    17.2015      8.48111       -7.30871    -0.775281   10.3668      2.97811     0.366781  -15.2664   
   0.95553   -11.2707      1.44373    10.3198   -15.4154     -4.585     …   17.7993     23.9214      3.86748     8.52783    17.1222      6.19255  
   7.9361      4.20605     9.85275   -11.5198     2.57681    -5.64149       -4.27823    -8.12214    -8.17309    -6.81521    -8.3202     -8.26316  
  -6.06597     3.94461    -9.33761     3.13537    8.77722     1.36968       13.7093     11.4851      1.92219    -3.03784   -16.6991     -4.27387  
  -5.22723   -11.1999     -3.65735    -0.39027   14.0611     -5.49591        8.30515     2.11561    -0.085943    2.98077    -9.54455    -1.59257  
 -17.5957     14.5879     -3.11054    -3.93381   -3.86949    15.4347        17.1275     16.5219     22.5019      5.06741   -12.1815      5.4818   
  14.2494      3.93928     0.326678   -1.06051    5.44846    23.2466    …   -9.01874    -3.71429    15.6083     16.0184     -4.47404    -3.55087  
  -1.22372     7.79477    21.0122     20.1065    12.6066     -4.94802        5.17584    10.2214    -16.1313      7.29539     4.35832     6.49918  
   8.16857     2.32976    -8.22107     8.23127  -10.5885     -1.98484        0.685584   -8.93469    13.0756     -5.6306      2.49534    -8.93337  
   9.36717   -10.5582    -12.3869      2.48226    0.942078  -14.5936        -4.79657   -15.6375     -1.2756      0.011122   22.9251     18.1269   
   0.171453   15.7406     -3.94504     6.73699  -27.3994     19.8036        19.7026     -5.57216    21.47       10.456      -7.53643     6.64545  
  -1.82776    -3.34233    -8.33704   -11.4722    -1.40182     0.569771  …    4.55985    13.976      -4.01863    -4.04004    -6.5058      1.75928  
 -13.3649     -0.879979    8.87948   -16.163      0.670733    0.160372      10.6509     14.8881      8.81195    -9.0987      1.53641    25.0194   
  -1.51149    12.2888     13.5032      4.0348   -10.9137     -2.1944        11.5437    -13.8045     15.9694     -5.84135   -14.469      -6.024    

The benchmark:

julia> @benchmark jmul!($C128_126, $A128_128, $B128_126) #Pure Julia
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     40.205 μs (0.00% GC)
  median time:      40.398 μs (0.00% GC)
  mean time:        41.286 μs (0.00% GC)
  maximum time:     89.556 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C128_126, $A128_128, $B128_126) #OpenBLAS
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     100.334 μs (0.00% GC)
  median time:      100.649 μs (0.00% GC)
  mean time:        102.273 μs (0.00% GC)
  maximum time:     174.168 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Looking at a very small size, multiplying 16x32 * 32x14 matrices takes about 140ns. (I said 65 ns earlier, but that was because of a bug.)
Perfect O(n^3) scaling would thus give us

julia> (128 / 16) * (128 / 32) * (128 / 14) * 140 / 1e3
40.96

microseconds, so more or less perfect scaling.
MKL had a minimum of 36.549 μs, so not only are they a bit faster, but maybe my kernel is a little suboptimal?
That would imply roughly 125ns for the kernel, so they must be doing something a little different.

Bigger matrix set up (and answer confirmation):

julia> A800_900 = mrandn(16*50, 900);

julia> B900_840 = mrandn(900,14*60);

julia> C800_840 = mrandn(16*50,14*60);

julia> jmul!(C800_840, A800_900, B900_840)
800×840 StaticArrays.MArray{Tuple{800,840},Float64,2,672000}:
  38.0743    -37.3638     -23.3069     43.6012   -14.9798   -22.6108   …    6.68674   31.0339    22.0678    -25.335    -12.3134    34.2845    
  -0.771707  -24.4486     -46.6449     37.0407    24.6849   -22.056         4.19191    6.73854  -18.4685     90.6574    34.1527   -41.9439    
  -4.7973    -56.2395      -0.369894   21.4934   -30.8941   -20.8357       21.4338    11.461    -26.1999     -8.51833   22.5552   -52.3642    
  21.6411     -4.72307     47.6128     16.3918    29.8575   -22.1772        6.87733    3.47535  -41.9409     -3.14536   39.851      6.89992   
  16.9652     -0.0380062   90.3321    -48.3799    55.2641     5.03981      -8.4407     9.44483  -10.5455     11.4134    22.1736    28.6386    
   1.10867    32.7442      -9.571      -4.61297   26.8078    35.2818   …  -25.8284    -1.502      0.323376   50.3406    -9.49194  -12.6995    
  53.7033     -2.69979     29.6596     21.8752   -42.6504   -56.8507      -44.3737   -18.3475    -9.93175   -26.6298    -7.92967   33.2311    
 -31.6189     14.6708       7.66491    34.2681    22.136    -34.5229       20.1327     1.23482   11.1025     16.8833   -20.2968    51.6909    
 -15.412     -11.2317      23.9317     24.7645   -29.6448   -13.797         3.52927   -8.74718    4.65412    42.6386    -6.51217    0.00954396
   3.85964   -28.4584     -33.407     -20.6061   -10.6759    -0.94623       4.65821    7.54115  -36.1106    -11.7496    11.2209    -1.79995   
   2.96696    46.217       -1.40911    75.0949    27.5513    -3.66848  …  -36.9966   -45.3326    16.136     -21.0145     5.47982  -11.8452    
  41.6961     25.8745      29.335       4.15438  -45.8787   -44.9912       67.5224   -24.6367    56.7467     10.4153   -10.4572   -18.5692    
  21.3235    -30.0029     -31.4225    -38.153    -29.2       37.1059       46.3853     6.42264    1.21527   -61.9367     1.3304    18.854     
  -3.50268    12.5908      47.5534      4.62536  -62.8162    -5.87803      24.3344   -64.3619   -54.6748    -18.0393     1.03131   -7.18129   
   3.10441    17.5996     -10.1619      6.87622   -1.52725  -25.9976      -17.2381    44.9151   -30.6112     36.949     25.7487    -4.64311   
   ⋮                                                          ⋮        ⋱               ⋮                                                      
  -0.25848   -26.5611     -21.7325     28.1408    -1.25443   25.9933   …   15.8921    42.1656   -34.5233    -56.8323   -24.6462    47.475     
   2.97608   -20.6029      38.6915      6.64966   45.6482    19.4757       20.9733   -24.903     21.1787    -30.8994   -16.2787   -40.509     
   4.02883   -34.8826      -3.34805   -38.1582    40.5341    64.5054       50.0794   -34.1333    -2.95293    18.2688   -11.8533   -38.5013    
  12.4209     -3.19724    -33.9069     36.8744   -28.0702    53.2995       13.0724    53.7317    14.6668     49.303     10.8247    51.3291    
 -18.5631    -24.0589      -6.52054    34.7071    22.1703   -18.5829       -3.24598   10.1572   -38.5507    -34.5536   -59.9946   -20.7777    
 -25.0644     18.3918     -36.5186     10.7124    82.628    -44.118    …  -39.3714     1.20289  -21.4417     22.7627     3.55199  -36.0679    
  15.1521    -24.1379     -18.4929    -29.8979    26.7423    24.24        -18.1427     9.34076   81.6854      1.93507  -68.5226   -30.7109    
  12.0685      2.69166     15.5109    -10.6751    13.3148    13.6762      -22.8099    34.7844     6.12369   -56.9632    25.2193    72.3292    
 -27.2328    -45.9552      30.7724     37.1681     6.97637    7.16855      -7.0608   -31.0684   -20.0999     -6.20643   27.8918   -41.7873    
  24.3027     69.0994       1.83236   -10.791      1.16588   15.7833       14.2337   -37.2954    12.9956     65.0574    23.6132    65.8518    
 -22.6034     -3.46183     21.5065    -73.0987    41.7081   -50.3692   …   42.7244    14.4162   -55.0306     26.7348    94.2211   -24.3135    
  30.195       8.71624    -11.5629    -17.1744    24.1736   -77.5655       69.8622     1.56408   27.9262    -12.5686    64.3182   -13.0658    
   4.81069   -40.5611     -28.4828      6.60299   59.8823    16.0372        8.16483   53.9061    -6.52424     3.28285  -35.407    -56.1281    
 -46.878       8.07851     23.708     -18.5416    44.5693    -7.28137      -5.50011  -15.2032    17.4343     39.6803    17.2223    -5.02152   
  11.9364      8.46358    -33.1727    -18.1967   -36.1323     1.66343       6.10104   28.2232    14.3468     58.7046    -7.21755  -75.5267    

julia> mul!(C800_840, A800_900, B900_840)
800×840 StaticArrays.MArray{Tuple{800,840},Float64,2,672000}:
  38.0743    -37.3638     -23.3069     43.6012   -14.9798   -22.6108   …    6.68674   31.0339    22.0678    -25.335    -12.3134    34.2845    
  -0.771707  -24.4486     -46.6449     37.0407    24.6849   -22.056         4.19191    6.73854  -18.4685     90.6574    34.1527   -41.9439    
  -4.7973    -56.2395      -0.369894   21.4934   -30.8941   -20.8357       21.4338    11.461    -26.1999     -8.51833   22.5552   -52.3642    
  21.6411     -4.72307     47.6128     16.3918    29.8575   -22.1772        6.87733    3.47535  -41.9409     -3.14536   39.851      6.89992   
  16.9652     -0.0380062   90.3321    -48.3799    55.2641     5.03981      -8.4407     9.44483  -10.5455     11.4134    22.1736    28.6386    
   1.10867    32.7442      -9.571      -4.61297   26.8078    35.2818   …  -25.8284    -1.502      0.323376   50.3406    -9.49194  -12.6995    
  53.7033     -2.69979     29.6596     21.8752   -42.6504   -56.8507      -44.3737   -18.3475    -9.93175   -26.6298    -7.92967   33.2311    
 -31.6189     14.6708       7.66491    34.2681    22.136    -34.5229       20.1327     1.23482   11.1025     16.8833   -20.2968    51.6909    
 -15.412     -11.2317      23.9317     24.7645   -29.6448   -13.797         3.52927   -8.74718    4.65412    42.6386    -6.51217    0.00954396
   3.85964   -28.4584     -33.407     -20.6061   -10.6759    -0.94623       4.65821    7.54115  -36.1106    -11.7496    11.2209    -1.79995   
   2.96696    46.217       -1.40911    75.0949    27.5513    -3.66848  …  -36.9966   -45.3326    16.136     -21.0145     5.47982  -11.8452    
  41.6961     25.8745      29.335       4.15438  -45.8787   -44.9912       67.5224   -24.6367    56.7467     10.4153   -10.4572   -18.5692    
  21.3235    -30.0029     -31.4225    -38.153    -29.2       37.1059       46.3853     6.42264    1.21527   -61.9367     1.3304    18.854     
  -3.50268    12.5908      47.5534      4.62536  -62.8162    -5.87803      24.3344   -64.3619   -54.6748    -18.0393     1.03131   -7.18129   
   3.10441    17.5996     -10.1619      6.87622   -1.52725  -25.9976      -17.2381    44.9151   -30.6112     36.949     25.7487    -4.64311   
   ⋮                                                          ⋮        ⋱               ⋮                                                      
  -0.25848   -26.5611     -21.7325     28.1408    -1.25443   25.9933   …   15.8921    42.1656   -34.5233    -56.8323   -24.6462    47.475     
   2.97608   -20.6029      38.6915      6.64966   45.6482    19.4757       20.9733   -24.903     21.1787    -30.8994   -16.2787   -40.509     
   4.02883   -34.8826      -3.34805   -38.1582    40.5341    64.5054       50.0794   -34.1333    -2.95293    18.2688   -11.8533   -38.5013    
  12.4209     -3.19724    -33.9069     36.8744   -28.0702    53.2995       13.0724    53.7317    14.6668     49.303     10.8247    51.3291    
 -18.5631    -24.0589      -6.52054    34.7071    22.1703   -18.5829       -3.24598   10.1572   -38.5507    -34.5536   -59.9946   -20.7777    
 -25.0644     18.3918     -36.5186     10.7124    82.628    -44.118    …  -39.3714     1.20289  -21.4417     22.7627     3.55199  -36.0679    
  15.1521    -24.1379     -18.4929    -29.8979    26.7423    24.24        -18.1427     9.34076   81.6854      1.93507  -68.5226   -30.7109    
  12.0685      2.69166     15.5109    -10.6751    13.3148    13.6762      -22.8099    34.7844     6.12369   -56.9632    25.2193    72.3292    
 -27.2328    -45.9552      30.7724     37.1681     6.97637    7.16855      -7.0608   -31.0684   -20.0999     -6.20643   27.8918   -41.7873    
  24.3027     69.0994       1.83236   -10.791      1.16588   15.7833       14.2337   -37.2954    12.9956     65.0574    23.6132    65.8518    
 -22.6034     -3.46183     21.5065    -73.0987    41.7081   -50.3692   …   42.7244    14.4162   -55.0306     26.7348    94.2211   -24.3135    
  30.195       8.71624    -11.5629    -17.1744    24.1736   -77.5655       69.8622     1.56408   27.9262    -12.5686    64.3182   -13.0658    
   4.81069   -40.5611     -28.4828      6.60299   59.8823    16.0372        8.16483   53.9061    -6.52424     3.28285  -35.407    -56.1281    
 -46.878       8.07851     23.708     -18.5416    44.5693    -7.28137      -5.50011  -15.2032    17.4343     39.6803    17.2223    -5.02152   
  11.9364      8.46358    -33.1727    -18.1967   -36.1323     1.66343       6.10104   28.2232    14.3468     58.7046    -7.21755  -75.5267    

Benchmark:

julia> @benchmark jmul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.793 ms (0.00% GC)
  median time:      17.102 ms (0.00% GC)
  mean time:        17.165 ms (0.00% GC)
  maximum time:     19.265 ms (0.00% GC)
  --------------
  samples:          292
  evals/sample:     1

julia> @benchmark mul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     23.235 ms (0.00% GC)
  median time:      23.628 ms (0.00% GC)
  mean time:        23.698 ms (0.00% GC)
  maximum time:     25.558 ms (0.00% GC)
  --------------
  samples:          211
  evals/sample:     1

We’re still faster than OpenBLAS!
Perfect scaling would give us:

julia> (800 / 16) * (900 / 32) * (840 / 14) * 140 / 1e6
11.8125

Just over 11.8 milliseconds – we’re now around 50% slower.
MKL does it in 11.5 ms, so still better than the scaled up kernel, but only slightly so.

EDIT:
Ryzen results:

julia> @benchmark jmul!($C128_126, $A128_128, $B128_126)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     137.429 μs (0.00% GC)
  median time:      138.641 μs (0.00% GC)
  mean time:        139.844 μs (0.00% GC)
  maximum time:     262.424 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($C128_126, $A128_128, $B128_126)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     158.137 μs (0.00% GC)
  median time:      163.022 μs (0.00% GC)
  mean time:        163.917 μs (0.00% GC)
  maximum time:     215.655 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark jmul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.583 ms (0.00% GC)
  median time:      44.016 ms (0.00% GC)
  mean time:        44.166 ms (0.00% GC)
  maximum time:     46.266 ms (0.00% GC)
  --------------
  samples:          114
  evals/sample:     1

julia> @benchmark mul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     41.944 ms (0.00% GC)
  median time:      42.194 ms (0.00% GC)
  mean time:        42.289 ms (0.00% GC)
  maximum time:     43.389 ms (0.00% GC)
  --------------
  samples:          119
  evals/sample:     1

A little slower than OpenBLAS with Ryzen at the larger size, but pretty close.

So:

So given the above, I am led to believe that a pure-Julia blas-like implementations might be realistic for things like matrix-matrix multiplication.

Definitely! Already highly competitive with OpenBLAS on the two architectures I tested. Significantly faster on one, and barely slower on the other. I think the memory can be much better optimized than it is now (ie, clever recursive loading of submatrices). However, I am worried that the optimal memory pre-fetching behavior may be very architecture dependent.
Threading would also probably be difficult? I have no idea.

The bottom curve is for the matrix multiplication carried out for MMatrices (up to size 10). EDIT: one of the matrices (gradN) is an MMatrix always.

Given that, you could definitely improve my function a lot. API-wise, get the number of columns from gradN, instead of passing it as a Val. For another, we could split the outer loop into a variable number of pieces based on Kedim, with L changing for each piece as appropriate.
Or, as a lazy hack, just have it call the other function when Kedim < some threshold.

6 Likes

I see.
Maybe @kristoffer.carlsson can assist with that?
We’re truing to employ Intel MKL in Low Overhead Mode in Julia.
Any chance you have idea what @Elrod should do?

This looks really cool! Thanks for the pointer, I will need to take some time to wrap my head around it.