Did you disable multithreading inside the BLAS routines? I noticed that, at least on 0.6, there is significant overhead if BLAS tries to multithread.
No, I haven’t tried disabling multithreading.
My first example was with disabled multithreading, yet it was still much faster. When I re-enabled it, it got much faster still at larger sizes.
I’m on 0.7, and OpenBLAS now seems smarter about being single threaded for smaller matrix sizes. Eg, as far as I could tell it ran single threaded here, regardless of what BLAS.set_num_threads was set to.
julia> BLAS.set_num_threads(1)
julia> gradN = rand(60, 40);
julia> mult = 1.4;
julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
11.014 μs (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
8.425 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
9.289 μs (0 allocations: 0 bytes)
julia> BLAS.set_num_threads(10)
julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
10.994 μs (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
8.420 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
9.277 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
37.251 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
52.913 μs (0 allocations: 0 bytes)
Times were the same, and I never saw CPU usage exceed 100%.
Even with the generic installs, OpenBLAS is supposed to detect your processor and pick the appropriate kernel, so it shouldn’t matter whether you built it from source or not.
BTW, @PetrKryslUCSD, if you’re primarily concerned about small sizes and willing to use StaticArrays:
julia> using StaticArrays
julia> gradN = @MMatrix rand(3, 2);
julia> Ke = @MMatrix fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
5.714 ns (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
8.696 ns (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
8.697 ns (0 allocations: 0 bytes)
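If the sizes are truly tiny and known at compile time, the immutable SMatrix route is even simpler; a minimal sketch I'm adding for illustration (not from the original post; the 3x2 shape just mirrors the example above):
using StaticArrays
gradN = @SMatrix rand(3, 2)   # size is part of the type
mult = 1.4
Ke = mult * (gradN * gradN')  # returns a new 3×3 SMatrix; unrolls completely at compile time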
MMatrix also defaults to BLAS calls at larger sizes, but many of the other StaticArrays functions (eg, @MMatrix randn(m, n)) aren’t optimized for big matrices, so lots of things would randomly be painfully slow.
This includes BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke), which eventually becomes the fastest method. (In my application gradN is not likely to be bigger than 64x3.)
Ah, okay. I think the skinniness of your matrix is part of the problem.
I would’ve called a 14x14 matrix “small”, and that has more elements than the 64x3. (Although X * X’ would be much larger for the latter.)
3 may also be an awkward width. Eg, on my computer, BLAS takes about as long for 64x3 as it does for 64x4, while the loop versions are of course about 33% faster for the former:
julia> gradN = rand(64, 3);
julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
6.355 μs (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
2.933 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
2.800 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
3.731 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
4.930 μs (0 allocations: 0 bytes)
julia> gradN = rand(64, 4);
julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
6.355 μs (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
3.100 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
2.981 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
4.935 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only_wo!($Ke, $gradN, $mult);
6.628 μs (0 allocations: 0 bytes)
Very interesting data. Thanks! And thanks for your suggestions. A lot of food for thought…
Are you using Julia + Intel MKL?
If you use OpenBLAS you’re leaving a lot of performance on the table.
You should try using Intel MKL.
I wonder if Julia uses Intel MKL with low overhead (Intel MKL has a calling method which has less overhead and gets great performance for small matrices).
No MKL. I am on 0.7 at the moment.
Here, I am comparing to MKL:
julia> using LinearAlgebra
julia> BLAS.vendor()
:mkl
julia> versioninfo()
Julia Version 0.7.0-beta.200
Commit 9277d3af2d (2018-07-07 19:47 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Currently, SIMD.jl is broken on master: https://github.com/JuliaLang/julia/issues/27694
So pardon this workaround:
struct Vec{N,T} <: AbstractVector{T}
    # data::NTuple{N,Core.VecElement{T}}
    data::NTuple{N,T}
    Vec(data::Vararg{T,N}) where {N,T} = new{N,T}(data)
end
@inline Base.getindex(x::Vec, i) = x.data[i]
@inline Base.length(::Vec{N}) where N = N
@inline Base.size(::Vec{N}) where N = (N,)
@inline Base.eltype(::Vec{N,T}) where {N,T} = T
@inline function vload(::Type{Vec{N,T}}, x::AbstractArray{T}, i) where {N,T}
    unsafe_load(Base.unsafe_convert(Ptr{Vec{N,T}}, pointer(x)) + sizeof(T)*i)
end
@inline function vstore!(x::AbstractArray{T}, v::Vec{N,T}, i) where {N,T}
    unsafe_store!(Base.unsafe_convert(Ptr{Vec{N,T}}, pointer(x)) + sizeof(T)*i, v)
end
function create_quote()
    q = quote
        @inbounds @fastmath begin
            $(Expr(:meta, :inline))
            Vec()
        end
    end
    q, q.args[2].args[3].args[3].args[4].args
end
@generated function Base.fma(x::Vec{N,T}, y::T, z::Vec{N,T}) where {N,T}
    q, qa = create_quote()
    for n ∈ 1:N
        push!(qa, :(x[$n] * y + z[$n]))
    end
    q
end
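To make the workaround concrete, here is a quick usage sketch (my addition; the values are arbitrary, and note the index arguments are 0-based element offsets, per the pointer arithmetic above):
x = collect(1.0:8.0)
v = vload(Vec{4,Float64}, x, 0)  # loads x[1:4]
w = fma(v, 2.0, v)               # elementwise v .* 2.0 .+ v, via the generated fma
vstore!(x, w, 4)                 # stores w into x[5:8]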
LLVM is good enough at autovectorizing that we don’t really need to be explicit anyway.
Here is a new version that uses the above functions; I probably should have just called it Syrk:
@generated function add_mggt_ut_only!(Ke, gradN, mult, ::Val{mdim}) where mdim
    L = 8
    quote
        @assert size(Ke, 1) == size(Ke, 2)
        Kedim = size(Ke, 1)
        # nne, mdim = size(gradN)
        V = Vec{$L,Float64}
        SIMD_loop_max = Kedim ÷ $L * $L - 1
        @inbounds for c = 1:Kedim
            Base.Cartesian.@nexprs $mdim m -> gradNc_m = gradN[c, m] * mult
            for r = 0:$L:min(c-1, SIMD_loop_max)
                Ktemp = vload(V, Ke, r + (c-1)*Kedim)
                Base.Cartesian.@nexprs $mdim m -> (Ktemp = fma(vload(V, gradN, r+(m-1)*Kedim), gradNc_m, Ktemp))
                vstore!(Ke, Ktemp, r + (c-1)*Kedim)
            end
        end
        remainder = Kedim % $L
        if remainder > 0
            bound = Kedim - remainder + 1
            @inbounds for c = bound:Kedim
                Base.Cartesian.@nexprs $mdim m -> gradNc_m = gradN[c, m] * mult
                for r = bound:c
                    Ktemp = Ke[r,c]
                    Base.Cartesian.@nexprs $mdim m -> (Ktemp = fma(gradN[r,m], gradNc_m, Ktemp))
                    Ke[r,c] = Ktemp
                end
            end
        end
        return true
    end
end
It’s kind of naive, and I wrote it assuming that mdim is very small, for example 3.
It also gives jagged results.
The upper triangle is complete, but there’s lots of stuff extending into the lower triangle. The numbers there aren’t wrong; I just didn’t bother continuing in the lower triangle any further beyond completing a vector.
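Since only the upper triangle is guaranteed, one way to consume the result safely (my suggestion, not from the post) is to read or keep just that triangle:
using LinearAlgebra
S = Symmetric(Ke, :U)  # indexes only the upper triangle; the jagged lower part is never read
triu!(Ke)              # or zero out everything below the diagonal in place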
It’s the version taking a Val argument for the number of columns in gradN:
julia> using LinearAlgebra, BenchmarkTools
julia> mult = 1.4
1.4
julia> gradN = rand(64, 3);
julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
4.397 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
688.748 ns (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
1.425 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
713.936 ns (0 allocations: 0 bytes)
julia> gradN = rand(63, 3);
julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
4.262 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
780.188 ns (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
1.555 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
907.872 ns (0 allocations: 0 bytes)
julia> gradN = rand(65, 3);
julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
4.521 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
853.724 ns (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
1.679 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
1.094 μs (0 allocations: 0 bytes)
EDIT:
Comparing to OpenBLAS:
julia> mult = 1.4
1.4
julia> gradN = rand(64, 3);
julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
4.396 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
741.113 ns (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
2.922 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
2.778 μs (0 allocations: 0 bytes)
julia> gradN = rand(63, 3);
julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
4.255 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
745.946 ns (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
2.993 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
2.909 μs (0 allocations: 0 bytes)
julia> gradN = rand(65, 3);
julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult);
4.522 μs (0 allocations: 0 bytes)
julia> @btime add_mggt_ut_only!($Ke, $gradN, $mult, Val(3));
705.270 ns (0 allocations: 0 bytes)
julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
3.076 μs (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
2.996 μs (0 allocations: 0 bytes)
Very cool!
No, apparently not:
The link does say that for C/C++, all you have to do is add -DMKL_DIRECT_CALL or -DMKL_DIRECT_CALL_SEQ. So maybe we could try adding one of those to MKL_LDFLAGS and building from source later to test it?
Not sure if that’s correct or whether it’d work.
It would be great if you could pull it out so we can see what gains were made.
Your code is clearly the fastest in my tests.
For your reference, this is how the curves were generated:
twi = @belapsed add_mggt_ut_only!($Ke, $gradN, 1.0)
tsn = @belapsed 1.0 * ($gradN*Transpose($gradN))
tmn = @belapsed 1.0 * mul!($Ke, $gradN, Transpose($gradN))
two = @belapsed add_mggt_ut_only_wo!($Ke, $gradN, 1.0)
tsd = @belapsed 1.0 .* ($gradN*Transpose($gradN))
tmd = @belapsed 1.0 .* mul!($Ke, $gradN, Transpose($gradN))
tgd = @belapsed BLAS.gemm!('N', 'T', 1.0, $gradN, $gradN, 0.0, $Ke)
tw2 = @belapsed add_mggt_ut_only_g!($Ke, $gradN, 1.0, Val(2))
Performance of the generated code is excellent; I only wish it were easier to read.
What is striking on the graph that includes the generated-loops function of @Elrod is that it yields excellent performance both for very small matrices and for large matrices. This is to be contrasted with gemm!, which apparently has an overhead that eventually gets amortized but is significant for small matrix sizes. (Curiously, the plain matrix-matrix multiplication A*B and the mul! function both seem to have an overhead on top of gemm!. I don’t know yet where. @andreasnoack perhaps might be able to help?)
So given the above, I am led to believe that a pure-Julia BLAS-like implementation might be realistic for things like matrix-matrix multiplication. For larger matrices, packing to facilitate cache access is the smart thing to do, but the packing should probably be turned off for small matrices. BLAS cannot do that; hence the disappointing performance for small matrices. That’s my thinking, and I wonder if the experts would care to chime in?
One more update: the matrices can be StaticArrays (MMatrix). Unfortunately, only for small matrices. If one could decide dynamically (with type stability) whether it should be an MMatrix or a regular matrix, the matrix multiplication could be performed even faster for small matrices:
The bottom curve is for the matrix multiplication carried out for MMatrices (up to size 10). EDIT: one of the matrices (gradN) is an MMatrix always. The square matrix Ke is an MMatrix only up to size 10x10.
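For what it’s worth, one way to approximate that dynamic choice is a function barrier: branch on the runtime size and hand off to a concretely typed kernel. A rough sketch (my addition, not from the thread; the names and the threshold of 10 are made up, and constructing the SMatrix from a runtime size costs one dynamic dispatch per call):
using StaticArrays

ke_kernel(g::StaticMatrix, mult) = mult * (g * g')    # fully unrolled for static sizes
ke_kernel(g::AbstractMatrix, mult) = mult * (g * g')  # falls back to BLAS

function ke_adaptive(gradN::Matrix{Float64}, mult)
    n, m = size(gradN)
    if n <= 10                       # made-up threshold, mirroring the plot
        g = SMatrix{n,m}(gradN)      # the one dynamic dispatch happens here
        return Matrix(ke_kernel(g, mult))
    else
        return ke_kernel(gradN, mult)
    end
end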
I compiled it with the flag -DMKL_DIRECT_CALL added, but did not see any difference. So I am guessing that isn’t how you’re supposed to do that. Can’t say I’m very familiar with the build system…
@PetrKryslUCSD
Yeah, I need to work on the readability of the code.
I should spend some time eventually to come up with a bunch of abstractions, and then either try to implement them as macros, or at least as functions that modify a quote in a clear way, to try and make it easier to read, write, and maintain.
In that particular case, L is the unrolled length. If Kedim is large, you would want that to be 4x your vector length. For a computer with avx, that’d be 4*4 = 16, and with avx512, that’d be 8*4 = 32.
I chose 8 because that seemed better for Kedim values of around 64.
Of course, if we knew Kedim at compile time, we could just choose appropriately – or, ideally, simplify the expression a bit so that the compiler automatically chooses at compile time. (Remember, on one of my computers the compiler chose 32 for the @inbounds version, one of the reasons it was slow at small sizes.)
Whenever you see @nexprs or Base.Cartesian.@nexprs, you can think of it as an unrolled for loop.
julia> using Base.Cartesian
julia> @macroexpand @nexprs 4 i -> w[i] = x[i] * y[1+5*(i-1)] + z[i]
quote
    w[1] = x[1] * y[1] + z[1]
    w[2] = x[2] * y[6] + z[2]
    w[3] = x[3] * y[11] + z[3]
    w[4] = x[4] * y[16] + z[4]
end
Basic explanation of what this is doing / what I was thinking:
@inbounds for c = 1:Kedim
    Base.Cartesian.@nexprs $mdim m -> gradNc_m = gradN[c, m] * mult
    for r = 0:$L:min(c-1, SIMD_loop_max)
        Ktemp = vload(V, Ke, r + (c-1)*Kedim)
        Base.Cartesian.@nexprs $mdim m -> (Ktemp = fma(vload(V, gradN, r+(m-1)*Kedim), gradNc_m, Ktemp))
        vstore!(Ke, Ktemp, r + (c-1)*Kedim)
    end
end
An element of Ke equals the dot product of two rows of gradN. While avx512 and GPUs can “gather” various noncontiguous elements into a vector for SIMD-style operations, normal avx cannot (and I think it is slow even for avx512). What is fast is loading a chunk of numbers stored one after the other, or broadcasting a single number across a register (ie, loading one number into every element).
a = [1.0, 2.0, 3.0, 4.0]
So for avx, where we can load 4 doubles at a time, we could either load 1.,2.,3.,4., or we could fill a register with 4 copies of any one of these numbers, like 3.,3.,3.,3. Similar story for storage – we can efficiently store a vector contiguously, but splitting it up is slow. So, let’s take a look at Ke:
Ke += [gradN[1,:]'*gradN[1,:] gradN[1,:]'*gradN[2,:] ... ;
       gradN[2,:]'*gradN[1,:] gradN[2,:]'*gradN[2,:] ... ;
       gradN[3,:]'*gradN[1,:] gradN[3,:]'*gradN[2,:] ... ;
       gradN[4,:]'*gradN[1,:] gradN[4,:]'*gradN[2,:] ... ]
So, to calculate the first 4 elements of column 1 of Ke, what we want is:
Ke[1:4,1] .+= gradN[1:4,1] .* gradN[1,1] .+ gradN[1:4,2] .* gradN[1,2] .+ gradN[1:4,3] .* gradN[1,3] .+...
Keeping that in mind, here are the loops, in order from outer to inner:
- Fill in one column of Ke at a time (but stop as soon as we’ve both finished the upper triangle and completed a vector). Here we pull out the mdim elements gradN[c,1], gradN[c,2], …, gradN[c,mdim] (the compiler broadcasts automatically). I also moved the multiplication by mult here:
  Base.Cartesian.@nexprs $mdim m -> gradNc_m = gradN[c, m] * mult
- Load one set of vectors (a total of L elements) at a time from column c. With L=8 and avx2, that would be gradN[1:4,c] and gradN[5:8,c]. I don’t have to be explicit about the actual length; it just should be some multiple of the SIMD length. Multiple operations can overlap on the same core, so doing 16 will not take 4x longer than doing 4. But if Kedim == 29, L==4 takes care of all but the last element (29 % 4 == 1), while L == 16 leaves a whole lot of non-vectorized elements. How much we actually want to unroll depends on how big we think Kedim is. Realistically, because we’re only doing the upper triangle, it might make sense to split the outer c loop into pieces, with different values of L for each one. Especially if you decide to make Kedim known at compile time (eg, by using MMatrices), in which case any extra loops will get compiled away when unnecessary.
- For the inner unrolled loop, we just use fma instructions across the columns:
  Ktemp = vload(V, Ke, r + (c-1)*Kedim) # loads Ke[(1:L) + L*iter, c]
  Base.Cartesian.@nexprs $mdim m -> (Ktemp = fma(vload(V, gradN, r+(m-1)*Kedim), gradNc_m, Ktemp))
  vstore!(Ke, Ktemp, r + (c-1)*Kedim) # stores Ke[(1:L) + L*iter, c]
  # The inner `@nexprs` part is doing this:
  Ke[1:4,1] .+= gradN[1:4,1] .* gradN[1,1] .+ gradN[1:4,2] .* gradN[1,2] .+ gradN[1:4,3] .* gradN[1,3] .+ ...
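For readability, here is a plain-loop reference version of the same upper-triangle update (my sketch; the name is made up — it computes what the vectorized code does, minus the spill into the lower triangle):
function add_mggt_ut_ref!(Ke, gradN, mult)
    Kedim, mdim = size(gradN)
    @inbounds for c in 1:Kedim, r in 1:c   # upper triangle only
        Ktemp = Ke[r, c]
        for m in 1:mdim
            Ktemp += gradN[r, m] * (gradN[c, m] * mult)
        end
        Ke[r, c] = Ktemp
    end
    return Ke
end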
Yes, I think we can write a BLAS in Julia.
I’ve been playing around with it for a little while. You can check out some work here:
Major issues:
- It only works for matrices that are a multiple of the kernel size right now. I haven’t filled in the missing edges yet.
- It only works for MMatrices from StaticArrays. Once I finish the rest, I’ll have a non-generated version that works for regular old arrays (and probably has just a tiny bit more overhead, because of calculating some things at runtime).
- I haven’t played with kernels involving matrix transposes yet.
- I haven’t implemented a recursive algorithm yet. I think if I load submatrices recursively, I might be able to scale better. Current scaling is a lot worse than O(n^3) on my Intel processor with avx512, but only a tiny bit worse with Ryzen. The avx512 chip has 4x the throughput of Ryzen per core, so throughput / memory speed is a lot worse on Intel.
- No threading!
That said, it looks good! Versus OpenBLAS with avx-512, at modest matrix sizes:
julia> using OhMyREPL, jBLAS, BenchmarkTools, LinearAlgebra
julia> BLAS.set_num_threads(1)
julia> A128_128 = mrandn(128,128); # WAY faster than @MMatrix randn(M, N) for M,N > TINY_NUMBER
julia> B128_126 = mrandn(128,126);
julia> C128_126 = mrandn(128,126);
julia> jmul!(C128_126, A128_128, B128_126) # Confirming the answer is correct
128×126 StaticArrays.MArray{Tuple{128,126},Float64,2,16128}:
-7.01447 -2.98439 -5.74117 -9.92775 -1.63882 -17.4866 … 2.50444 -15.7145 -2.52175 -9.72732 8.23362 -0.0480611
-13.3085 8.11033 11.9918 -8.50466 -16.7253 10.7143 -18.5143 -2.83639 -7.44774 20.3162 21.9342 11.5031
-7.28199 -4.51168 -4.77706 1.99367 11.8847 -2.24961 6.93923 19.1472 -2.97989 11.0127 -9.27231 -12.1461
2.60148 -8.7429 -13.6254 -4.60391 -4.28139 9.45661 -8.99027 -8.84954 17.4719 5.47737 -0.729445 -0.889929
7.92216 16.7981 -25.3232 7.79537 -5.79957 15.1526 1.85937 -5.81069 6.26211 6.36528 -10.2083 -13.9899
0.979052 -3.23154 22.9183 1.58329 17.2893 -4.77779 … -1.71329 -0.345357 10.1351 11.725 17.9994 4.13137
-12.4654 -0.792605 -6.28939 11.8703 5.06421 -13.3586 17.5039 4.64551 0.710493 -13.2521 5.19574 4.90359
8.91004 13.639 3.06454 6.00014 -12.2289 20.0813 -2.97452 -0.23808 4.63956 -2.57015 1.46739 -23.3573
-16.5464 -2.53216 -1.79439 8.16519 -1.01636 -7.23788 5.60839 11.4347 3.56887 2.96409 -9.67572 -7.79263
16.0071 43.5603 4.85504 23.9241 -18.9683 10.9092 20.7221 -22.2934 -6.03801 9.84488 -8.57055 -22.9826
-5.39735 -6.35583 5.50225 -7.39742 -4.2174 -3.1449 … 9.08321 17.6362 -24.2872 -12.1334 -9.25208 18.5631
-0.732309 -1.07189 1.65641 14.5927 -6.79831 15.3076 21.3605 -1.77275 7.6672 -16.556 -31.9151 -6.92279
12.066 -17.7769 -4.3234 26.788 16.565 -6.90403 -12.7472 20.2385 -8.46922 5.89168 1.0592 7.93109
11.4157 -5.31638 -0.804631 -5.265 -7.54981 12.7636 5.82927 -2.67509 13.0376 11.4717 -7.15938 4.45602
-12.1544 7.99333 13.4734 -18.5341 -3.25063 4.44092 16.3537 8.65397 6.89471 -5.06701 0.826184 -2.90417
⋮ ⋮ ⋱ ⋮ ⋮
3.99748 -11.4429 2.66474 -24.4344 -2.51918 -1.79083 1.51211 15.0584 -23.3553 0.185678 13.1895 21.9612
6.95632 18.9656 -28.5299 12.5857 17.2015 8.48111 -7.30871 -0.775281 10.3668 2.97811 0.366781 -15.2664
0.95553 -11.2707 1.44373 10.3198 -15.4154 -4.585 … 17.7993 23.9214 3.86748 8.52783 17.1222 6.19255
7.9361 4.20605 9.85275 -11.5198 2.57681 -5.64149 -4.27823 -8.12214 -8.17309 -6.81521 -8.3202 -8.26316
-6.06597 3.94461 -9.33761 3.13537 8.77722 1.36968 13.7093 11.4851 1.92219 -3.03784 -16.6991 -4.27387
-5.22723 -11.1999 -3.65735 -0.39027 14.0611 -5.49591 8.30515 2.11561 -0.085943 2.98077 -9.54455 -1.59257
-17.5957 14.5879 -3.11054 -3.93381 -3.86949 15.4347 17.1275 16.5219 22.5019 5.06741 -12.1815 5.4818
14.2494 3.93928 0.326678 -1.06051 5.44846 23.2466 … -9.01874 -3.71429 15.6083 16.0184 -4.47404 -3.55087
-1.22372 7.79477 21.0122 20.1065 12.6066 -4.94802 5.17584 10.2214 -16.1313 7.29539 4.35832 6.49918
8.16857 2.32976 -8.22107 8.23127 -10.5885 -1.98484 0.685584 -8.93469 13.0756 -5.6306 2.49534 -8.93337
9.36717 -10.5582 -12.3869 2.48226 0.942078 -14.5936 -4.79657 -15.6375 -1.2756 0.011122 22.9251 18.1269
0.171453 15.7406 -3.94504 6.73699 -27.3994 19.8036 19.7026 -5.57216 21.47 10.456 -7.53643 6.64545
-1.82776 -3.34233 -8.33704 -11.4722 -1.40182 0.569771 … 4.55985 13.976 -4.01863 -4.04004 -6.5058 1.75928
-13.3649 -0.879979 8.87948 -16.163 0.670733 0.160372 10.6509 14.8881 8.81195 -9.0987 1.53641 25.0194
-1.51149 12.2888 13.5032 4.0348 -10.9137 -2.1944 11.5437 -13.8045 15.9694 -5.84135 -14.469 -6.024
julia> mul!(C128_126, A128_128, B128_126)
128×126 StaticArrays.MArray{Tuple{128,126},Float64,2,16128}:
(output identical to the jmul! result above)
The benchmark:
julia> @benchmark jmul!($C128_126, $A128_128, $B128_126) #Pure Julia
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 40.205 μs (0.00% GC)
median time: 40.398 μs (0.00% GC)
mean time: 41.286 μs (0.00% GC)
maximum time: 89.556 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($C128_126, $A128_128, $B128_126) #OpenBLAS
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 100.334 μs (0.00% GC)
median time: 100.649 μs (0.00% GC)
mean time: 102.273 μs (0.00% GC)
maximum time: 174.168 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Looking at a very small size, multiplying 16x32 * 32x14 matrices takes about 140ns. (I said 65 ns earlier, but that was because of a bug.)
Perfect O(n^3) scaling would thus give us
julia> (128 / 16) * (128 / 32) * (128 / 14) * 140 / 1e3
40.96
microseconds, so more or less perfect scaling.
MKL had a minimum of 36.549 μs — faster than even perfect scaling of my kernel — so maybe my kernel is a little suboptimal?
That would imply roughly 125 ns for the kernel, so they must be doing something a little different.
Bigger matrix set up (and answer confirmation):
julia> A800_900 = mrandn(16*50, 900);
julia> B900_840 = mrandn(900,14*60);
julia> C800_840 = mrandn(16*50,14*60);
julia> jmul!(C800_840, A800_900, B900_840)
800×840 StaticArrays.MArray{Tuple{800,840},Float64,2,672000}:
38.0743 -37.3638 -23.3069 43.6012 -14.9798 -22.6108 … 6.68674 31.0339 22.0678 -25.335 -12.3134 34.2845
-0.771707 -24.4486 -46.6449 37.0407 24.6849 -22.056 4.19191 6.73854 -18.4685 90.6574 34.1527 -41.9439
-4.7973 -56.2395 -0.369894 21.4934 -30.8941 -20.8357 21.4338 11.461 -26.1999 -8.51833 22.5552 -52.3642
21.6411 -4.72307 47.6128 16.3918 29.8575 -22.1772 6.87733 3.47535 -41.9409 -3.14536 39.851 6.89992
16.9652 -0.0380062 90.3321 -48.3799 55.2641 5.03981 -8.4407 9.44483 -10.5455 11.4134 22.1736 28.6386
1.10867 32.7442 -9.571 -4.61297 26.8078 35.2818 … -25.8284 -1.502 0.323376 50.3406 -9.49194 -12.6995
53.7033 -2.69979 29.6596 21.8752 -42.6504 -56.8507 -44.3737 -18.3475 -9.93175 -26.6298 -7.92967 33.2311
-31.6189 14.6708 7.66491 34.2681 22.136 -34.5229 20.1327 1.23482 11.1025 16.8833 -20.2968 51.6909
-15.412 -11.2317 23.9317 24.7645 -29.6448 -13.797 3.52927 -8.74718 4.65412 42.6386 -6.51217 0.00954396
3.85964 -28.4584 -33.407 -20.6061 -10.6759 -0.94623 4.65821 7.54115 -36.1106 -11.7496 11.2209 -1.79995
2.96696 46.217 -1.40911 75.0949 27.5513 -3.66848 … -36.9966 -45.3326 16.136 -21.0145 5.47982 -11.8452
41.6961 25.8745 29.335 4.15438 -45.8787 -44.9912 67.5224 -24.6367 56.7467 10.4153 -10.4572 -18.5692
21.3235 -30.0029 -31.4225 -38.153 -29.2 37.1059 46.3853 6.42264 1.21527 -61.9367 1.3304 18.854
-3.50268 12.5908 47.5534 4.62536 -62.8162 -5.87803 24.3344 -64.3619 -54.6748 -18.0393 1.03131 -7.18129
3.10441 17.5996 -10.1619 6.87622 -1.52725 -25.9976 -17.2381 44.9151 -30.6112 36.949 25.7487 -4.64311
⋮ ⋮ ⋱ ⋮
-0.25848 -26.5611 -21.7325 28.1408 -1.25443 25.9933 … 15.8921 42.1656 -34.5233 -56.8323 -24.6462 47.475
2.97608 -20.6029 38.6915 6.64966 45.6482 19.4757 20.9733 -24.903 21.1787 -30.8994 -16.2787 -40.509
4.02883 -34.8826 -3.34805 -38.1582 40.5341 64.5054 50.0794 -34.1333 -2.95293 18.2688 -11.8533 -38.5013
12.4209 -3.19724 -33.9069 36.8744 -28.0702 53.2995 13.0724 53.7317 14.6668 49.303 10.8247 51.3291
-18.5631 -24.0589 -6.52054 34.7071 22.1703 -18.5829 -3.24598 10.1572 -38.5507 -34.5536 -59.9946 -20.7777
-25.0644 18.3918 -36.5186 10.7124 82.628 -44.118 … -39.3714 1.20289 -21.4417 22.7627 3.55199 -36.0679
15.1521 -24.1379 -18.4929 -29.8979 26.7423 24.24 -18.1427 9.34076 81.6854 1.93507 -68.5226 -30.7109
12.0685 2.69166 15.5109 -10.6751 13.3148 13.6762 -22.8099 34.7844 6.12369 -56.9632 25.2193 72.3292
-27.2328 -45.9552 30.7724 37.1681 6.97637 7.16855 -7.0608 -31.0684 -20.0999 -6.20643 27.8918 -41.7873
24.3027 69.0994 1.83236 -10.791 1.16588 15.7833 14.2337 -37.2954 12.9956 65.0574 23.6132 65.8518
-22.6034 -3.46183 21.5065 -73.0987 41.7081 -50.3692 … 42.7244 14.4162 -55.0306 26.7348 94.2211 -24.3135
30.195 8.71624 -11.5629 -17.1744 24.1736 -77.5655 69.8622 1.56408 27.9262 -12.5686 64.3182 -13.0658
4.81069 -40.5611 -28.4828 6.60299 59.8823 16.0372 8.16483 53.9061 -6.52424 3.28285 -35.407 -56.1281
-46.878 8.07851 23.708 -18.5416 44.5693 -7.28137 -5.50011 -15.2032 17.4343 39.6803 17.2223 -5.02152
11.9364 8.46358 -33.1727 -18.1967 -36.1323 1.66343 6.10104 28.2232 14.3468 58.7046 -7.21755 -75.5267
julia> mul!(C800_840, A800_900, B900_840)
800×840 StaticArrays.MArray{Tuple{800,840},Float64,2,672000}:
(output identical to the jmul! result above)
Benchmark:
julia> @benchmark jmul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 16.793 ms (0.00% GC)
median time: 17.102 ms (0.00% GC)
mean time: 17.165 ms (0.00% GC)
maximum time: 19.265 ms (0.00% GC)
--------------
samples: 292
evals/sample: 1
julia> @benchmark mul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 23.235 ms (0.00% GC)
median time: 23.628 ms (0.00% GC)
mean time: 23.698 ms (0.00% GC)
maximum time: 25.558 ms (0.00% GC)
--------------
samples: 211
evals/sample: 1
We’re still faster than OpenBLAS!
Perfect scaling would give us:
julia> (800 / 16) * (900 / 32) * (840 / 14) * 140 / 1e6
11.8125
Just over 11.8 milliseconds – we’re now roughly 40% slower than perfect scaling.
MKL does it in 11.5 ms, so still better than the scaled up kernel, but only slightly so.
EDIT:
Ryzen results:
julia> @benchmark jmul!($C128_126, $A128_128, $B128_126)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 137.429 μs (0.00% GC)
median time: 138.641 μs (0.00% GC)
mean time: 139.844 μs (0.00% GC)
maximum time: 262.424 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($C128_126, $A128_128, $B128_126)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 158.137 μs (0.00% GC)
median time: 163.022 μs (0.00% GC)
mean time: 163.917 μs (0.00% GC)
maximum time: 215.655 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark jmul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 43.583 ms (0.00% GC)
median time: 44.016 ms (0.00% GC)
mean time: 44.166 ms (0.00% GC)
maximum time: 46.266 ms (0.00% GC)
--------------
samples: 114
evals/sample: 1
julia> @benchmark mul!($C800_840, $A800_900, $B900_840)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 41.944 ms (0.00% GC)
median time: 42.194 ms (0.00% GC)
mean time: 42.289 ms (0.00% GC)
maximum time: 43.389 ms (0.00% GC)
--------------
samples: 119
evals/sample: 1
A little slower than OpenBLAS with Ryzen at the larger size, but pretty close.
So:
So given the above, I am led to believe that a pure-Julia BLAS-like implementation might be realistic for things like matrix-matrix multiplication.
Definitely! It’s already highly competitive with OpenBLAS on the two architectures I tested: significantly faster on one, and barely slower on the other. I think the memory access can be optimized much better than it is now (ie, clever recursive loading of submatrices). However, I am worried that the optimal memory pre-fetching behavior may be very architecture dependent.
Threading would also probably be difficult? I have no idea.
The bottom curve is for the matrix multiplication carried out for MMatrices (up to size 10). EDIT: one of the matrices (gradN) is an MMatrix always.
Given that, you could definitely improve my function a lot. API-wise, get the number of columns from gradN instead of passing it as a Val. For another, we could split the outer loop into a variable number of pieces based on Kedim, with L changing for each piece as appropriate.
Or, as a lazy hack, just have it call the other function when Kedim is below some threshold.
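A sketch of that lazy hack (my addition; the threshold is made up, and both methods are the ones defined earlier in the thread — note that Val(size(gradN, 2)) costs one dynamic dispatch):
function add_mggt_ut_auto!(Ke, gradN, mult)
    if size(Ke, 1) < 16   # made-up threshold
        add_mggt_ut_only!(Ke, gradN, mult)                        # plain-loop version
    else
        add_mggt_ut_only!(Ke, gradN, mult, Val(size(gradN, 2)))   # SIMD version
    end
end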
I see.
Maybe @kristoffer.carlsson can assist with that?
We’re trying to employ Intel MKL in low-overhead mode in Julia.
Any chance you have an idea of what @Elrod should do?
This looks really cool! Thanks for the pointer, I will need to take some time to wrap my head around it.