[ANN] LoopVectorization

Any chance you could implement a mixed-type multiplication that does the conversion in blocks, which I think would be faster?

1 Like

I’m not sure I understand. Could you elaborate? The current strategy is to split the matrix into blocks and then recursively keep on splitting until it arrives at a size that’s appropriate for single threaded execution.

LoopVectorization could become available to additional numeric types that are built from Float64s (e.g. Complex, Dual, DoubleFloats). Please see the end of Elrod’s next comment for more thoughts.

(original issue)

1 Like

This looks great.

I’d definitely recommend looking into packing.
Even though matrix multiplication involves only O(N^2) memory versus O(N^3) arithmetic operations, it requires careful memory management to actually keep the computation compute-bound rather than memory-bound.

I am far from an expert on the TLB (Translation Lookaside Buffer), but a major part of the problem is that virtual-to-physical address translations are expensive; the TLB caches them, making them fast and more or less free, but it only holds so many entries.
Moving through memory too quickly – like accessing successive columns of a large array – evicts TLB entries and stalls the computation while you wait for addresses to be translated.
One of the primary purposes of packing is to copy memory from disparate regions into a small contiguous block. The copy itself takes time, but paying an O(N^2) cost to reduce the coefficient on the O(N^3) computation (which would otherwise spend a high proportion of its time waiting) is quickly amortized for reasonably big N.
This paper goes into a lot of detail, and is by Goto himself (OpenBLAS descended from GotoBLAS). In case you’re wondering whatever happened to Goto: last I heard, he works at Intel.
To quote:

The most significant difference between a cache miss and a TLB miss is that a cache miss does not necessarily stall the CPU. A small number of cache misses can be tolerated by using algorithmic prefetching techniques as long as the data can be read fast enough from the memory where it does exist and arrives at the CPU by the time it is needed for computation. A TLB miss, by contrast, causes the CPU to stall until the TLB has been updated with the new address. In other words, prefetching can mask a cache miss but not a TLB miss.

The fundamental problem now is that A is typically a submatrix of a larger matrix, and therefore is not contiguous in memory. This in turn means that addressing it requires many more than the minimal number of TLB entries. The solution is to pack A in a contiguous work array, Ã.

If you want a paper to base a library on, this paper is much more recent, and the BLIS folks seem to have great performance.

To avoid allocations, you could define large const arrays and use a reinterpreted pointer (they’re global constants; they won’t get GCed) to remain generic across element types. I’d add a Sys.CPU_THREADS factor to their size, and have each thread use its respective chunk.
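A rough sketch of the packing plus preallocated-buffer idea (the 96×96 chunk size and the names are made up; a real implementation would size the buffer from cache/TLB parameters and reinterpret it for other element types):

using Base.Threads: threadid

# Rough sketch only: one preallocated buffer, one chunk per thread.
const PACK_BUFFER = Vector{Float64}(undef, 96 * 96 * Sys.CPU_THREADS)

# Copy a (possibly strided) submatrix A[rows, cols] into this thread's contiguous chunk.
function pack_block(A::AbstractMatrix{Float64}, rows::UnitRange{Int}, cols::UnitRange{Int})
    m, n = length(rows), length(cols)
    offset = (threadid() - 1) * 96 * 96
    buf = reshape(view(PACK_BUFFER, offset + 1 : offset + m * n), m, n)
    copyto!(buf, view(A, rows, cols))  # O(m*n) copy, amortized over O(m*n*k) flops
    return buf
end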

It also has side benefits of being the time/place to do things like:

  1. Align the columns, so that all sizes get consistently high performance (like MKL), instead of only at multiples of 8, as in these benchmarks.
  2. Convert arrays of structs into structs of arrays (see the small StructArrays example below).

It may be worth playing with at some point to see if it can improve performance at sizes > 500.
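For the AoS-to-SoA point, StructArrays already provides the conversion; a tiny example (Complex chosen purely for illustration):

using StructArrays

# AoS -> SoA: a Vector{ComplexF64} exposed as two contiguous Float64 arrays,
# which is the layout SIMD loops want.
aos = [ComplexF64(rand(), rand()) for _ in 1:8]
soa = StructArray(aos)
soa.re  # contiguous real parts
soa.im  # contiguous imaginary parts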

As an aside, LLVM currently generates imperfect kernel code: when performing the fmas c = a * b + c, it would sometimes overwrite a or b instead of c (i.e. a = a * b + c; c = a), generating an extra move instruction to shuffle registers around each time it did this (3 times per loop seemed common). Someone made a PR that fixes the problem.
Unfortunately, that PR also causes some regressions in the LLVM test cases, but it sounds like folks view it as a net positive. They may come up with a new heuristic that fixes this problem while avoiding the regressions.
Once it’s fixed, the generated assembly should be just as good as hand-written asm kernels – but much more generic! The OpenBLAS shipping with Julia still doesn’t have AVX512 kernels for Float64.

What are your findings on 104 versus 96?
96 is a multiple of both 32 (# of rows in the AVX512 kernels) and 12 (# of rows in the AVX2 kernels), so I’d have thought it would work well. You could also add hard-coded kernel96 functions:

julia> using BenchmarkTools, LinearAlgebra, LoopVectorization; BLAS.set_num_threads(1)

julia> C96 = Matrix{Float64}(undef, 96, 96);

julia> A96 = rand(Float64, 96, 96); B96 = rand(Float64, 96, 96);

julia> function gemm_kernel!(C, A, B)
           @avx for n ∈ 1:size(A, 1), m ∈ 1:size(B, 2)
               Cₙₘ = zero(eltype(C))
               for k ∈ 1:size(A, 2)
                   Cₙₘ += A[n,k] * B[k,m]
               end
               C[n,m] = Cₙₘ
           end
       end
gemm_kernel! (generic function with 1 method)

julia> function gemm_kernel96!(C, A, B)
           @avx for n ∈ 1:96, m ∈ 1:96
               Cₙₘ = zero(eltype(C))
               for k ∈ 1:96
                   Cₙₘ += A[n,k] * B[k,m]
               end
               C[n,m] = Cₙₘ
           end
       end
gemm_kernel96! (generic function with 1 method)

julia> @benchmark mul!($C96, $A96, $B96)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.675 μs (0.00% GC)
  median time:      14.720 μs (0.00% GC)
  mean time:        14.784 μs (0.00% GC)
  maximum time:     59.612 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_kernel!($C96, $A96, $B96)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.979 μs (0.00% GC)
  median time:      16.076 μs (0.00% GC)
  mean time:        16.100 μs (0.00% GC)
  maximum time:     57.861 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_kernel96!($C96, $A96, $B96)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.639 μs (0.00% GC)
  median time:      14.738 μs (0.00% GC)
  mean time:        14.764 μs (0.00% GC)
  maximum time:     52.166 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

You’d still need the unsized kernel to handle remainders. I’m a little skeptical about whether splitting it up this way will actually be faster; the fast loop in the 96 version is also inside the generic version, just with the logic of checking what to do (and the alternative code paths) stripped out. But the generic matmul function of course needs those checks, so I wouldn’t bet that moving them from one place to another actually improves performance.

julia> M = K = N = 104
104

julia> C = Matrix{Float64}(undef, M, N);

julia> A = rand(Float64, M, K); B = rand(Float64, K, N);

julia> function gemm_kernel104!(C, A, B)
           @avx for n ∈ 1:104, m ∈ 1:104
               Cₙₘ = zero(eltype(C))
               for k ∈ 1:104
                   Cₙₘ += A[n,k] * B[k,m]
               end
               C[n,m] = Cₙₘ
           end
       end
gemm_kernel104! (generic function with 1 method)

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.707 μs (0.00% GC)
  median time:      19.845 μs (0.00% GC)
  mean time:        19.921 μs (0.00% GC)
  maximum time:     62.273 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_kernel!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.451 μs (0.00% GC)
  median time:      21.578 μs (0.00% GC)
  mean time:        21.621 μs (0.00% GC)
  maximum time:     63.850 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_kernel104!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.697 μs (0.00% GC)
  median time:      20.792 μs (0.00% GC)
  mean time:        20.822 μs (0.00% GC)
  maximum time:     57.917 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Performance at 96 seems a little better relative to an O(N^3) baseline than at 104:

julia> (104/96)^3 * 16
20.342592592592588

julia> (104/96)^3 * 14.7
18.68975694444444

But differences this small could easily be swamped by the threading overhead, so I could see that being the deciding factor in 104’s favor.

I’m still working on revamping PaddedMatrices. LoopVectorization allowed me to delete more lines of code from PaddedMatrices than are in LoopVectorization itself. The chief benefits are (a) optionally padding columns to avoid the multiples-of-8 phenomenon (I still need to actually run benchmarks) and (b) optional sizing on a dim-by-dim basis. Broadcasting defaults to LoopVectorization’s broadcasts, the RNG uses VectorizedRNG (which I’m currently updating to use a Xorshift by default instead of the PCG it uses currently), etc.

I would be interested in making Gaius a dependency, using Gaius.* and Gaius.mul!.

Can you post the script you used to run those benchmarks and create the plots?
I have a machine with AVX512 (and 18 cores/36 threads) to test on. I also have Julia 1.5 with MKL and Julia 1.4 with OpenBLAS if you’d like comparisons with both.
OpenBLAS will get more interesting once Julia updates to OpenBLAS 3.8, since it sounds like their sgemm and dgemm may finally be improving.

@JeffreySarnoff That would only add support for Int128. It’d take different work for struct types that contain a mix of already supported types.
A good solution would involve:

  1. Add functionality to LoopVectorization to support the possibility of vectors needing N registers, instead of 1, and deducing what N is for these types.
  2. Mapping those types to a StructVec type, perhaps similar to StructArrays. It’d be a struct holding a tuple of Vecs.
  3. Somehow having operations work on these StructVecs as they should. If c in c = a * b is defined as c.re = a.re * b.re - a.im * b.im; c.im = a.re * b.im + a.im * b.re, somehow dispatch on that.

Anyone have ideas on how this can be done? I think 3. is the most difficult, and I don’t have an idea of how to do this automatically.
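For concreteness, a rough sketch of what 2. and 3. might look like (all names here are hypothetical; nothing like this exists in LoopVectorization yet):

# Hypothetical sketch. `V` stands in for a SIMD vector type (e.g. a Vec from
# VectorizationBase); `StructVec` and `vmul` are made-up names.
struct StructVec{N,V}
    data::NTuple{N,V}  # one SIMD vector per field, e.g. (re, im) for Complex
end

# Point 3 for Complex: lower c = a * b to the re/im recombination.
function vmul(a::StructVec{2}, b::StructVec{2})
    are, aim = a.data
    bre, bim = b.data
    StructVec((are * bre - aim * bim, are * bim + aim * bre))
end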

18 Likes

Okay, so I was actually mistaken about 104 being optimal. I pushed a change to the repo that changes the default block size from 104 to 64 and this is my new graph for Float64:

[plot: F64_mul (Float64 matrix multiplication benchmark)]

Edit: Disregard the above plot, I think it’s in error, I’ll look more carefully in the morning

Absolutely crushing OpenBLAS.

I’m running the Float32 and integer benchmarks now and I’ll update the README with them when they finish, but it appears to be a much better choice than 104.

That’s a great suggestion, I’ll try to include some kernel64 functions.

The TLB stuff is a bit beyond me for the near future, but I’ll keep it in mind as a future goal once I have time to do some reading and learning.

Here you go:

using StructArrays, BenchmarkTools, LinearAlgebra, Gaius  # blocked_mul! below is Gaius's multiply
times = (StructVector ∘ map)([8:8:120..., round.(Int, (10:30) .^2.2)...]) do sz
    n, k, m = sz, sz, sz
    C1 = zeros(n, m)
    C2 = zeros(n, m)
    C3 = zeros(n, m)
    A  = randn(n, k)
    B  = randn(k, m)

    lvb = @belapsed blocked_mul!($C1, $A, $B) samples=100
    opb = @belapsed mul!($C2, $A, $B)         samples=100
    (matrix_size=sz, lvBLAS=lvb, OpenBLAS=opb)
end

using Plots
plot( times.matrix_size, times.lvBLAS, legend=:bottomright, 
      label="Gaius.jl", yscale=:log10, xscale=:log10, fmt=:png, xlabel="Matrix Dimension", ylabel="Runtime (seconds)",
      title="Float64 Matrix Multiplication")
plot!(times.matrix_size, times.OpenBLAS, label="OpenBLAS")

and change as appropriate for other types. For the integer types I did rand(-30:30, n, k) and rand(Int.(-30:30), n, k).

Hah, I was actually looking into depending on PaddedMatrices. :stuck_out_tongue: That works too.

4 Likes

I’m cheering for you guys! It feels like LoopVectorization is a time machine that lets us jump to what I thought would be a far-distant future!

4 Likes

The Julia generic matmatmul is already blocked; does the fact that you get similar performance until you turn on threading mean that the only real advantage for integer-matrix multiplication comes from threading? What if you add threading to the generic algorithm in Julia? (The reason for my question is that LoopVectorization is not a tiny dependency.)
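For example, a rough sketch of what I mean by threading the generic algorithm: split the columns of C across threads and run the existing serial generic matmul (what mul! dispatches to for integer eltypes) on each slice.

using LinearAlgebra, Base.Threads

# Rough sketch: each thread runs the existing (serial, blocked) generic matmul
# on its own slice of C's columns.
function threaded_generic_mul!(C, A, B)
    n = size(C, 2)
    splits = [round(Int, i * n / nthreads()) for i in 0:nthreads()]
    @threads for t in 1:nthreads()
        cols = (splits[t] + 1):splits[t + 1]
        @views mul!(C[:, cols], A, B[:, cols])
    end
    return C
end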

1 Like

I don’t know what architecture Mason is running on, but SIMD integer operations require AVX2 or AVX512DQ (AVX512 double- and quad-word) extensions.
Currently, LoopVectorization isn’t checking for these; it implicitly assumes that if you have AVX, you also have AVX2, and that if you have AVX512F (Foundation), you also have DQ.
Ivy Bridge and Sandy Bridge both have AVX but not AVX2.
Xeon Phi Knights Landing and Knights Mill would fall into the AVX512 crack of not having DQ. These were Intel’s attempt at making x86 good at GPU-style workloads (some fit in PCIe slots, and they had 56-72 cores).

It will generate slow integer code on those architectures, but thankfully LLVM will still generate correct code rather than crash or yield wrong answers in most cases.
Unfortunately, that isn’t always guaranteed. Functions like vfilter, provided by LoopVectorization, do depend on AVX512 to be fast, and IIRC it will crash on LLVM 8 if you don’t have it, but with LLVM 9 it will work correctly regardless. Julia’s master branch is already on LLVM 9.
Now that I write this out, it seems obvious to just fall back on regular filter.
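That is, something along these lines (the detection constant is a placeholder for whatever the real hardware check ends up being):

using LoopVectorization: vfilter

const HAS_AVX512 = false  # placeholder; real detection would come from VectorizationBase/CpuId
safe_filter(f, x) = HAS_AVX512 ? vfilter(f, x) : filter(f, x)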

I find @avx to be several times faster with integers:

julia> using LinearAlgebra, LoopVectorization, BenchmarkTools

julia> function gemm_kernel!(C, A, B)
           @avx for n ∈ 1:size(A, 1), m ∈ 1:size(B, 2)
               Cₙₘ = zero(eltype(C))
               for k ∈ 1:size(A, 2)
                   Cₙₘ += A[n,k] * B[k,m]
               end
               C[n,m] = Cₙₘ
           end
       end
gemm_kernel! (generic function with 1 method)

julia> A = rand(-100:200, 200, 200);

julia> B = rand(-100:200, 200, 200);

julia> C = similar(A);

julia> C2 = similar(A);

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  336 bytes
  allocs estimate:  6
  --------------
  minimum time:     2.024 ms (0.00% GC)
  median time:      2.068 ms (0.00% GC)
  mean time:        2.057 ms (0.00% GC)
  maximum time:     2.173 ms (0.00% GC)
  --------------
  samples:          2428
  evals/sample:     1

julia> @benchmark gemm_kernel!($C2, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     576.202 μs (0.00% GC)
  median time:      579.474 μs (0.00% GC)
  mean time:        579.663 μs (0.00% GC)
  maximum time:     670.881 μs (0.00% GC)
  --------------
  samples:          8598
  evals/sample:     1

julia> C == C2
true

vfilter:

julia> vfilter(isodd, A) == filter(isodd, A)
true

julia> @benchmark vfilter(isodd, $A)
BenchmarkTools.Trial:
  memory estimate:  312.58 KiB
  allocs estimate:  2
  --------------
  minimum time:     9.604 μs (0.00% GC)
  median time:      11.239 μs (0.00% GC)
  mean time:        13.786 μs (16.48% GC)
  maximum time:     551.841 μs (92.44% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark filter(isodd, $A)
BenchmarkTools.Trial:
  memory estimate:  312.58 KiB
  allocs estimate:  3
  --------------
  minimum time:     18.864 μs (0.00% GC)
  median time:      23.070 μs (0.00% GC)
  mean time:        26.286 μs (9.99% GC)
  maximum time:     648.130 μs (89.34% GC)
  --------------
  samples:          10000
  evals/sample:     1

@carstenbauer But I confess that it would be great if MLIR rendered much of LoopVectorization obsolete. Another motivation for LoopVectorization was having a model of loops that lets me autodiff them for the (still in development and totally broken) ProbabilityModels MCMC library.

5 Likes

Alas, it was not meant to be. Fun fact: if you use a macro that spawns tasks but is not named Threads.@spawn, it turns out that @sync will ignore those tasks and let them go on their merry way, possibly creating race conditions. This made it look like I was beating OpenBLAS, when in fact I was producing random errors due to race conditions.

Oops! Always read the docs folks and run your tests before you get too excited :sweat_smile:
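For anyone curious, the underlying behavior is that @sync only waits for tasks spawned lexically inside its block. Here is a toy demo of the same failure mode (not the actual Gaius code):

using Base.Threads

# A task launched through a helper (or a differently named macro) is not
# registered with the enclosing @sync, which only tracks lexically enclosed spawns.
spawn_helper() = Threads.@spawn (sleep(0.1); println("helper task finished"))

@sync begin
    Threads.@spawn println("tracked task")  # waited on
    spawn_helper()                          # NOT waited on
end
println("@sync returned")  # can print before "helper task finished"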

Yes, I only have access to AVX2 machines currently. I’d be eager to see AVX512 results if someone wants to run the benchmarks on a machine with those instructions.

1 Like

FWIW, I suggested 96, not 64. 64 isn’t divisible by 12, after all.
I ran the benchmarks using 64, and the results didn’t look good; MKL was about 4 times faster.
I also added an @assert that the results are approximately equal; I didn’t get any assertion errors.

julia> using BenchmarkTools

julia> using Gaius: mul!

julia> A = rand(1000, 1000);

julia> B = rand(1000, 1000);

julia> C = similar(A);

julia> @benchmark mul!($C, $A, $B, block_size = 64)
BenchmarkTools.Trial:
  memory estimate:  4.58 MiB
  allocs estimate:  61398
  --------------
  minimum time:     3.167 ms (0.00% GC)
  median time:      4.209 ms (0.00% GC)
  mean time:        5.040 ms (17.55% GC)
  maximum time:     36.717 ms (90.62% GC)
  --------------
  samples:          991
  evals/sample:     1

julia> @benchmark mul!($C, $A, $B, block_size = 96)
BenchmarkTools.Trial:
  memory estimate:  1.52 MiB
  allocs estimate:  19348
  --------------
  minimum time:     2.598 ms (0.00% GC)
  median time:      3.336 ms (0.00% GC)
  mean time:        3.619 ms (7.38% GC)
  maximum time:     28.206 ms (87.08% GC)
  --------------
  samples:          1379
  evals/sample:     1

julia> @benchmark mul!($C, $A, $B, block_size = 104)
BenchmarkTools.Trial:
  memory estimate:  1.34 MiB
  allocs estimate:  16963
  --------------
  minimum time:     3.221 ms (0.00% GC)
  median time:      4.185 ms (0.00% GC)
  mean time:        4.453 ms (5.23% GC)
  maximum time:     24.437 ms (82.67% GC)
  --------------
  samples:          1124
  evals/sample:     1

julia> @benchmark mul!($C, $A, $B, block_size = 128)
BenchmarkTools.Trial:
  memory estimate:  762.89 KiB
  allocs estimate:  9212
  --------------
  minimum time:     2.287 ms (0.00% GC)
  median time:      3.029 ms (0.00% GC)
  mean time:        3.176 ms (3.95% GC)
  maximum time:     24.112 ms (87.43% GC)
  --------------
  samples:          1571
  evals/sample:     1

julia> @benchmark mul!($C, $A, $B, block_size = 160)
BenchmarkTools.Trial:
  memory estimate:  441.69 KiB
  allocs estimate:  5212
  --------------
  minimum time:     2.311 ms (0.00% GC)
  median time:      3.277 ms (0.00% GC)
  mean time:        3.363 ms (1.86% GC)
  maximum time:     19.074 ms (81.13% GC)
  --------------
  samples:          1484
  evals/sample:     1

I’ll rerun the benchmarks with a block size of 96.

EDIT: Looking at Gaius’s source, I think I misunderstood the meaning of block size.
Seems that half of the block size is what’s actually used.

Larger blocks may be better for large matrices, but don’t yield as much threading for small ones.

AVX512 definitely requires the block_size parameter to be divisible by 16 then, so that the half-blocks are at least divisible by 8.

2 Likes

Yeah, I tried 96 and didn’t find it as good on my system as 104, but maybe that difference is made up for with AVX512.

In the future, I’ll probably need to do some sort of hardware detection to decide optimal block sizing.

I made a few changes that I’ll push in a little while, one of which is to choose between 104 and 128 based on whether or not the computer has AVX512.

VectorizationBase, a dependency of LoopVectorization, has lots of useful hardware info, including cache sizes (it uses CpuId.jl to create a .jl file listing it all while building). I could add more.
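For example (constant and function names as of the current versions; they may change):

using VectorizationBase, CpuId

VectorizationBase.REGISTER_SIZE  # SIMD register width in bytes: 64 means AVX512, 32 means AVX2
cachesize()                      # (L1d, L2, L3) cache sizes in bytes, from CpuId.jl
cpucores()                       # physical core count, from CpuId.jl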

Float64:


Float32:

Int64:

Int32:

I may run benchmarks over a broader array of sizes later.
At 1777 x 1777 (the largest size in the plots), MKL was about 3 times faster.

Also, it’s good to know that blocking is a clear win in terms of single-threaded performance, and that (after my commit) all allocations are a result of Julia’s threading:

julia> using Gaius, LinearAlgebra, BenchmarkTools, LoopVectorization

julia> const blocked_mul! = Gaius.mul!
mul! (generic function with 2 methods)

julia> function gemm_kernel!(C, A, B)
           @avx for n ∈ 1:size(A, 1), m ∈ 1:size(B, 2)
               Cₙₘ = zero(eltype(C))
               for k ∈ 1:size(A, 2)
                   Cₙₘ += A[n,k] * B[k,m]
               end
               C[n,m] = Cₙₘ
           end
       end
gemm_kernel! (generic function with 1 method)

julia> M = 1000
1000

julia> A = rand(M, M); B = rand(M, M); C = similar(A);

julia> BLAS.set_num_threads(1); Threads.nthreads()
1

julia> @benchmark gemm_kernel!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     85.981 ms (0.00% GC)
  median time:      86.109 ms (0.00% GC)
  mean time:        86.159 ms (0.00% GC)
  maximum time:     88.160 ms (0.00% GC)
  --------------
  samples:          59
  evals/sample:     1

julia> @benchmark blocked_mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.458 ms (0.00% GC)
  median time:      30.486 ms (0.00% GC)
  mean time:        30.435 ms (0.00% GC)
  maximum time:     32.968 ms (0.00% GC)
  --------------
  samples:          165
  evals/sample:     1

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.643 ms (0.00% GC)
  median time:      18.681 ms (0.00% GC)
  mean time:        18.689 ms (0.00% GC)
  maximum time:     19.385 ms (0.00% GC)
  --------------
  samples:          268
  evals/sample:     1
8 Likes

Oh, and…

julia> M = 2000;

julia> A = rand(M, M); Bdim3 = rand(2, M, M); C = similar(A);

julia> Bdim2 = @view(Bdim3[1,:,:]);

julia> @benchmark blocked_mul!($C, $A, $Bdim2)
BenchmarkTools.Trial:
  memory estimate:  2.05 MiB
  allocs estimate:  16387
  --------------
  minimum time:     29.341 ms (0.00% GC)
  median time:      30.528 ms (0.00% GC)
  mean time:        30.671 ms (0.00% GC)
  maximum time:     40.037 ms (0.00% GC)
  --------------
  samples:          164
  evals/sample:     1

julia> C2 = similar(C);

julia> @benchmark mul!($C2, $A, $Bdim2)
BenchmarkTools.Trial:
  memory estimate:  336 bytes
  allocs estimate:  6
  --------------
  minimum time:     6.239 s (0.00% GC)
  median time:      6.239 s (0.00% GC)
  mean time:        6.239 s (0.00% GC)
  maximum time:     6.239 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

julia> C ≈ C2
true

Seems BLAS requires the first stride to be 1?

Maybe there should be a fallback definition that copies the SubArray in that case?

julia> @benchmark mul!($C2, $A, copy($Bdim2))
BenchmarkTools.Trial:
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     15.818 ms (0.00% GC)
  median time:      16.057 ms (0.00% GC)
  mean time:        16.549 ms (0.56% GC)
  maximum time:     22.676 ms (15.01% GC)
  --------------
  samples:          302
  evals/sample:     1
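A sketch of such a fallback (a hypothetical helper, not an existing Gaius or LinearAlgebra method):

using LinearAlgebra

# If B's first stride isn't 1, BLAS can't take it directly, so materialize a
# contiguous copy first; the O(K*N) copy is cheap next to the O(M*K*N) multiply.
function mul_maybe_copy!(C, A, B::StridedMatrix)
    Bc = stride(B, 1) == 1 ? B : copy(B)
    return mul!(C, A, Bc)
end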
3 Likes

Fantastic!

@Elrod How did you produce those plots? I could never get log-log minor grid lines working in Plots.jl.

Wow, Gaius is faster than MKL for small matrices! Nice job, @Mason and @Elrod!

6 Likes

I wonder if they use the correct flags for MKL to be optimized for small matrices.
See Intel Math Kernel Library Improved Small Matrix Performance Using Just-in-Time (JIT) Code Generation for Matrix Multiplication (GEMM).

@Elrod, Do you use the above or just MKL.jl?

It is a great accomplishment to beat MKL / OpenBLAS like you do.

Benchmarks vs OpenBLAS, Float64:


And Float32:

This was with Julia 1.4.2, which I believe is using OpenBLAS v0.3.5 (based on the deps/openblas.version file). The latest version, v0.3.8, added more AVX512 kernels so it should be faster.
OpenBLAS is many times slower than MKL on this CPU.

Int64 over a broader range of sizes; these took a long time to run:


Oddly, Julia’s generic matmul took around twice as long for Int32, while Gaius was about twice as fast:

I built Julia from source with USE_MKL=1 in my Make.user file.
It doesn’t use the JIT.

I experimented with the JIT before. All that’s needed is the Intel compilers and:

module jitmul

include "mkl_direct_call.fi"

use ISO_C_BINDING
implicit none

contains

    ! subroutine dgemmjit(C,A,B,M,K,N) bind(C, name = "dgemmjit")
    subroutine dgemmjit(C,A,B,M,K,N,alpha,beta) bind(C, name = "dgemmjit")
      integer,                        intent(in)  :: M, K, N
!      real(C_double), parameter                   :: alpha = 1.0D0, beta = 0.0D0
      real(C_double),                 intent(in)  :: alpha, beta
      real(C_double), dimension(M,K), intent(in)  :: A
      real(C_double), dimension(K,N), intent(in)  :: B
      real(C_double), dimension(M,N), intent(out) :: C

      call dgemm('N', 'N', M, N, K, alpha, A, M, B, K, beta, C, M)

    end subroutine dgemmjit

end module jitmul

You can compile with
ifort -fast -mkl -fpp -DMKL_DIRECT_CALL_SEQ_JIT -shared -fPIC fortran_mul.f90 -o libgemmjit.so
From there, just ccall dgemmjit. It did very well when I tested it.
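The Julia side can look roughly like this (a sketch; Fortran passes arguments by reference, and its default integer is 4 bytes):

# Assumes libgemmjit.so is somewhere the loader can find it.
const libgemmjit = "libgemmjit.so"

function dgemmjit!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64},
                   alpha::Float64 = 1.0, beta::Float64 = 0.0)
    M, K = size(A); N = size(B, 2)
    ccall((:dgemmjit, libgemmjit), Cvoid,
          (Ptr{Float64}, Ptr{Float64}, Ptr{Float64},
           Ref{Cint}, Ref{Cint}, Ref{Cint}, Ref{Float64}, Ref{Float64}),
          C, A, B, M, K, N, alpha, beta)
    return C
end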

I could add some of these to LoopVectorization’s benchmarks, which cover square matrices at all sizes from 2 through 256.

9 Likes

I love this thread, amazing work! I just have a minor suggestion, if you don’t mind. I have run into this problem multiple times before: there seems to be no efficient way of doing A' * B * C where B is a symmetric matrix and A and C are tall matrices. Julia now has 3-arg dot methods to do this, but only for vectors A and C (add generalized dot product by dkarrasch · Pull Request #32739 · JuliaLang/julia · GitHub). If there were a way to define the generalized kernel for matrix A and C using LoopVectorization, where B is a sparse symmetric matrix, that would be extremely useful for speeding up https://github.com/JuliaLinearAlgebra/AlgebraicMultigrid.jl and LOBPCG in IterativeSolvers.jl, among other things where the bottleneck is the operation A' * B * C and usually A === C.
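For reference, the operation I mean is D = A' * B * C, i.e. D[m, n] = Σᵢⱼ A[i, m] * B[i, j] * C[j, n]; a naive dense version just to pin down the semantics (not a fast implementation):

using LinearAlgebra

# Naive dense reference: no exploitation of B's symmetry or sparsity.
function atbc!(D, A, B, C)
    tmp = B * C       # size(B, 1) × size(C, 2) intermediate
    mul!(D, A', tmp)  # D = A' * (B * C)
    return D
end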

3 Likes

Agreed, that’s a function lacking in BLAS: the ability to compute A*B while asserting that the product is in fact symmetric. Maybe Intel’s MKL could help; never tried it.

5 Likes

I think I can pretty easily support a special product like A' * B * C where B is a symmetric matrix and A and C are tall matrices, if all the matrices are dense.

All this stuff is pretty heavily reliant on the idea of dense matrices. Efficient sparse linear algebra is a whole different ballgame which I know even less about than dense linear algebra (and I’m really out of my element even in the dense case!)

1 Like