[ANN] LoopVectorization

This looks great.

I’d definitely recommend looking into packing.
Even though matrix multiplication involves only O(N^2) memory versus O(N^3) arithmetic operations, it requires careful memory management to actually realize O(N^3) scaling.

I am far from an expert on the TLB (Translation Lookaside Buffer), but a major part of the problem is that virtual-to-physical address translations are expensive; the TLB caches them, making them fast / more or less free. But it only holds so many entries.
Moving quickly through memory – like accessing successive columns of a large array – churns through TLB entries and stalls the computation as you wait for addresses to be translated.
One of the primary purposes of packing is to copy memory from disparate regions into a small contiguous block. This itself takes time, but an O(N^2) copying cost that reduces the coefficient on the O(N^3) computations (which would otherwise spend a high proportion of their time waiting) is quickly amortized for reasonably large N.
This paper goes into a lot of detail, and is by Goto himself (OpenBLAS descended from GotoBLAS). In case you’re wondering whatever happened to Goto, last I heard, he works at Intel.
To quote:

The most significant difference between a cache miss and a TLB miss is that a cache miss does not necessarily stall the CPU. A small number of cache misses can be tolerated by using algorithmic prefetching techniques as long as the data can be read fast enough from the memory where it does exist and arrives at the CPU by the time it is needed for computation. A TLB miss, by contrast, causes the CPU to stall until the TLB has been updated with the new address. In other words, prefetching can mask a cache miss but not a TLB miss.

The fundamental problem now is that A is typically a submatrix of a larger matrix, and therefore is not contiguous in memory. This in turn means that addressing it requires many more than the minimal number of TLB entries. The solution is to pack A in a contiguous work array, Ã.
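
Just to make that concrete, here is a toy sketch of packing a block (not how GotoBLAS/BLIS actually structure their packing routines; the function name is mine):

# Toy sketch: copy a (possibly strided) block of A into a contiguous buffer Ã,
# so the kernel's accesses stay within a handful of pages / TLB entries.
function pack_block!(Ã::AbstractMatrix{T}, A::AbstractMatrix{T}, rowrange, colrange) where {T}
    @inbounds for (jp, j) ∈ enumerate(colrange), (ip, i) ∈ enumerate(rowrange)
        Ã[ip, jp] = A[i, j]        # O(N²) copies in total...
    end
    return Ã
end
# ...amortized over the O(N³) flops the kernel then performs out of Ã.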

If you want a paper to base a library on, this paper is much more recent, and the BLIS folks seem to have great performance.

To avoid allocations, you could define large const arrays and use a reinterpreted pointer (they’re global constants, so they won’t get GCed) to remain generic. I’d add a Sys.CPU_THREADS factor to their size, and have each thread use its respective chunk.
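
A rough sketch of what I mean (the names and the chunk size here are my own, just to illustrate the layout):

# Sketch: one global const packing buffer, sized with a Sys.CPU_THREADS factor,
# with each thread taking a reinterpreted view of its own chunk.
const PACK_BYTES_PER_THREAD = 1 << 20   # assumed per-thread chunk size
const PACK_BUFFER = Vector{UInt8}(undef, PACK_BYTES_PER_THREAD * Sys.CPU_THREADS)

function thread_buffer(::Type{T}, n) where {T}
    @assert n * sizeof(T) ≤ PACK_BYTES_PER_THREAD
    offset = (Threads.threadid() - 1) * PACK_BYTES_PER_THREAD
    ptr = convert(Ptr{T}, pointer(PACK_BUFFER) + offset)
    # wraps the const buffer's memory; no new storage is allocated,
    # and the global const keeps it from being GCed
    return unsafe_wrap(Array, ptr, n)
end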

Packing also has the side benefit of being the time/place to do things like:

  1. Align the columns, so that all sizes get consistently high performance (like MKL) instead of only multiples of 8, as in these benchmarks (see the padding sketch below).
  2. Convert arrays of structs into structs of arrays.

It may be worth playing with at some point, to see if it can improve performance at sizes > 500.
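
For the first point, the packed buffer’s leading dimension could simply be rounded up to the SIMD width, e.g. (a sketch; assumes 64-byte AVX512 vectors):

# Round the packed leading dimension up to a multiple of the SIMD width,
# so every column starts on an aligned boundary regardless of the matrix size.
simd_width(::Type{T}) where {T} = 64 ÷ sizeof(T)   # assuming 64-byte (AVX512) vectors
padded_rows(m, ::Type{T}) where {T} = cld(m, simd_width(T)) * simd_width(T)

padded_rows(104, Float64)   # 104 (already a multiple of 8)
padded_rows(100, Float64)   # 104 – packed columns stay aligned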

As an aside, LLVM currently generates imperfect kernel code, but someone made a PR that fixes the problem. When performing the fmas c = a * b + c, LLVM would sometimes overwrite a or b instead of c (effectively a = a * b + c; c = a), generating an extra move instruction to shuffle registers around each time it did this (3 times per loop seemed common).
Unfortunately, that PR also causes some regressions in the LLVM test cases, but it sounds like folks view it as a net positive. They may come up with a new heuristic that fixes this problem while avoiding the regressions.
Assuming the PR (or a successor) gets merged, the generated assembly should be just as good as hand-written asm kernels – but much more generic! The OpenBLAS shipping with Julia still doesn’t have AVX512 kernels for Float64.

What are your findings on 104 versus 96?
96 is a multiple of both 32 (the number of rows in the AVX512 kernels) and 12 (the number of rows in the AVX2 kernels), so I’d have thought it would work well. You could also add hard-coded kernel96 functions:

julia> using BenchmarkTools, LinearAlgebra, LoopVectorization; BLAS.set_num_threads(1)

julia> C96 = Matrix{Float64}(undef, 96, 96);

julia> A96 = rand(Float64, 96, 96); B96 = rand(Float64, 96, 96);

julia> function gemm_kernel!(C, A, B)
           @avx for n ∈ 1:size(A, 1), m ∈ 1:size(B, 2)
               Cₙₘ = zero(eltype(C))
               for k ∈ 1:size(A, 2)
                   Cₙₘ += A[n,k] * B[k,m]
               end
               C[n,m] = Cₙₘ
           end
       end
gemm_kernel! (generic function with 1 method)

julia> function gemm_kernel96!(C, A, B)
           @avx for n ∈ 1:96, m ∈ 1:96
               Cₙₘ = zero(eltype(C))
               for k ∈ 1:96
                   Cₙₘ += A[n,k] * B[k,m]
               end
               C[n,m] = Cₙₘ
           end
       end
gemm_kernel96! (generic function with 1 method)

julia> @benchmark mul!($C96, $A96, $B96)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.675 μs (0.00% GC)
  median time:      14.720 μs (0.00% GC)
  mean time:        14.784 μs (0.00% GC)
  maximum time:     59.612 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_kernel!($C96, $A96, $B96)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.979 μs (0.00% GC)
  median time:      16.076 μs (0.00% GC)
  mean time:        16.100 μs (0.00% GC)
  maximum time:     57.861 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_kernel96!($C96, $A96, $B96)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     14.639 μs (0.00% GC)
  median time:      14.738 μs (0.00% GC)
  mean time:        14.764 μs (0.00% GC)
  maximum time:     52.166 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

You’d still need the unsized kernel to handle remainders. I’m a little skeptical about whether splitting it up this way will actually be faster; the fast loop in the 96 version is also inside the generic version, just with the logic for checking what to do (and the alternative code paths) stripped out. But the generic matmul function of course needs that logic, so I wouldn’t bet on moving it from one place to another actually improving performance.
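
Concretely, by “splitting” I mean something like the following untested sketch (gemm_kernel96xK! here is a hypothetical variant with a fixed 96×96 output tile but generic K), where the unsized kernel mops up the right/bottom edges:

function gemm_kernel96xK!(C, A, B)   # hypothetical: fixed 96×96 output tile, generic K
    @avx for n ∈ 1:96, m ∈ 1:96
        Cₙₘ = zero(eltype(C))
        for k ∈ 1:size(A, 2)
            Cₙₘ += A[n,k] * B[k,m]
        end
        C[n,m] = Cₙₘ
    end
end

function blocked_mul!(C, A, B)
    M, N = size(C)
    Mfull, Nfull = 96 * (M ÷ 96), 96 * (N ÷ 96)
    # full 96×96 tiles of C go to the fixed-size kernel
    for j ∈ 1:96:Nfull, i ∈ 1:96:Mfull
        @views gemm_kernel96xK!(C[i:i+95, j:j+95], A[i:i+95, :], B[:, j:j+95])
    end
    # remainders (bottom rows, then the remaining right columns) go to the unsized kernel
    Mfull < M && @views gemm_kernel!(C[Mfull+1:M, :], A[Mfull+1:M, :], B)
    Nfull < N && Mfull > 0 && @views gemm_kernel!(C[1:Mfull, Nfull+1:N], A[1:Mfull, :], B[:, Nfull+1:N])
    return C
end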

julia> M = K = N = 104
104

julia> C = Matrix{Float64}(undef, M, N);

julia> A = rand(Float64, M, K); B = rand(Float64, K, N);

julia> function gemm_kernel104!(C, A, B)
           @avx for n ∈ 1:104, m ∈ 1:104
               Cₙₘ = zero(eltype(C))
               for k ∈ 1:104
                   Cₙₘ += A[n,k] * B[k,m]
               end
               C[n,m] = Cₙₘ
           end
       end
gemm_kernel104! (generic function with 1 method)

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.707 μs (0.00% GC)
  median time:      19.845 μs (0.00% GC)
  mean time:        19.921 μs (0.00% GC)
  maximum time:     62.273 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_kernel!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.451 μs (0.00% GC)
  median time:      21.578 μs (0.00% GC)
  mean time:        21.621 μs (0.00% GC)
  maximum time:     63.850 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemm_kernel104!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.697 μs (0.00% GC)
  median time:      20.792 μs (0.00% GC)
  mean time:        20.822 μs (0.00% GC)
  maximum time:     57.917 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Performance at 96 seems a little better relative to an O(N^3) baseline than at 104; scaling the roughly 16 and 14.7 microsecond timings at 96 by (104/96)^3 predicts times slightly faster than those actually measured at 104:

julia> (104/96)^3 * 16
20.342592592592588

julia> (104/96)^3 * 14.7
18.68975694444444

But differences this small could easily be swamped by threading overhead, so I could see that being the deciding factor in 104’s favor.

I’m still working on revamping PaddedMatrices. LoopVectorization allowed me to delete more lines of code from PaddedMatrices than there are in LoopVectorization itself. The chief benefits are (a) optionally padding columns to avoid the multiples-of-8 phenomenon (I still need to actually run benchmarks) and (b) optional sizing on a dim-by-dim basis. Broadcasting defaults to LoopVectorization’s broadcasts, the RNG uses VectorizedRNG (which I’m updating to use a Xorshift by default instead of the PCG it currently uses), etc.

I would be interested in making Gaius a dependency, using Gaius.* and Gaius.mul!.

Can you post the script you used to run those benchmarks and create the plots?
I have a machine with AVX512 (and 18 cores/36 threads) to test on. I also have Julia 1.5 with MKL and Julia 1.4 with OpenBLAS if you’d like comparisons with both.
OpenBLAS will get more interesting once Julia updates to OpenBLAS 0.3.8, since it sounds like their sgemm and dgemm may finally be improving.

@JeffreySarnoff That would only add support for Int128. It’d take different work for struct types that contain a mix of already supported types.
A good solution would involve:

  1. Add functionality to LoopVectorization to support the possibility of vectors needing N registers, instead of 1, and deducing what N is for these types.
  2. Mapping those types to a StructVec type, perhaps similar to StructArrays. It’d be a struct holding a tuple of Vecs.
  3. Somehow having operations work on these StructVecs as they should. If c in c = a * b is defined as c.re = a.re * b.re - a.im * b.im; c.im = a.re * b.im + a.im * b.re, somehow dispatch on that (see the sketch below).
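
Done by hand, 2. and 3. might look something like this rough sketch (nothing like it exists in LoopVectorization yet; I’m borrowing SIMD.jl’s Vec purely for illustration), using complex multiplication as the example. The open question is generating these definitions automatically from the scalar code:

using SIMD   # Vec{W,T} used only for illustration; LoopVectorization has its own vector types

# Hypothetical StructVec-style wrapper: one Vec per field, struct-of-arrays layout.
struct ComplexVec{W,T}
    re::Vec{W,T}
    im::Vec{W,T}
end

# 3., written manually: the scalar rule for c = a * b applied to the field vectors.
Base.:*(a::ComplexVec{W,T}, b::ComplexVec{W,T}) where {W,T} =
    ComplexVec(a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re)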

Anyone have ideas on how this can be done? I think 3. is the most difficult, and I don’t have an idea of how to do this automatically.
