Why mul! is so fast?

Apologies for all those deleted posts. I keep accidentally hot-keying post before I’m actually done. =/

We can actually get much faster with a pure Julia implementation, at least if we use statically sized arrays:

julia> using PaddedMatrices, BenchmarkTools

julia> A = @Mutable randn(50,50);

julia> x = @Mutable randn(50);

julia> y = FixedSizeVector{50,Float64}(undef);

julia> Aa = Array(A); xa = Array(x); ya = Array(y);

julia> @benchmark mul!($y, $A, $x) #mul! defined in PaddedMatrices
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     58.239 ns (0.00% GC)
  median time:      58.377 ns (0.00% GC)
  mean time:        61.044 ns (0.00% GC)
  maximum time:     95.340 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     983

julia> @benchmark mul!($ya, $Aa, $xa) # LinearAlgebra.mul!; BLAS
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     192.109 ns (0.00% GC)
  median time:      202.958 ns (0.00% GC)
  mean time:        210.614 ns (0.00% GC)
  maximum time:     2.324 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     641

The key advantage at these small sizes is not cache efficiency, but increasing the number % of time spent computing (fma instructions) relative to moving memory in and out or registers.
Lets look at your code:

function mymuloriginal!(y, A, x)
    y .= 0.0
    @inbounds for j = 1:length(x)
        @simd for i = 1:length(y)
            y[i] += A[i,j] * x[j]
        end
    end
end

The inner loop requires loading each element of y, and a column of A. Every element of y therefore gets loaded length(x) times. That is the #1 problem.
If the inner loop isn’t unrolled (beyond vectorizing), each iteration of the inner loop will do the following:

  1. Load a vector from A.
  2. Load a vector from y.
  3. update y with an fma instruction: ynew = A * x + yold
  4. store ynew.

Most modern x86_64 CPUs can perform up to 2 aligned vector loads / clock cycle, or just 1 load the crosses a 64 byte boundary. With avx2 (32 bytes), half of unaligned loads would cross this boundary (unless the loads happen to be aligned by chance). With avx512 (64 byte loads), all would cross that boundary, unless they’re aligned by chance*.

  • In Julia, a small vector of doubles is aligned to 8 byte boundaries, meaning 1/4 chance that they’ll be aligned to a 32 byte boundary (guaranteeing avx2 never crosses 64 byte boundaries), and 1/8 for avx512. Big arrays will be 64-byte aligned.

These CPUs can also perform up to 2 fma (fused-multiply-add) instructions / clock cycle.
But, above, we already wrote down that they’re doing 2 loads per 1 fma, so the CPU will hit the cap on the number of loads, and can do at most a single fma per clock cycle (less if the loads cross those 64-byte alignment boundaries).

With that in mind, how can we speed up your mul?
Let’s avoid loading and storing into y as much as possible. Note that we’ll still want to do several vectors of y at a time, because each iteration of the j loop depends on the former; we don’t want to be bottle necked by instruction latency either.

You’ll need to pick an unroll factor.
4 * SIMD width is a good choice. For avx2 with double precision, SIMD width is 4, so I use 16 here (although I’m on a computer with 32 is actually a better choice in general).

function mymul!(y, A, x)
    y .= 0.0
    N = length(y)
    Nrep, Nrem = divrem(N, 16)
    ioffset = 0
    @inbounds for _ = 1:Nrep
        Base.Cartesian.@nexprs 16 i -> y_i = y[i+ioffset]
        for j = 1:length(x)
            Base.Cartesian.@nexprs 16 i -> y_i = muladd(A[i+ioffset,j], x[j], y_i)
        end
        Base.Cartesian.@nexprs 16 i -> y[i+ioffset] = y_i
        ioffset += 16
    end
    @inbounds for j = 1:length(x)
        for i = 1:Nrem
            y[i+ioffset] += A[i+ioffset,j] * x[j]
        end
    end
end

Let’s time it

julia> @benchmark mymul!($ya, $Aa, $xa)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     257.567 ns (0.00% GC)
  median time:      271.970 ns (0.00% GC)
  mean time:        284.745 ns (0.00% GC)
  maximum time:     405.791 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     349

vs for your original version:

julia> @benchmark mymuloriginal!($ya, $Aa, $xa)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     699.648 ns (0.00% GC)
  median time:      703.262 ns (0.00% GC)
  mean time:        733.009 ns (0.00% GC)
  maximum time:     4.549 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     145

So we’re a lot closer to BLAS, but still way behind PaddedMatrices.
To try and at least beat BLAS, lets be more clever about how we handle the remainder.

I’ll also get a little silly with some other optimizations, namely note that

julia> 117 >>> 2 == 117 ÷ 4
true

julia> 117 & 3 == 117 % 4
true

These are a faster way to calculate the number of repetitions and remainders we need to break down the calculation into successively smaller chunks, but they only work for powers of 2. We’ll go for three chunks:

  1. A fast chunk, of 4W. We’ll do length(y) ÷ (4W) of these.
  2. Chunks of size W. We’ll do (length(y) % (4W)) ÷ W of these.
  3. The remainder. We’ll do 1 of these if length(y) % W > 0, 0 otherwise.

We do the first fast chunk in batches of 4, because it takes a few clock cycles (4 on my computer) for an fma instruction to actually complete. Because each iteration j in your loop depends on the previous, we’d have to wait for one to complete before starting the other. By having 4 going on at the same time, we’ll do a lot less waiting.
FLOPS will actually theoretically peak at 8 concurrent fma instructions (2 fma/clock cycle * 4 cycle latency), but 8W can be a big number. On my computer, W is actually 8, so 8W == 64, which means that with a length(y) == 50 like in this example, I wouldn’t be doing the fast iteration at all. And not doing the fast iteration is slow, not fast.
Another advantage is that the bigger this number, the more we get to reuse a value we load from x. So this is a number you could try playing with.

I’ll also add two dependencies that you can install with:

using Pkg
Pkg.add(PackageSpec(url="https://github.com/chriselrod/VectorizationBase.jl"))
Pkg.add(PackageSpec(url="https://github.com/chriselrod/SIMDPirates.jl"))
using VectorizationBase, SIMDPirates
const W, Wshift = VectorizationBase.pick_vector_width_shift(Float64) # 8, 3 on this computer

Note that Wshift = log2(W).
This will make it easier to explicitly do the batches of size W.

function mymulfinal!(y, A, x)
    N = length(y)
    Nrep = N >>> (Wshift + 2)
    Nrem = N & (4W - 1)
    ioffset = 0
    GC.@preserve y A begin # protect A and y from GC
    vptrA = vectorizable(A); vptry = vectorizable(y)
    # We calculate 4W at a time
    for _ = 1:Nrep
        Base.Cartesian.@nexprs 4 i -> vy_i = vbroadcast(Vec{W,Float64}, 0.0)
        for j = 1:length(x)
            @inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
            Base.Cartesian.@nexprs 4 i -> begin
                 vA_i = vload(Vec{W,Float64}, vptrA, W*(i-1) + ioffset + (j-1)*stride(A,2))
                 vy_i = vmuladd(vA_i, vx, vy_i)
            end
        end
        Base.Cartesian.@nexprs 4 i -> vstore!(vptry, vy_i, W*(i-1) + ioffset)
        ioffset += 4W
    end
    # Now we do W at a time
    Nrep = Nrem >>> Wshift
    Nrem &= (W - 1)
    for _ = 1:Nrep
        vy = vbroadcast(Vec{W,Float64}, 0.0)
        for j = 1:length(x)
            @inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
             vA = vload(Vec{W,Float64}, vptrA, ioffset + (j-1)*stride(A,2))
             vy = vmuladd(vA, vx, vy)
        end
        vstore!(vptry, vy, ioffset)
        ioffset += W
    end
    Nrem == 0 && return y
    mask = VectorizationBase.mask(Float64, Nrem)
    vy = vbroadcast(Vec{W,Float64}, 0.0)
    for j = 1:length(x)
        @inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
         vA = vload(Vec{W,Float64}, vptrA, ioffset + (j-1)*stride(A,2), mask)
         vy = vmuladd(vA, vx, vy)
    end
    vstore!(vptry, vy, ioffset, mask)
    end # GC.@preserve
    y
end

A few explanations:

  • The vload and vstore! functions operate on pointers, which is why we need GC.@preserve. Julia’s garbage collector does not track pointers, and we don’t want to risk it ever thinking we’re not using A or y anymore, while we’re using pointers to their memory.
  • The vectorizable function actually returns a wrapper to a pointer that supports “pointer arithmetic”, so that we don’t have to multiply by the number of bytes.
  • vbroadcast creates a vector where every element is the argument.
  • vload loads a vector of consecutive elements. If we give it a mask, it’ll only load the elements where the mask is set to 1. This lets us avoid reading past the end of the vector / a matrix.
  • vstore! is similar, but for storing a vector.
  • Finally, we stop loading from y altogether, so that we also don’t have to set it to zero.

Benchmarking:

julia> @benchmark mymulfinal!($ya, $Aa, $xa)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     181.564 ns (0.00% GC)
  median time:      181.800 ns (0.00% GC)
  mean time:        185.280 ns (0.00% GC)
  maximum time:     384.583 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     690

julia> using LinearAlgebra; BLAS.vendor()
:mkl

Just edging out MKL (where we had 192ns minimum). :slight_smile:

64 Likes