Why mul! is so fast?

I wrote my own (simple) version of mul!, for comparison:

function mymul!(y, A, x)
    y .= 0.0
    @inbounds for j = 1:length(x)
        @simd for i = 1:length(y)
            y[i] += A[i,j] * x[j]
        end
    end
end

However comparing this to LinearAlgebra.mul! shows that the later is much faster.

n = 50; y = randn(n); x = randn(n); A = randn(n,n);
@btime mul!($y, $A, $x);
#  459.987 ns (0 allocations: 0 bytes)
@btime mymul!($y, $A, $x);
#  935.129 ns (0 allocations: 0 bytes)

For simplicity let’s assume that we are only dealing witth dense Float64 arrays (*). In this case, what is mul! doing that makes it so fast? I know it calls BLAS. But why is it faster? Is there something one can do in Julia code to improve the performance of mymul!?

(*) Maybe after I’ve understood the dense, Float64 case, I can wonder about how to make it fast and more generic.

2 Likes

Cuz mul! for matrix-vector product actually calls BLAS, which takes care of the cache efficiency etc.

If you are interesting in writing a BLAS routine, checkout this https://github.com/flame/how-to-optimize-gemm/wiki#the-gotoblasblis-approach-to-optimizing-matrix-matrix-multiplication---step-by-step

and its pure Julia implementation: https://github.com/JuliaBLAS/JuliaBLAS.jl

4 Likes

Apologies for all those deleted posts. I keep accidentally hot-keying post before I’m actually done. =/

We can actually get much faster with a pure Julia implementation, at least if we use statically sized arrays:

julia> using PaddedMatrices, BenchmarkTools

julia> A = @Mutable randn(50,50);

julia> x = @Mutable randn(50);

julia> y = FixedSizeVector{50,Float64}(undef);

julia> Aa = Array(A); xa = Array(x); ya = Array(y);

julia> @benchmark mul!($y, $A, $x) #mul! defined in PaddedMatrices
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     58.239 ns (0.00% GC)
  median time:      58.377 ns (0.00% GC)
  mean time:        61.044 ns (0.00% GC)
  maximum time:     95.340 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     983

julia> @benchmark mul!($ya, $Aa, $xa) # LinearAlgebra.mul!; BLAS
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     192.109 ns (0.00% GC)
  median time:      202.958 ns (0.00% GC)
  mean time:        210.614 ns (0.00% GC)
  maximum time:     2.324 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     641

The key advantage at these small sizes is not cache efficiency, but increasing the number % of time spent computing (fma instructions) relative to moving memory in and out or registers.
Lets look at your code:

function mymuloriginal!(y, A, x)
    y .= 0.0
    @inbounds for j = 1:length(x)
        @simd for i = 1:length(y)
            y[i] += A[i,j] * x[j]
        end
    end
end

The inner loop requires loading each element of y, and a column of A. Every element of y therefore gets loaded length(x) times. That is the #1 problem.
If the inner loop isn’t unrolled (beyond vectorizing), each iteration of the inner loop will do the following:

  1. Load a vector from A.
  2. Load a vector from y.
  3. update y with an fma instruction: ynew = A * x + yold
  4. store ynew.

Most modern x86_64 CPUs can perform up to 2 aligned vector loads / clock cycle, or just 1 load the crosses a 64 byte boundary. With avx2 (32 bytes), half of unaligned loads would cross this boundary (unless the loads happen to be aligned by chance). With avx512 (64 byte loads), all would cross that boundary, unless they’re aligned by chance*.

  • In Julia, a small vector of doubles is aligned to 8 byte boundaries, meaning 1/4 chance that they’ll be aligned to a 32 byte boundary (guaranteeing avx2 never crosses 64 byte boundaries), and 1/8 for avx512. Big arrays will be 64-byte aligned.

These CPUs can also perform up to 2 fma (fused-multiply-add) instructions / clock cycle.
But, above, we already wrote down that they’re doing 2 loads per 1 fma, so the CPU will hit the cap on the number of loads, and can do at most a single fma per clock cycle (less if the loads cross those 64-byte alignment boundaries).

With that in mind, how can we speed up your mul?
Let’s avoid loading and storing into y as much as possible. Note that we’ll still want to do several vectors of y at a time, because each iteration of the j loop depends on the former; we don’t want to be bottle necked by instruction latency either.

You’ll need to pick an unroll factor.
4 * SIMD width is a good choice. For avx2 with double precision, SIMD width is 4, so I use 16 here (although I’m on a computer with 32 is actually a better choice in general).

function mymul!(y, A, x)
    y .= 0.0
    N = length(y)
    Nrep, Nrem = divrem(N, 16)
    ioffset = 0
    @inbounds for _ = 1:Nrep
        Base.Cartesian.@nexprs 16 i -> y_i = y[i+ioffset]
        for j = 1:length(x)
            Base.Cartesian.@nexprs 16 i -> y_i = muladd(A[i+ioffset,j], x[j], y_i)
        end
        Base.Cartesian.@nexprs 16 i -> y[i+ioffset] = y_i
        ioffset += 16
    end
    @inbounds for j = 1:length(x)
        for i = 1:Nrem
            y[i+ioffset] += A[i+ioffset,j] * x[j]
        end
    end
end

Let’s time it

julia> @benchmark mymul!($ya, $Aa, $xa)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     257.567 ns (0.00% GC)
  median time:      271.970 ns (0.00% GC)
  mean time:        284.745 ns (0.00% GC)
  maximum time:     405.791 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     349

vs for your original version:

julia> @benchmark mymuloriginal!($ya, $Aa, $xa)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     699.648 ns (0.00% GC)
  median time:      703.262 ns (0.00% GC)
  mean time:        733.009 ns (0.00% GC)
  maximum time:     4.549 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     145

So we’re a lot closer to BLAS, but still way behind PaddedMatrices.
To try and at least beat BLAS, lets be more clever about how we handle the remainder.

I’ll also get a little silly with some other optimizations, namely note that

julia> 117 >>> 2 == 117 ÷ 4
true

julia> 117 & 3 == 117 % 4
true

These are a faster way to calculate the number of repetitions and remainders we need to break down the calculation into successively smaller chunks, but they only work for powers of 2. We’ll go for three chunks:

  1. A fast chunk, of 4W. We’ll do length(y) ÷ (4W) of these.
  2. Chunks of size W. We’ll do (length(y) % (4W)) ÷ W of these.
  3. The remainder. We’ll do 1 of these if length(y) % W > 0, 0 otherwise.

We do the first fast chunk in batches of 4, because it takes a few clock cycles (4 on my computer) for an fma instruction to actually complete. Because each iteration j in your loop depends on the previous, we’d have to wait for one to complete before starting the other. By having 4 going on at the same time, we’ll do a lot less waiting.
FLOPS will actually theoretically peak at 8 concurrent fma instructions (2 fma/clock cycle * 4 cycle latency), but 8W can be a big number. On my computer, W is actually 8, so 8W == 64, which means that with a length(y) == 50 like in this example, I wouldn’t be doing the fast iteration at all. And not doing the fast iteration is slow, not fast.
Another advantage is that the bigger this number, the more we get to reuse a value we load from x. So this is a number you could try playing with.

I’ll also add two dependencies that you can install with:

using Pkg
Pkg.add(PackageSpec(url="https://github.com/chriselrod/VectorizationBase.jl"))
Pkg.add(PackageSpec(url="https://github.com/chriselrod/SIMDPirates.jl"))
using VectorizationBase, SIMDPirates
const W, Wshift = VectorizationBase.pick_vector_width_shift(Float64) # 8, 3 on this computer

Note that Wshift = log2(W).
This will make it easier to explicitly do the batches of size W.

function mymulfinal!(y, A, x)
    N = length(y)
    Nrep = N >>> (Wshift + 2)
    Nrem = N & (4W - 1)
    ioffset = 0
    GC.@preserve y A begin # protect A and y from GC
    vptrA = vectorizable(A); vptry = vectorizable(y)
    # We calculate 4W at a time
    for _ = 1:Nrep
        Base.Cartesian.@nexprs 4 i -> vy_i = vbroadcast(Vec{W,Float64}, 0.0)
        for j = 1:length(x)
            @inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
            Base.Cartesian.@nexprs 4 i -> begin
                 vA_i = vload(Vec{W,Float64}, vptrA, W*(i-1) + ioffset + (j-1)*stride(A,2))
                 vy_i = vmuladd(vA_i, vx, vy_i)
            end
        end
        Base.Cartesian.@nexprs 4 i -> vstore!(vptry, vy_i, W*(i-1) + ioffset)
        ioffset += 4W
    end
    # Now we do W at a time
    Nrep = Nrem >>> Wshift
    Nrem &= (W - 1)
    for _ = 1:Nrep
        vy = vbroadcast(Vec{W,Float64}, 0.0)
        for j = 1:length(x)
            @inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
             vA = vload(Vec{W,Float64}, vptrA, ioffset + (j-1)*stride(A,2))
             vy = vmuladd(vA, vx, vy)
        end
        vstore!(vptry, vy, ioffset)
        ioffset += W
    end
    Nrem == 0 && return y
    mask = VectorizationBase.mask(Float64, Nrem)
    vy = vbroadcast(Vec{W,Float64}, 0.0)
    for j = 1:length(x)
        @inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
         vA = vload(Vec{W,Float64}, vptrA, ioffset + (j-1)*stride(A,2), mask)
         vy = vmuladd(vA, vx, vy)
    end
    vstore!(vptry, vy, ioffset, mask)
    end # GC.@preserve
    y
end

A few explanations:

  • The vload and vstore! functions operate on pointers, which is why we need GC.@preserve. Julia’s garbage collector does not track pointers, and we don’t want to risk it ever thinking we’re not using A or y anymore, while we’re using pointers to their memory.
  • The vectorizable function actually returns a wrapper to a pointer that supports “pointer arithmetic”, so that we don’t have to multiply by the number of bytes.
  • vbroadcast creates a vector where every element is the argument.
  • vload loads a vector of consecutive elements. If we give it a mask, it’ll only load the elements where the mask is set to 1. This lets us avoid reading past the end of the vector / a matrix.
  • vstore! is similar, but for storing a vector.
  • Finally, we stop loading from y altogether, so that we also don’t have to set it to zero.

Benchmarking:

julia> @benchmark mymulfinal!($ya, $Aa, $xa)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     181.564 ns (0.00% GC)
  median time:      181.800 ns (0.00% GC)
  mean time:        185.280 ns (0.00% GC)
  maximum time:     384.583 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     690

julia> using LinearAlgebra; BLAS.vendor()
:mkl

Just edging out MKL (where we had 192ns minimum). :slight_smile:

60 Likes

Looking at the source code of LinearAlgebra.dot(x, A, y) (in Julia master), it doesn’t look as complicated and yet it seems fast enough. Are the memory load issues that you describe irrelevant in this case @Elrod?

I doubt I’m the only person thinking this, but I really wish I had $1 million to buy a few years of Chris Elrod and Yingbo Ma’s time, so that they could write a BLAS library that can compete with MKL.

The funny thing is that it actually seems realistic to think that 2 really smart developers using Julia can compete with an entire team funded by Intel. :smiley:

12 Likes

They’re not irrelevant. Adapting the earlier code to do a quadratic form:

using VectorizationBase, SIMDPirates
const W, Wshift = VectorizationBase.pick_vector_width_shift(Float64)
function quadform(y, A, x)
    N = length(y)
    Nrep = N >>> (Wshift + 2)
    Nrem = N & (4W - 1)
    ioffset = 0
    Base.Cartesian.@nexprs 4 i-> s_i = vbroadcast(Vec{W,Float64}, 0.0)
    GC.@preserve y A begin # protect A and y from GC
        vptrA = vectorizable(A); vptry = vectorizable(y)
        # We calculate 4W at a time
        for _ = 1:Nrep
            Base.Cartesian.@nexprs 4 i -> vy_i = vload(Vec{W,Float64}, vptry, W*(i-1) + ioffset)
            for j = 1:length(x)
                @inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
                Base.Cartesian.@nexprs 4 i -> begin
                    vA_i = vload(Vec{W,Float64}, vptrA, W*(i-1) + ioffset + (j-1)*stride(A,2))
                    s_i = vmuladd(vA_i, vmul(vx, vy_i), s_i)
                end
            end
            ioffset += 4W
        end
        # Now we do W at a time
        Nrep = Nrem >>> Wshift
        Nrem &= (W - 1)
        for _ = 1:Nrep
            vy = vload(Vec{W,Float64}, vptry, ioffset)
            for j = 1:length(x)
                @inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
                vA = vload(Vec{W,Float64}, vptrA, ioffset + (j-1)*stride(A,2))
                s_1 = vmuladd(vA, vmul(vx, vy), s_1)
            end
            ioffset += W
        end
        Nrem == 0 && return y
        mask = VectorizationBase.mask(Float64, Nrem)
        vy = vload(Vec{W,Float64}, vptry, ioffset, mask)
        for j = 1:length(x)
            @inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
            vA = vload(Vec{W,Float64}, vptrA, ioffset + (j-1)*stride(A,2), mask)
            s_2 = vmuladd(vA, vmul(vx, vy), s_2)
        end
    end # GC.@preserve
    s_1 = SIMDPirates.evadd(s_1, s_2) # evadd doesn't pass flag
    s_3 = SIMDPirates.evadd(s_3, s_4) # I don't want these ops reordered
    vsum(vadd(s_1, s_3))
end

Benchmarking:

julia> using LinearAlgebra, BenchmarkTools

julia> N = 50;

julia> A = randn(N,N); x = randn(N); y = randn(N);

julia> @btime dot($y, $A, $x)
  1.147 μs (0 allocations: 0 bytes)
-14.992906756397531

julia> @btime quadform($y, $A, $x)
  324.764 ns (0 allocations: 0 bytes)
-14.992906756397531

julia> @benchmark dot($y, $A, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.140 μs (0.00% GC)
  median time:      1.149 μs (0.00% GC)
  mean time:        1.153 μs (0.00% GC)
  maximum time:     2.312 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark quadform($y, $A, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     324.772 ns (0.00% GC)
  median time:      326.015 ns (0.00% GC)
  mean time:        327.348 ns (0.00% GC)
  maximum time:     498.465 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     228

We get an over 3x improvement!

Would be nice! It’s worth pointing out that Intel has a lot of good developers too. Intel hired Goto, after all.
I’m not aiming at MKL, in large part because MKL is on top of the fruit-tree. Practically any other fruit will be lower. MKL will also be impossible to beat in terms of compile times, because it is precompiled.
Wins could of course come in flexibility, fusing operations and compile-time kernel generation.

I have some other obligations at work, but aside from all of IT’s hurdles, they give me a lot of freedom to spend time on Bayesian-statistics-related software development.
So I’m aiming at Hamiltonian Monte Carlo, where, for example, many functions in Stan are easy to beat many times over.
Currently however, I’m reworking LoopVectorization, where the plan is to make it handle variable numbers of loops, while considering both unrolling and tiling.
It should be able to recreate the simple matrix routines I have in PaddedMatrices.jl currently.
Eventually, it would be neat to both (a) combine it with something like Einsum.jl and (b) have it also consider packing.
My more immediate priority though is to use it for broadcasting PaddedMatrices, and add autodiff support so that they allow broadcasting and maybe some simple loops in ProbabilityModels, making the library much more flexible. Once I finish that, I’ll start working on tagging releases / registering my libraries.

11 Likes

Wow, that’s amazing!

It would be a good idea to add this code to LinearAlgebra.dot!

Of course Intel has lots of good developers, and far be it from me to try to figure out who has more skill. My comment was mainly that it says quite a bit about the quality of Julia as a tool for doing this sort of thing productively, that I think you’d still be competitive with an Intel team that has to do it in C. (Or whatever they use to write MKL)