Apologies for all those deleted posts. I keep accidentally hotkeying post before Iâ€™m actually done. =/
We can actually get much faster with a pure Julia implementation, at least if we use statically sized arrays:
julia> using PaddedMatrices, BenchmarkTools
julia> A = @Mutable randn(50,50);
julia> x = @Mutable randn(50);
julia> y = FixedSizeVector{50,Float64}(undef);
julia> Aa = Array(A); xa = Array(x); ya = Array(y);
julia> @benchmark mul!($y, $A, $x) #mul! defined in PaddedMatrices
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0

minimum time: 58.239 ns (0.00% GC)
median time: 58.377 ns (0.00% GC)
mean time: 61.044 ns (0.00% GC)
maximum time: 95.340 ns (0.00% GC)

samples: 10000
evals/sample: 983
julia> @benchmark mul!($ya, $Aa, $xa) # LinearAlgebra.mul!; BLAS
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0

minimum time: 192.109 ns (0.00% GC)
median time: 202.958 ns (0.00% GC)
mean time: 210.614 ns (0.00% GC)
maximum time: 2.324 ÎĽs (0.00% GC)

samples: 10000
evals/sample: 641
The key advantage at these small sizes is not cache efficiency, but increasing the number % of time spent computing (fma
instructions) relative to moving memory in and out or registers.
Lets look at your code:
function mymuloriginal!(y, A, x)
y .= 0.0
@inbounds for j = 1:length(x)
@simd for i = 1:length(y)
y[i] += A[i,j] * x[j]
end
end
end
The inner loop requires loading each element of y
, and a column of A
. Every element of y
therefore gets loaded length(x)
times. That is the #1 problem.
If the inner loop isnâ€™t unrolled (beyond vectorizing), each iteration of the inner loop will do the following:
 Load a vector from A.
 Load a vector from y.
 update y with an fma instruction: ynew = A * x + yold
 store ynew.
Most modern x86_64 CPUs can perform up to 2 aligned vector loads / clock cycle, or just 1 load the crosses a 64 byte boundary. With avx2 (32 bytes), half of unaligned loads would cross this boundary (unless the loads happen to be aligned by chance). With avx512 (64 byte loads), all would cross that boundary, unless theyâ€™re aligned by chance*.
 In Julia, a small vector of doubles is aligned to 8 byte boundaries, meaning 1/4 chance that theyâ€™ll be aligned to a 32 byte boundary (guaranteeing avx2 never crosses 64 byte boundaries), and 1/8 for avx512. Big arrays will be 64byte aligned.
These CPUs can also perform up to 2 fma (fusedmultiplyadd) instructions / clock cycle.
But, above, we already wrote down that theyâ€™re doing 2 loads per 1 fma, so the CPU will hit the cap on the number of loads, and can do at most a single fma per clock cycle (less if the loads cross those 64byte alignment boundaries).
With that in mind, how can we speed up your mul?
Letâ€™s avoid loading and storing into y
as much as possible. Note that weâ€™ll still want to do several vectors of y
at a time, because each iteration of the j
loop depends on the former; we donâ€™t want to be bottle necked by instruction latency either.
Youâ€™ll need to pick an unroll factor.
4 * SIMD width is a good choice. For avx2 with double precision, SIMD width is 4, so I use 16 here (although Iâ€™m on a computer with 32 is actually a better choice in general).
function mymul!(y, A, x)
y .= 0.0
N = length(y)
Nrep, Nrem = divrem(N, 16)
ioffset = 0
@inbounds for _ = 1:Nrep
Base.Cartesian.@nexprs 16 i > y_i = y[i+ioffset]
for j = 1:length(x)
Base.Cartesian.@nexprs 16 i > y_i = muladd(A[i+ioffset,j], x[j], y_i)
end
Base.Cartesian.@nexprs 16 i > y[i+ioffset] = y_i
ioffset += 16
end
@inbounds for j = 1:length(x)
for i = 1:Nrem
y[i+ioffset] += A[i+ioffset,j] * x[j]
end
end
end
Letâ€™s time it
julia> @benchmark mymul!($ya, $Aa, $xa)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0

minimum time: 257.567 ns (0.00% GC)
median time: 271.970 ns (0.00% GC)
mean time: 284.745 ns (0.00% GC)
maximum time: 405.791 ns (0.00% GC)

samples: 10000
evals/sample: 349
vs for your original version:
julia> @benchmark mymuloriginal!($ya, $Aa, $xa)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0

minimum time: 699.648 ns (0.00% GC)
median time: 703.262 ns (0.00% GC)
mean time: 733.009 ns (0.00% GC)
maximum time: 4.549 ÎĽs (0.00% GC)

samples: 10000
evals/sample: 145
So weâ€™re a lot closer to BLAS, but still way behind PaddedMatrices
.
To try and at least beat BLAS, lets be more clever about how we handle the remainder.
Iâ€™ll also get a little silly with some other optimizations, namely note that
julia> 117 >>> 2 == 117 Ă· 4
true
julia> 117 & 3 == 117 % 4
true
These are a faster way to calculate the number of repetitions and remainders we need to break down the calculation into successively smaller chunks, but they only work for powers of 2. Weâ€™ll go for three chunks:
 A fast chunk, of 4W. Weâ€™ll do
length(y) Ă· (4W)
of these.
 Chunks of size W. Weâ€™ll do
(length(y) % (4W)) Ă· W
of these.
 The remainder. Weâ€™ll do 1 of these if
length(y) % W > 0
, 0 otherwise.
We do the first fast chunk in batches of 4, because it takes a few clock cycles (4 on my computer) for an fma
instruction to actually complete. Because each iteration j
in your loop depends on the previous, weâ€™d have to wait for one to complete before starting the other. By having 4 going on at the same time, weâ€™ll do a lot less waiting.
FLOPS will actually theoretically peak at 8 concurrent fma instructions (2 fma/clock cycle * 4 cycle latency), but 8W can be a big number. On my computer, W
is actually 8, so 8W == 64
, which means that with a length(y) == 50
like in this example, I wouldnâ€™t be doing the fast iteration at all. And not doing the fast iteration is slow, not fast.
Another advantage is that the bigger this number, the more we get to reuse a value we load from x
. So this is a number you could try playing with.
Iâ€™ll also add two dependencies that you can install with:
using Pkg
Pkg.add(PackageSpec(url="https://github.com/chriselrod/VectorizationBase.jl"))
Pkg.add(PackageSpec(url="https://github.com/chriselrod/SIMDPirates.jl"))
using VectorizationBase, SIMDPirates
const W, Wshift = VectorizationBase.pick_vector_width_shift(Float64) # 8, 3 on this computer
Note that Wshift = log2(W).
This will make it easier to explicitly do the batches of size W
.
function mymulfinal!(y, A, x)
N = length(y)
Nrep = N >>> (Wshift + 2)
Nrem = N & (4W  1)
ioffset = 0
GC.@preserve y A begin # protect A and y from GC
vptrA = vectorizable(A); vptry = vectorizable(y)
# We calculate 4W at a time
for _ = 1:Nrep
Base.Cartesian.@nexprs 4 i > vy_i = vbroadcast(Vec{W,Float64}, 0.0)
for j = 1:length(x)
@inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
Base.Cartesian.@nexprs 4 i > begin
vA_i = vload(Vec{W,Float64}, vptrA, W*(i1) + ioffset + (j1)*stride(A,2))
vy_i = vmuladd(vA_i, vx, vy_i)
end
end
Base.Cartesian.@nexprs 4 i > vstore!(vptry, vy_i, W*(i1) + ioffset)
ioffset += 4W
end
# Now we do W at a time
Nrep = Nrem >>> Wshift
Nrem &= (W  1)
for _ = 1:Nrep
vy = vbroadcast(Vec{W,Float64}, 0.0)
for j = 1:length(x)
@inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
vA = vload(Vec{W,Float64}, vptrA, ioffset + (j1)*stride(A,2))
vy = vmuladd(vA, vx, vy)
end
vstore!(vptry, vy, ioffset)
ioffset += W
end
Nrem == 0 && return y
mask = VectorizationBase.mask(Float64, Nrem)
vy = vbroadcast(Vec{W,Float64}, 0.0)
for j = 1:length(x)
@inbounds vx = vbroadcast(Vec{W,Float64}, x[j])
vA = vload(Vec{W,Float64}, vptrA, ioffset + (j1)*stride(A,2), mask)
vy = vmuladd(vA, vx, vy)
end
vstore!(vptry, vy, ioffset, mask)
end # GC.@preserve
y
end
A few explanations:
 The
vload
and vstore!
functions operate on pointers, which is why we need GC.@preserve
. Juliaâ€™s garbage collector does not track pointers, and we donâ€™t want to risk it ever thinking weâ€™re not using A
or y
anymore, while weâ€™re using pointers to their memory.
 The
vectorizable
function actually returns a wrapper to a pointer that supports â€śpointer arithmeticâ€ť, so that we donâ€™t have to multiply by the number of bytes.

vbroadcast
creates a vector where every element is the argument.

vload
loads a vector of consecutive elements. If we give it a mask, itâ€™ll only load the elements where the mask is set to 1
. This lets us avoid reading past the end of the vector / a matrix.

vstore!
is similar, but for storing a vector.
 Finally, we stop loading from
y
altogether, so that we also donâ€™t have to set it to zero.
Benchmarking:
julia> @benchmark mymulfinal!($ya, $Aa, $xa)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0

minimum time: 181.564 ns (0.00% GC)
median time: 181.800 ns (0.00% GC)
mean time: 185.280 ns (0.00% GC)
maximum time: 384.583 ns (0.00% GC)

samples: 10000
evals/sample: 690
julia> using LinearAlgebra; BLAS.vendor()
:mkl
Just edging out MKL (where we had 192ns minimum).