The first thing I would note here is the benchmark:

Notice how performance plummets beyond around `60`x`60`?

This is because LoopVectorization does register tiling, but does not yet do cache tiling. I plan for this to change in the future, but for now, that means we must do the cache-tiling ourselves.

What does "tiling" mean? It means operating block-wise to get data re-use. "Cache-tiling" means re-use in the CPU's cache, and "register-tiling" means re-use in the CPU's registers.

```
@inbounds for j in 1:size(A,2)
    @simd for i in 1:size(A,1)
        y[i] += A[i,j] * x[j]
    end
end
```

When register-tiling here, note that as the loop above is written, we need to:

- load `x[j]` a total of `size(A,2)` times, the minimum possible.
- load `A[i,j]` a total of `size(A,1) * size(A,2)` times, the minimum possible.
- load and store `y[i]` a total of `size(A,1) * size(A,2)` times, an excess factor of `size(A,2)` (see the quick tally after this list).
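To put concrete numbers on those counts, here is a quick back-of-the-envelope tally for the `500`x`500` size used in the next example (the variable names are just for illustration; the values are nothing but the products from the list above):

```
M = N = 500
x_loads = N          # 500: one load of x[j] per j iteration (already the minimum)
A_loads = M * N      # 250_000: one load of A[i,j] per (i,j) pair (already the minimum)
y_moves = M * N      # 250_000 loads plus 250_000 stores of y[i], vs. the ideal of just M = 500
```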

Register tiling will attack that third point. To investigate what exactly it does:

```
julia> using LoopVectorization
julia> M = N = 500;
julia> y = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N);
julia> ls = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
y[i] += A[i,j] * x[j]
end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#640"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:A,Symbol("##vptr##_A")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))
julia> LoopVectorization.choose_order(ls)
([:i, :j], :j, :i, :i, 1, 8)
```

The vector `[:i, :j]` gives the order of the loops from outer-most to inner-most. That is, it placed the `i` loop as outer-most, while the `j` loop is inner-most.

The next two symbols give which loops are unrolled, by the factors given by the two integers: `j` is unrolled by `1`x (i.e., it isn't), and `i` by `8`x. Given that `j` is now the inner-most loop, there isn't much sense in unrolling it.

The last symbol refers to which loop is actually vectorized.
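To make that concrete, here is a hand-written sketch of the loop structure this order and unrolling imply. It is only an illustration, not the code `@avx` actually emits (the real thing works on SIMD vectors rather than scalars, uses the unroll factor of `8` reported above, and handles the remainder when `size(A,1)` isn't a multiple of the unroll); unrolling by 4 here just keeps it short:

```
# i outer-most and unrolled (by 4 for brevity), j inner-most.
# The accumulators live in registers for the whole j loop, and each
# x[j] load is shared by all of the unrolled i iterations.
for i in 1:4:size(A,1)              # assumes size(A,1) is a multiple of 4
    y1 = y2 = y3 = y4 = zero(eltype(y))
    for j in 1:size(A,2)
        xj = x[j]                   # one load, re-used 4 times
        y1 += A[i,   j] * xj
        y2 += A[i+1, j] * xj
        y3 += A[i+2, j] * xj
        y4 += A[i+3, j] * xj
    end
    y[i]   += y1                    # y[i] is touched once per i, not once per (i,j)
    y[i+1] += y2
    y[i+2] += y3
    y[i+3] += y4
end
```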

Note that this can change as a function of types:

```
julia> B = A';
julia> lst = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
y[i] += B[i,j] * x[j]
end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#656"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:B,Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.RowMajorStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))
julia> LoopVectorization.choose_order(lst) # j is now vectorized instead of `i`
([:i, :j], :j, :i, :j, 1, 8)
```

(Because `B = A'` is a lazy transpose, stepping along `j` in `B[i,j]` walks down a column of `A`, so `j` is now the contiguous direction and is the one that gets vectorized.)

Anyway, back to the original `([:i, :j], :j, :i, :i, 1, 8)` example.

We still load from `A` once for each `(i,j)` combination. We were already at the minimum, so no savings there.

As `i` is now the outer-most loop, we load and store `y[i]` only `size(A,1)` times.

However, we now have to load `x[j]` a total of `cld(size(A,1),8)*size(A,2)` times. The division by `8` comes from the unrolling: each load of `x[j]` is re-used by all `8` unrolled iterations of the `i` loop, so `8`x unrolling means an `8`x reduction in how many times we have to load from this vector (compared to the `size(A,1)*size(A,2)` loads it would take otherwise)!
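Putting numbers on it for the `500`x`500` case (again just plugging into the expressions above, with the unroll factor of `8` that `choose_order` picked; the variable names are only for illustration):

```
M = N = 500; U = 8
y_moves_naive = M * N            # 250_000 loads and stores of y[i] with j as the outer loop
y_moves_tiled = M                # 500 after register tiling: the minimum
x_loads_naive = N                # 500 with j as the outer loop: the minimum
x_loads_tiled = cld(M, U) * N    # 63 * 500 = 31_500: more than before, but about 8x fewer
                                 # than the M*N it would cost without the unrolling
```

The extra `x[j]` loads are the price paid for cutting `y[i]` traffic by a factor of `size(A,2)`.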

We can see some spectacular performance improvements from this:

```
julia> function gemvavx1!(y, A, x)
           @avx for j in axes(A,2), i in axes(A,1)
               y[i] += A[i,j] * x[j]
           end
       end
gemvavx1! (generic function with 1 method)
julia> function gemvavx2!(y, A, x)
           @avx for i in axes(A,1)
               yᵢ = zero(eltype(y))
               for j ∈ axes(A,2)
                   yᵢ += A[i,j] * x[j]
               end
               y[i] = yᵢ
           end
       end
gemvavx2! (generic function with 1 method)
julia> function gemv_simd!(y, A, x)
           @inbounds for j in axes(A,2); @simd ivdep for i in axes(A,1)
               y[i] += A[i,j] * x[j]
           end; end
       end
gemv_simd! (generic function with 1 method)
julia> M = N = 56;
julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);
julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 862.015 ns (0.00% GC)
median time: 867.697 ns (0.00% GC)
mean time: 867.474 ns (0.00% GC)
maximum time: 1.537 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 66
julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 92.937 ns (0.00% GC)
median time: 93.351 ns (0.00% GC)
mean time: 93.472 ns (0.00% GC)
maximum time: 147.397 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 954
julia> @benchmark gemvavx2!($y3, $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 76.973 ns (0.00% GC)
median time: 77.107 ns (0.00% GC)
mean time: 77.198 ns (0.00% GC)
maximum time: 130.935 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 969
julia> y1 ≈ y2 ≈ y3
true
```

But only at these tiny sizes.

Your CPU is way faster at computing than it is at moving memory around.
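To put a rough number on that for this kernel (a back-of-the-envelope arithmetic-intensity estimate, nothing machine-specific): each element of `A` costs one multiply and one add, but also 8 bytes of traffic, so there is very little computation to hide the memory accesses behind.

```
M = N = 500
flops = 2 * M * N    # one multiply + one add per element of A
bytes = 8 * M * N    # each Float64 of A is read exactly once
flops / bytes        # 0.25 flops per byte: far less work per byte than the core can do
```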

So `A * x` requires on the order of `length(A)` operations while touching on the order of `length(A)` memory, and in the end it'll perform more or less the same as `sum`:

```
julia> function sumavx(A)
           s = zero(eltype(A))
           @avx for i ∈ eachindex(A)
               s += A[i]
           end
           s
       end
sumavx (generic function with 1 method)
julia> @btime sum($A)
192.643 ns (0 allocations: 0 bytes)
1578.41421948359
julia> @btime sumavx($A)
67.960 ns (0 allocations: 0 bytes)
1578.41421948359
```

This is because `A * x` and `sum(A)` require accessing roughly the same amount of memory.

And once these arrays stop fitting in our CPU's caches, memory bandwidth is all that determines performance. Make `A` `500`x`500`, and `gemv_simd!`, which was so slow before, is now just as fast as everything else:

```
julia> M = N = 500;
julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);
julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 82.602 μs (0.00% GC)
median time: 85.719 μs (0.00% GC)
mean time: 85.867 μs (0.00% GC)
maximum time: 168.723 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 82.080 μs (0.00% GC)
median time: 84.833 μs (0.00% GC)
mean time: 84.985 μs (0.00% GC)
maximum time: 257.037 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark gemvavx2!($y3, $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 83.944 μs (0.00% GC)
median time: 84.750 μs (0.00% GC)
mean time: 84.895 μs (0.00% GC)
maximum time: 258.025 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> y1 ≈ y2 ≈ y3
true
julia> @btime sumavx($A)
83.417 μs (0 allocations: 0 bytes)
125042.8745667071
julia> @btime sumsimd($A)
83.430 μs (0 allocations: 0 bytes)
125042.87456670718
```
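As a rough sanity check on the bandwidth claim (back-of-the-envelope only; the exact figure depends on the machine):

```
# A alone is 500*500 Float64s ≈ 2 MB, too big for typical L1/L2 caches,
# so every call has to stream essentially all of A from L3/RAM.
bytes_streamed = 500 * 500 * 8    # 2_000_000 bytes
bytes_streamed / 83e-6 / 1e9      # ≈ 24 GB/s sustained just to read A once in ~83 μs
```

That is consistent with the point above: at this size, memory bandwidth rather than compute sets the pace.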

So the best you can do is add multithreading and break the work up along `i` (you can't parallelize the `j` loop, or you'd have different threads reading and writing `y` simultaneously):

```
function gemv_threads!(y, A, x)
    M = length(y)
    chunks = min(M >> 7, Base.Threads.nthreads())  # no more tasks than M ÷ 128, so each gets a decent chunk
    chunks > 1 || return gemvavx2!(y, A, x)        # too small to be worth threading
    chunk_size = LoopVectorization.VectorizationBase.align(cld(M, chunks), eltype(y))  # round chunk length up to an aligned boundary
    Base.Threads.@sync for c in 1:chunks
        Threads.@spawn begin
            b = chunk_size*(c-1) + 1
            e = min(M, chunk_size*c)
            r = b:e
            gemvavx2!(view(y, r), view(A, r, :), x)  # each task handles its own block of rows
        end
    end
    y
end
```

With this, for 500x500, I get:

```
julia> @benchmark gemv_threads!($y3, $A, $x)
BenchmarkTools.Trial:
memory estimate: 2.09 KiB
allocs estimate: 30
--------------
minimum time: 10.343 μs (0.00% GC)
median time: 35.046 μs (0.00% GC)
mean time: 35.778 μs (0.00% GC)
maximum time: 164.298 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> using LinearAlgebra; BLAS.set_num_threads(Base.Threads.nthreads());
julia> @benchmark mul!($y2, $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.815 μs (0.00% GC)
median time: 15.596 μs (0.00% GC)
mean time: 15.964 μs (0.00% GC)
maximum time: 82.304 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> y2 ≈ y3
true
```

Peak (minimum-time) performance is similar to OpenBLAS here, but the threaded version's average performance is much worse.

As @stillyslalom said, LoopVectorization does its own modeling to make decisions and then generates code with lots of `llvmcall`.