Actually, I don't think this should necessarily be a problem here.

512:

```
julia> using LoopVectorization, BenchmarkTools

julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(C,1), n ∈ axes(C,2)
               Cₘₙ = zero(eltype(C))
               for k ∈ axes(B,1)
                   Cₘₙ += A[m,k] * B[k,n]
               end
               C[m,n] = Cₘₙ
           end
       end
AmulB! (generic function with 1 method)

julia> M, K, N = 512, 72, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     44.239 μs (0.00% GC)
  median time:      44.365 μs (0.00% GC)
  mean time:        44.423 μs (0.00% GC)
  maximum time:     120.546 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 44.239e-6
119.9940324148376
```

vs 520:

```
julia> M, K, N = 520, 72, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.211 μs (0.00% GC)
  median time:      43.345 μs (0.00% GC)
  mean time:        43.409 μs (0.00% GC)
  maximum time:     104.449 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 43.211e-6
124.7682303117262
```

The CPU I ran this on (a 10980XE) has an 8-way associative, 32 KiB L1D cache per core:

```
julia> 32 * (2 ^ 10) ÷ (8 * 8)  # (cache bytes ÷ ways) ÷ sizeof(Float64)
512
```

Yielding the fabled 512 as the magic(-ally bad) number.

But 119 (M = 512) vs 124 (M = 520) GFLOPS isn't a big deal.

Both are very good; the theoretical peak for this CPU is 131.2 GFLOPS.

Agner Fog's example (see section 9.10, starting on page 103) is transposing a matrix.

We can try this:

```
julia> A = rand(512, 512); B = similar(A);

julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     844.904 μs (0.00% GC)
  median time:      850.551 μs (0.00% GC)
  mean time:        850.523 μs (0.00% GC)
  maximum time:     1.722 ms (0.00% GC)
  --------------
  samples:          5876
  evals/sample:     1

julia> A = rand(520, 520); B = similar(A);

julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     465.476 μs (0.00% GC)
  median time:      470.590 μs (0.00% GC)
  mean time:        470.852 μs (0.00% GC)
  maximum time:     873.865 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

Transposing a 512x512 matrix took about 80% longer than transposing a 520x520 matrix on my computer. Ouch!

(The problem is much less severe if you use FastTranspose.jl, where I get 285 microseconds for 512x512 vs 235 for 520x520.)

That is because what were contiguous loads (down a column) from `A` suddenly become 512 doubles apart when storing into `B`, so that each one evicts the previous cache line.
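One standard fix is to copy in small square tiles, so that both the loads and the stores stay within a handful of cache lines at a time. Here is a minimal sketch of that idea; `blocked_transpose!` and the tile size are my own hypothetical choices, not FastTranspose.jl's actual implementation:

```julia
# Minimal blocked-transpose sketch (hypothetical helper, not FastTranspose.jl).
# Copying an 8x8 tile touches only a few cache lines on each side, so a
# power-of-two stride no longer evicts every previously loaded line.
function blocked_transpose!(B, A, blk = 8)
    m, n = size(A)
    @inbounds for j0 in 1:blk:n, i0 in 1:blk:m
        for j in j0:min(j0 + blk - 1, n), i in i0:min(i0 + blk - 1, m)
            B[j, i] = A[i, j]
        end
    end
    return B
end
```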

So, to reproduce this in matrix multiplication, we should similarly expect pathological behavior when multiplying column-major `A * B` without packing into a tile-major format. This is because (in column-major matmul) on each iteration of the `k` loop, we load vectors from a column of `A`, and a bunch of scalars from a row of `B`. So if `B`'s stride is 512, each of these scalars will be from a different cacheline.
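To make the collision concrete, here is a back-of-the-envelope check (assuming the typical L1D geometry implied above: 64-byte lines, 64 sets, 8 ways) that reads along a row of `B` with a 512-double stride all map to the same cache set:

```julia
# Which L1D set does the cache line holding B[k, n] map to, as n varies?
K = 512                      # B's column stride, in Float64 elements
line_bytes, nsets = 64, 64   # assumed geometry: 32 KiB ÷ 64 B lines ÷ 8 ways
set_of(n) = (n * K * 8 ÷ line_bytes) % nsets
sets_hit = unique(set_of.(0:15))   # 16 consecutive scalars from row k
# All 16 lines map to a single set, but an 8-way cache can hold only 8.
```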

Yet, performance is still fine:

```
julia> M, K, N = 72, 512, 72;

julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N); C2 = A * B;

julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.920 μs (0.00% GC)
  median time:      44.285 μs (0.00% GC)
  mean time:        44.367 μs (0.00% GC)
  maximum time:     120.774 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> 2e-9M*K*N / 43.920e-6
120.86557377049182
```

That's 120 GFLOPS. I'm happy to see great performance, but it seems like I need to understand this problem better.

@biona001

> My matrices are typically 512 by 5~30k. I suppose this is too large for LoopVectorization?

At that size, why not multithread the BLAS and sum? How does `ThreadsX.sum(A * A')` or simply `sum(A * A')` do?

Additionally, do they all really have 512 rows? If so, you can use a preallocated 512 x 512 `C`. Since `sum(A * A') ≈ sum(A' * A)`, you'll want to make the smaller dim the outer dim.
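With a preallocated `C`, that suggestion could look like the following sketch (`sum_AAt!` is a hypothetical name; `mul!` is the standard in-place multiply from LinearAlgebra, which dispatches to BLAS here):

```julia
using LinearAlgebra

# Sketch: reuse one small buffer across calls, keeping the 512-row side
# (the small dimension of each A) as the output's dimensions.
function sum_AAt!(C, A)
    mul!(C, A, A')   # C = A * A', written in place with no allocation
    return sum(C)
end

A = rand(512, 5_000)                   # 512 by 5~30k shape from the question
C = Matrix{Float64}(undef, 512, 512)   # allocated once, reused per matrix
sum_AAt!(C, A)
```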

Returning to my example, I looked into why the simply written version was slower; it turns out the problem is with LLVM, so I've filed a bug there.

The simpler version gives LoopVectorization more freedom in choosing the order of the loops, in which case it chooses an order more like:

```
function sumAᵗA(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    @avx for n ∈ axes(A,1), k ∈ axes(A,2)
        t = zero(T)
        for m ∈ axes(A,1)
            t += A[m,k] * A[n,k]
        end
        s += t
    end
    s
end
```

And this is indeed faster at many sizes than what I had written last time, and manages to work around that LLVM performance bug.

I'll consider trying to automatically implement some workarounds that hopefully have less runtime overhead than doing something like the above, which *should* be slower than the simple `for m ∈ axes(A,1), n ∈ axes(A,1), k ∈ axes(A,2); s += ...` version.
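For reference, that "simple" triple-loop version written out in full, in plain Julia (in the real code the loop would carry the `@avx` annotation, which I drop here to keep the sketch dependency-free):

```julia
# Straightforward reduction: s = sum over m, n, k of A[m,k] * A[n,k],
# which equals sum(A * A') without materializing the product.
function sumAᵗA_simple(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    @inbounds for m ∈ axes(A, 1), n ∈ axes(A, 1), k ∈ axes(A, 2)
        s += A[m, k] * A[n, k]
    end
    return s
end
```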

I suggest giving it a try; it seems to scale well without resorting to extra levels of tiling:

```
julia> A = rand(512, 1000);

julia> @btime sumAᵗA($A)
  4.047 ms (0 allocations: 0 bytes)
6.5659906453390986e7

julia> 2e-9*512^2*1000 / 4.047e-3
129.54978996787744
```