I rather like my corduroy bellbottom trousers…

I can’t read the paper. It says 7 hours to run that Python code - and curiously longer in Python3.

I rather like my corduroy bellbottom trousers…

I can’t read the paper. It says 7 hours to run that Python code - and curiously longer in Python3.

The article describes the scientists as “boffins” — unfortunately, only at the end, otherwise I would have known to stop reading earlier.

Tamas, that is the style of Register articles. They have a lot of long running in-the-know humour.

Please dont be offended by it,

1 Like

I am not offended at all, I just find the article uninformative.

1 Like

The timings in the article are strange:

7 hours for Python (is not implausible, I am not patient enough to repeat the test) and 47x times faster in C - this would still be 9 minutes!

Even my very naive implementation of matrix multiplication in Julia takes only 18s on my laptop, the default OpenBlas `*`

less than 1s (Float32) / 2s (Float64).

Also the GPU timings look strange: in the article a high-end GPU tool 70ms, in my test my mediocre notebook GPU took 2.5 μs (using Float32).

Or did I miss something?

I agree with you. I find it curious that Python 3 is slower that Python2 also

`mymult_avx2`

is still slower than single threaded OpenBLAS for me, but is about 7 times faster than `mymult_avx1`

on my computer:

```
using LoopVectorization
function mymult_avxcore!(C, A:: AbstractArray{T,2}, B:: AbstractArray{T,2}) where {T}
@avx for n=1:size(B, 2), k=1:size(A,2), m=1:size(A, 1)
C[m,n] += A[m,k]*B[k,n]
end
end
function matmul_params(::Type{T}) where {T}
L₁, L₂, L₃ = LoopVectorization.VectorizationBase.CACHE_SIZE
W = LoopVectorization.VectorizationBase.pick_vector_width(T)
L₁ratio = L₁ ÷ (LoopVectorization.nᵣ * sizeof(T))
kc = round(Int, 0.65L₁ratio)
mcrep = 5L₂ ÷ (8kc * sizeof(T) * LoopVectorization.mᵣ * W)
ncrep = L₃ ÷ (kc * sizeof(T) * LoopVectorization.nᵣ)
mc = mcrep * LoopVectorization.mᵣ * W
nc = round(Int, 0.4ncrep * LoopVectorization.nᵣ)
mc, kc, nc
end
function mymult_avx2(A:: AbstractArray{T,2}, B:: AbstractArray{T,2}) where {T}
K = size(B, 1)
size(A, 2) == K || error("inner dims must match")
M = size(A,1); N = size(B,2)
C = zeros(T, M, N)
mstep, kstep, nstep = matmul_params(T)
Abuffer = Matrix{T}(undef, mstep, kstep)
mstep -= 1; kstep -= 1; nstep -= 1;
n = 1
while n <= N
nnext = min(N, n + nstep)
k = 1
while k <= K
knext = min(K, k + kstep)
m = 1
while m <= M
mnext = min(M, m + mstep)
Abufferview = @view(Abuffer[1:mnext+1-m,1:knext+1-k])
copyto!(Abufferview, @view(A[m:mnext,k:knext]))
@views mymult_avxcore!(C[m:mnext,n:nnext], Abufferview, B[k:knext,n:nnext])
m = mnext + 1
end
k = knext + 1
end
n = nnext + 1
end
C
end
```

You could also use `Threads.@spawn`

in the `m`

or `n`

loops to speed it up further.

LoopVectorization does not currently perform any memory-based optimizations, so you need to do this manually (as above) to get good performance for large arrays.

EDIT:

This version is about the same fast for square `Float64`

matrices of around 4000-per-dim, but should be faster for larger matrices (and slower for smaller ones):

```
function mymult_avx3(A:: AbstractArray{T,2}, B:: AbstractArray{T,2}) where {T}
K = size(B, 1)
size(A, 2) == K || error("inner dims must match")
M = size(A,1); N = size(B,2)
C = zeros(T, M, N)
mstep, kstep, nstep = matmul_params(T)
Abuffer = Matrix{T}(undef, mstep, kstep)
Bbuffer = Matrix{T}(undef, kstep, nstep)
mstep -= 1; kstep -= 1; nstep -= 1;
n = 1
while n <= N
nnext = min(N, n + nstep)
k = 1
while k <= K
knext = min(K, k + kstep)
Bbufferview = @view(Bbuffer[1:knext+1-k,1:nnext+1-n])
copyto!(Bbufferview, @view(B[k:knext,n:nnext]))
m = 1
while m <= M
mnext = min(M, m + mstep)
Abufferview = @view(Abuffer[1:mnext+1-m,1:knext+1-k])
copyto!(Abufferview, @view(A[m:mnext,k:knext]))
mymult_avxcore!(@view(C[m:mnext,n:nnext]), Abufferview, Bbufferview)
m = mnext + 1
end
k = knext + 1
end
n = nnext + 1
end
C
end
```

A threaded version:

```
function mymult_avx2_threaded(A:: AbstractArray{T,2}, B:: AbstractArray{T,2}) where {T}
K = size(B, 1)
size(A, 2) == K || error("inner dims must match")
M = size(A,1); N = size(B,2)
C = zeros(T, M, N)
mstep, kstep, nstep = matmul_params(T)
Threads.@sync begin; for _n in 0:cld(N,nstep)-1
Threads.@spawn begin
n = min(N, _n * nstep + 1)
nnext = min(N, n + nstep - 1)
k = 1
while k <= K
knext = min(K, k + kstep - 1)
Threads.@sync begin; for _m in 0:cld(M, mstep)-1
Threads.@spawn begin
m = min(M, _m*mstep + 1)
mnext = min(M, m + mstep - 1)
Aslice = A[m:mnext,k:knext]
@views mymult_avxcore!(C[m:mnext,n:nnext], Aslice, B[k:knext,n:nnext])
end; end; end # spawn, for, sync
k = knext + 1
end # k
end; end; end # spawn, for, sync
C
end
```

Note of course that these are rough implementations. Like, we’re just taking slices of `A`

above.

These yield

```
julia> using LinearAlgebra, BenchmarkTools; Threads.nthreads()
18
julia> M = K = N = 4439;
julia> A = rand(M,K); B = rand(K,N);
julia> @benchmark mymult_avx2($A, $B)
BenchmarkTools.Trial:
memory estimate: 150.93 MiB
allocs estimate: 4
--------------
minimum time: 1.859 s (0.00% GC)
median time: 1.869 s (0.00% GC)
mean time: 1.974 s (0.02% GC)
maximum time: 2.192 s (0.00% GC)
--------------
samples: 3
evals/sample: 1
julia> BLAS.set_num_threads(1);
julia> @benchmark $A * $B
BenchmarkTools.Trial:
memory estimate: 150.34 MiB
allocs estimate: 2
--------------
minimum time: 1.513 s (0.00% GC)
median time: 1.515 s (0.00% GC)
mean time: 1.519 s (0.02% GC)
maximum time: 1.535 s (0.00% GC)
--------------
samples: 4
evals/sample: 1
julia> @benchmark mymult_avx2_threaded($A, $B)
BenchmarkTools.Trial:
memory estimate: 451.76 MiB
allocs estimate: 13700
--------------
minimum time: 153.235 ms (0.00% GC)
median time: 168.206 ms (0.00% GC)
mean time: 176.078 ms (1.30% GC)
maximum time: 250.434 ms (0.00% GC)
--------------
samples: 29
evals/sample: 1
julia> BLAS.set_num_threads(18);
julia> @benchmark $A * $B
BenchmarkTools.Trial:
memory estimate: 150.34 MiB
allocs estimate: 2
--------------
minimum time: 151.265 ms (0.00% GC)
median time: 169.038 ms (0.00% GC)
mean time: 173.594 ms (0.29% GC)
maximum time: 228.274 ms (0.00% GC)
--------------
samples: 29
evals/sample: 1
```

Something has gone wrong with my cooling over the past week. The CPU idles 10-15C hotter than normal, and under load 40C+ hotter than it used to.

I have no idea what is wrong. But this explains why OpenBLAS and the threaded version above performed similarly: the CPU throttled thermally to prevent from overheating, and OpenBLAS took around 50% longer than normal.

If things were operating normally, the `mymult_avx2_threaded`

wouldn’t have improved nearly as much.

3 Likes

I edited the above post, bumping this thread. So it seemed like I should add something topical:

Moore’s Law is Not Dead according to Jim Keller, coauthor of the x86-64 specification, lead development of Zen, as well as work for Apple, Tesla, and now Intel.

3 Likes