Moore's Law is deader than corduroy bell bottoms. But with a bit of smart coding it's not the end of the road

I rather like my corduroy bell-bottom trousers…

I can’t read the paper. It says 7 hours to run that Python code - and, curiously, longer in Python 3.

The article describes the scientists as “boffins” — unfortunately, only at the end, otherwise I would have known to stop reading earlier.

Tamas, that is the style of Register articles. They have a lot of long-running, in-the-know humour.
Please don't be offended by it.

1 Like

I am not offended at all, I just find the article uninformative.

1 Like

The timings in the article are strange:
7 hours for Python (not implausible, though I am not patient enough to repeat the test), and 47× faster in C - that would still be 9 minutes!
Even my very naive implementation of matrix multiplication in Julia (see the sketch below) takes only 18 s on my laptop, and the default OpenBLAS * takes less than 1 s (Float32) / 2 s (Float64).
The GPU timings also look strange: in the article a high-end GPU took 70 ms, while in my test my mediocre notebook GPU took 2.5 μs (using Float32).
Or did I miss something?
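
For reference, by "very naive implementation" I mean a plain triple loop along these lines (a minimal sketch; mymult_naive is just an illustrative name, not the exact code I timed):

function mymult_naive(A::AbstractMatrix{T}, B::AbstractMatrix{T}) where {T}
    M, K = size(A)
    size(B, 1) == K || error("inner dims must match")
    N = size(B, 2)
    C = zeros(T, M, N)
    # Loop order (n, k, m) keeps the innermost loop striding down columns,
    # which matches Julia's column-major memory layout.
    @inbounds for n in 1:N, k in 1:K, m in 1:M
        C[m, n] += A[m, k] * B[k, n]
    end
    C
end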

I agree with you. I also find it curious that Python 3 is slower than Python 2.

mymult_avx2 is still slower than single-threaded OpenBLAS for me, but is about 7 times faster than mymult_avx1 on my computer:

using LoopVectorization
# Kernel: accumulate A*B into C, vectorized with @avx.
# Loop order (n, k, m) keeps the innermost loop striding down columns,
# matching Julia's column-major layout.
function mymult_avxcore!(C, A::AbstractArray{T,2}, B::AbstractArray{T,2}) where {T}
    @avx for n = 1:size(B, 2), k = 1:size(A, 2), m = 1:size(A, 1)
        C[m, n] += A[m, k] * B[k, n]
    end
end
# Choose cache-blocking sizes (mc, kc, nc) from the L1/L2/L3 cache sizes and
# LoopVectorization's microkernel tile sizes (mᵣ, nᵣ).
function matmul_params(::Type{T}) where {T}
    L₁, L₂, L₃ = LoopVectorization.VectorizationBase.CACHE_SIZE
    W = LoopVectorization.VectorizationBase.pick_vector_width(T)
    L₁ratio = L₁ ÷ (LoopVectorization.nᵣ * sizeof(T))
    kc = round(Int, 0.65L₁ratio)                      # kc × nᵣ B micro-panel stays in L1
    mcrep = 5L₂ ÷ (8kc * sizeof(T) * LoopVectorization.mᵣ * W)
    ncrep = L₃ ÷ (kc * sizeof(T) * LoopVectorization.nᵣ)
    mc = mcrep * LoopVectorization.mᵣ * W             # mc × kc packed A block stays in L2
    nc = round(Int, 0.4ncrep * LoopVectorization.nᵣ)  # kc × nc B block stays in L3
    mc, kc, nc
end
# Blocked matmul: tile the (m, k, n) iteration space so each panel fits in
# cache, packing the current A block into a contiguous buffer.
function mymult_avx2(A::AbstractArray{T,2}, B::AbstractArray{T,2}) where {T}
    K = size(B, 1)
    size(A, 2) == K || error("inner dims must match")
    M = size(A, 1); N = size(B, 2)
    C = zeros(T, M, N)
    mstep, kstep, nstep = matmul_params(T)
    Abuffer = Matrix{T}(undef, mstep, kstep)
    mstep -= 1; kstep -= 1; nstep -= 1  # convert block sizes to inclusive-range offsets
    n = 1
    while n <= N
        nnext = min(N, n + nstep)
        k = 1
        while k <= K
            knext = min(K, k + kstep)
            m = 1
            while m <= M
                mnext = min(M, m + mstep)
                # Pack the current A block into the contiguous buffer.
                Abufferview = @view(Abuffer[1:mnext+1-m, 1:knext+1-k])
                copyto!(Abufferview, @view(A[m:mnext, k:knext]))
                @views mymult_avxcore!(C[m:mnext, n:nnext], Abufferview, B[k:knext, n:nnext])
                m = mnext + 1
            end
            k = knext + 1
        end
        n = nnext + 1
    end
    C
end

You could also use Threads.@spawn in the m or n loops to speed it up further (see mymult_avx2_threaded below).

LoopVectorization does not currently perform any memory-based optimizations, so you need to do this manually (as above) to get good performance for large arrays.
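
To make "manually" concrete: mymult_avxcore! alone is already a correct multiply at any size; it just loses throughput once the operands fall out of cache, which is what the blocking above restores. A minimal sanity check (sizes arbitrary):

A = rand(1024, 1024); B = rand(1024, 1024);
C = zeros(1024, 1024);
mymult_avxcore!(C, A, B)    # unblocked kernel call
C ≈ A * B                   # true: matches OpenBLAS
mymult_avx2(A, B) ≈ A * B   # true: the blocked version matches too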

EDIT:
This version is about as fast for square Float64 matrices of around 4000 per dimension, but should be faster for larger matrices (and slower for smaller ones):

# Like mymult_avx2, but additionally packs the current B block into a
# contiguous buffer before calling the kernel.
function mymult_avx3(A::AbstractArray{T,2}, B::AbstractArray{T,2}) where {T}
    K = size(B, 1)
    size(A, 2) == K || error("inner dims must match")
    M = size(A, 1); N = size(B, 2)
    C = zeros(T, M, N)
    mstep, kstep, nstep = matmul_params(T)
    Abuffer = Matrix{T}(undef, mstep, kstep)
    Bbuffer = Matrix{T}(undef, kstep, nstep)
    mstep -= 1; kstep -= 1; nstep -= 1  # convert block sizes to inclusive-range offsets
    n = 1
    while n <= N
        nnext = min(N, n + nstep)
        k = 1
        while k <= K
            knext = min(K, k + kstep)
            # Pack the current B block once per (k, n) tile.
            Bbufferview = @view(Bbuffer[1:knext+1-k, 1:nnext+1-n])
            copyto!(Bbufferview, @view(B[k:knext, n:nnext]))
            m = 1
            while m <= M
                mnext = min(M, m + mstep)
                Abufferview = @view(Abuffer[1:mnext+1-m, 1:knext+1-k])
                copyto!(Abufferview, @view(A[m:mnext, k:knext]))
                mymult_avxcore!(@view(C[m:mnext, n:nnext]), Abufferview, Bbufferview)
                m = mnext + 1
            end
            k = knext + 1
        end
        n = nnext + 1
    end
    C
end
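
The only change from mymult_avx2 is packing the B blocks as well, so the extra copies only pay off once the B panels stop fitting in cache. To find where that crossover lands on a given machine, one could compare along these lines (sizes arbitrary):

using BenchmarkTools
A = rand(8000, 8000); B = rand(8000, 8000);  # large enough that packing B should matter
@btime mymult_avx2($A, $B);
@btime mymult_avx3($A, $B);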

A threaded version:

function mymult_avx2_threaded(A::AbstractArray{T,2}, B::AbstractArray{T,2}) where {T}
    K = size(B, 1)
    size(A, 2) == K || error("inner dims must match")
    M = size(A, 1); N = size(B, 2)
    C = zeros(T, M, N)
    mstep, kstep, nstep = matmul_params(T)
    # Spawn one task per n-block; the tasks write to disjoint column ranges of C.
    @sync for _n in 0:cld(N, nstep)-1
        Threads.@spawn begin
            n = min(N, _n * nstep + 1)
            nnext = min(N, n + nstep - 1)
            k = 1
            while k <= K
                knext = min(K, k + kstep - 1)
                # Within each k-block, spawn one task per m-block and wait for
                # all of them before advancing k, so the C updates never race.
                @sync for _m in 0:cld(M, mstep)-1
                    Threads.@spawn begin
                        m = min(M, _m * mstep + 1)
                        mnext = min(M, m + mstep - 1)
                        Aslice = A[m:mnext, k:knext]  # allocates a copy of the A block
                        @views mymult_avxcore!(C[m:mnext, n:nnext], Aslice, B[k:knext, n:nnext])
                    end
                end
                k = knext + 1
            end
        end
    end
    C
end

Note of course that these are rough implementations. For example, the threaded version simply slices out copies of the A blocks (Aslice = A[m:mnext, k:knext]) rather than reusing a preallocated pack buffer, which is where most of its extra allocation below comes from.
These yield

julia> using LinearAlgebra, BenchmarkTools; Threads.nthreads()
18

julia> M = K = N = 4439;

julia> A = rand(M,K); B = rand(K,N);

julia> @benchmark mymult_avx2($A, $B)
BenchmarkTools.Trial:
  memory estimate:  150.93 MiB
  allocs estimate:  4
  --------------
  minimum time:     1.859 s (0.00% GC)
  median time:      1.869 s (0.00% GC)
  mean time:        1.974 s (0.02% GC)
  maximum time:     2.192 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1

julia> BLAS.set_num_threads(1);

julia> @benchmark $A * $B
BenchmarkTools.Trial:
  memory estimate:  150.34 MiB
  allocs estimate:  2
  --------------
  minimum time:     1.513 s (0.00% GC)
  median time:      1.515 s (0.00% GC)
  mean time:        1.519 s (0.02% GC)
  maximum time:     1.535 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1

julia> @benchmark mymult_avx2_threaded($A, $B)
BenchmarkTools.Trial:
  memory estimate:  451.76 MiB
  allocs estimate:  13700
  --------------
  minimum time:     153.235 ms (0.00% GC)
  median time:      168.206 ms (0.00% GC)
  mean time:        176.078 ms (1.30% GC)
  maximum time:     250.434 ms (0.00% GC)
  --------------
  samples:          29
  evals/sample:     1

julia> BLAS.set_num_threads(18);

julia> @benchmark $A * $B
BenchmarkTools.Trial:
  memory estimate:  150.34 MiB
  allocs estimate:  2
  --------------
  minimum time:     151.265 ms (0.00% GC)
  median time:      169.038 ms (0.00% GC)
  mean time:        173.594 ms (0.29% GC)
  maximum time:     228.274 ms (0.00% GC)
  --------------
  samples:          29
  evals/sample:     1

Something has gone wrong with my cooling over the past week. The CPU idles 10-15 °C hotter than normal, and runs 40 °C+ hotter than it used to under load.
I have no idea what is wrong, but this explains why OpenBLAS and the threaded version above performed similarly: the CPU throttled to prevent overheating, and OpenBLAS took around 50% longer than normal.
If things were operating normally, mymult_avx2_threaded wouldn't have compared nearly as well.

3 Likes

I edited the above post, bumping this thread. So it seemed like I should add something topical:

Moore’s Law is Not Dead according to Jim Keller, co-author of the x86-64 specification and leader of AMD's Zen development, with work at Apple and Tesla, and now at Intel.

3 Likes