I rather like my corduroy bellbottom trousers…
I can’t read the paper. It says 7 hours to run that Python code - and, curiously, longer in Python 3.
The article describes the scientists as “boffins” — unfortunately, only at the end, otherwise I would have known to stop reading earlier.
Tamas, that is the style of Register articles. They have a lot of long-running, in-the-know humour. Please don't be offended by it.
I am not offended at all, I just find the article uninformative.
The timings in the article are strange:
7 hours for Python (not implausible, though I am not patient enough to repeat the test) and 47× faster in C - that would still be about 9 minutes!
Even my very naive implementation of matrix multiplication in Julia takes only 18s on my laptop, and the default OpenBLAS * takes less than 1s (Float32) / 2s (Float64).
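(The "very naive implementation" here is just the textbook triple loop, something like the following sketch; the function name is illustrative, not the exact code timed.)

# Naive matmul sketch; the m-innermost loop order is friendly to
# Julia's column-major array layout.
function naive_mult(A::Matrix{T}, B::Matrix{T}) where {T}
    M, K = size(A)
    K == size(B, 1) || error("inner dims must match")
    N = size(B, 2)
    C = zeros(T, M, N)
    @inbounds for n in 1:N, k in 1:K, m in 1:M
        C[m, n] += A[m, k] * B[k, n]
    end
    C
end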
Also the GPU timings look strange: in the article a high-end GPU took 70ms, while in my test my mediocre notebook GPU took 2.5 μs (using Float32).
Or did I miss something?
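For reference, a GPU timing along these lines might look like the sketch below (assuming a CUDA.jl setup, which is not necessarily what produced the numbers above). One easy pitfall: GPU kernels launch asynchronously, so without CUDA.@sync the timer only measures the kernel launch, not the multiply itself.

using CUDA, BenchmarkTools
A = CUDA.rand(Float32, 4096, 4096)
B = CUDA.rand(Float32, 4096, 4096)
# CUDA.@sync blocks until the kernel finishes, so the full multiply is timed.
@btime CUDA.@sync $A * $B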
I agree with you. I find it curious that Python 3 is slower than Python 2, as well.
mymult_avx2 is still slower than single-threaded OpenBLAS for me, but it is about 7 times faster than mymult_avx1 on my computer:
using LoopVectorization
# Micro-kernel: accumulate A*B into C, vectorized by LoopVectorization's @avx.
function mymult_avxcore!(C, A::AbstractArray{T,2}, B::AbstractArray{T,2}) where {T}
    @avx for n in 1:size(B, 2), k in 1:size(A, 2), m in 1:size(A, 1)
        C[m, n] += A[m, k] * B[k, n]
    end
end
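Note that this kernel accumulates into C, so the caller must pass a zero-initialized (or partially accumulated) output, e.g.:

A = rand(8, 8); B = rand(8, 8);
C = zeros(8, 8)       # must start from zero: the kernel does C .+= A*B
mymult_avxcore!(C, A, B)
C ≈ A * B             # true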
# Choose BLIS-style cache-blocking sizes (mc, kc, nc) from the L1/L2/L3 cache
# sizes reported by VectorizationBase; the 0.65, 5/8, and 0.4 factors leave
# headroom so the packed panels don't fill a cache level completely.
function matmul_params(::Type{T}) where {T}
    L₁, L₂, L₃ = LoopVectorization.VectorizationBase.CACHE_SIZE
    W = LoopVectorization.VectorizationBase.pick_vector_width(T)
    L₁ratio = L₁ ÷ (LoopVectorization.nᵣ * sizeof(T))
    kc = round(Int, 0.65L₁ratio)  # panel depth: a kc × nᵣ strip of B fits in L1
    mcrep = 5L₂ ÷ (8kc * sizeof(T) * LoopVectorization.mᵣ * W)
    ncrep = L₃ ÷ (kc * sizeof(T) * LoopVectorization.nᵣ)
    mc = mcrep * LoopVectorization.mᵣ * W  # an mc × kc panel of A fits in L2
    nc = round(Int, 0.4ncrep * LoopVectorization.nᵣ)  # a kc × nc panel of B fits in L3
    mc, kc, nc
end
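The resulting block sizes are machine dependent, since they are derived from your CPU's cache sizes:

mc, kc, nc = matmul_params(Float64)  # values differ across CPUs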
function mymult_avx2(A::AbstractArray{T,2}, B::AbstractArray{T,2}) where {T}
    K = size(B, 1)
    size(A, 2) == K || error("inner dims must match")
    M = size(A, 1); N = size(B, 2)
    C = zeros(T, M, N)
    mstep, kstep, nstep = matmul_params(T)
    Abuffer = Matrix{T}(undef, mstep, kstep)  # contiguous copy of the current A-panel
    mstep -= 1; kstep -= 1; nstep -= 1
    n = 1
    while n <= N
        nnext = min(N, n + nstep)
        k = 1
        while k <= K
            knext = min(K, k + kstep)
            m = 1
            while m <= M
                mnext = min(M, m + mstep)
                # Pack the A-panel so the kernel reads it contiguously.
                Abufferview = @view(Abuffer[1:mnext+1-m, 1:knext+1-k])
                copyto!(Abufferview, @view(A[m:mnext, k:knext]))
                @views mymult_avxcore!(C[m:mnext, n:nnext], Abufferview, B[k:knext, n:nnext])
                m = mnext + 1
            end
            k = knext + 1
        end
        n = nnext + 1
    end
    C
end
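A quick sanity check against the built-in multiply, to confirm the blocking handles ragged edges:

A = rand(500, 300); B = rand(300, 400);
mymult_avx2(A, B) ≈ A * B  # should be true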
You could also use Threads.@spawn in the m or n loops to speed it up further.
LoopVectorization does not currently perform any memory-based optimizations, so you need to do this manually (as above) to get good performance for large arrays.
EDIT:
This version is about as fast for square Float64 matrices of around 4000 per dimension, but should be faster for larger matrices (and slower for smaller ones):
# Like mymult_avx2, but also packs the B-panel into a contiguous buffer,
# which should pay off once the B tiles no longer fit comfortably in cache.
function mymult_avx3(A::AbstractArray{T,2}, B::AbstractArray{T,2}) where {T}
    K = size(B, 1)
    size(A, 2) == K || error("inner dims must match")
    M = size(A, 1); N = size(B, 2)
    C = zeros(T, M, N)
    mstep, kstep, nstep = matmul_params(T)
    Abuffer = Matrix{T}(undef, mstep, kstep)
    Bbuffer = Matrix{T}(undef, kstep, nstep)
    mstep -= 1; kstep -= 1; nstep -= 1
    n = 1
    while n <= N
        nnext = min(N, n + nstep)
        k = 1
        while k <= K
            knext = min(K, k + kstep)
            Bbufferview = @view(Bbuffer[1:knext+1-k, 1:nnext+1-n])
            copyto!(Bbufferview, @view(B[k:knext, n:nnext]))
            m = 1
            while m <= M
                mnext = min(M, m + mstep)
                Abufferview = @view(Abuffer[1:mnext+1-m, 1:knext+1-k])
                copyto!(Abufferview, @view(A[m:mnext, k:knext]))
                mymult_avxcore!(@view(C[m:mnext, n:nnext]), Abufferview, Bbufferview)
                m = mnext + 1
            end
            k = knext + 1
        end
        n = nnext + 1
    end
    C
end
A threaded version:
function mymult_avx2_threaded(A::AbstractArray{T,2}, B::AbstractArray{T,2}) where {T}
    K = size(B, 1)
    size(A, 2) == K || error("inner dims must match")
    M = size(A, 1); N = size(B, 2)
    C = zeros(T, M, N)
    mstep, kstep, nstep = matmul_params(T)
    @sync for _n in 0:cld(N, nstep)-1  # one task per n-panel
        Threads.@spawn begin
            n = min(N, _n * nstep + 1)
            nnext = min(N, n + nstep - 1)
            k = 1
            while k <= K
                knext = min(K, k + kstep - 1)
                @sync for _m in 0:cld(M, mstep)-1  # one task per m-panel
                    Threads.@spawn begin
                        m = min(M, _m * mstep + 1)
                        mnext = min(M, m + mstep - 1)
                        Aslice = A[m:mnext, k:knext]  # rough: allocate a slice instead of packing
                        @views mymult_avxcore!(C[m:mnext, n:nnext], Aslice, B[k:knext, n:nnext])
                    end
                end
                k = knext + 1
            end
        end
    end
    C
end
Note of course that these are rough implementations - e.g., the threaded version just takes allocating slices of A rather than packing into a reused buffer. These yield:
julia> using LinearAlgebra, BenchmarkTools; Threads.nthreads()
18
julia> M = K = N = 4439;
julia> A = rand(M,K); B = rand(K,N);
julia> @benchmark mymult_avx2($A, $B)
BenchmarkTools.Trial:
memory estimate: 150.93 MiB
allocs estimate: 4
--------------
minimum time: 1.859 s (0.00% GC)
median time: 1.869 s (0.00% GC)
mean time: 1.974 s (0.02% GC)
maximum time: 2.192 s (0.00% GC)
--------------
samples: 3
evals/sample: 1
julia> BLAS.set_num_threads(1);
julia> @benchmark $A * $B
BenchmarkTools.Trial:
memory estimate: 150.34 MiB
allocs estimate: 2
--------------
minimum time: 1.513 s (0.00% GC)
median time: 1.515 s (0.00% GC)
mean time: 1.519 s (0.02% GC)
maximum time: 1.535 s (0.00% GC)
--------------
samples: 4
evals/sample: 1
julia> @benchmark mymult_avx2_threaded($A, $B)
BenchmarkTools.Trial:
memory estimate: 451.76 MiB
allocs estimate: 13700
--------------
minimum time: 153.235 ms (0.00% GC)
median time: 168.206 ms (0.00% GC)
mean time: 176.078 ms (1.30% GC)
maximum time: 250.434 ms (0.00% GC)
--------------
samples: 29
evals/sample: 1
julia> BLAS.set_num_threads(18);
julia> @benchmark $A * $B
BenchmarkTools.Trial:
memory estimate: 150.34 MiB
allocs estimate: 2
--------------
minimum time: 151.265 ms (0.00% GC)
median time: 169.038 ms (0.00% GC)
mean time: 173.594 ms (0.29% GC)
maximum time: 228.274 ms (0.00% GC)
--------------
samples: 29
evals/sample: 1
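For perspective (this arithmetic is mine, computed from the minimum times above): an M×K times K×N multiply does 2·M·K·N floating-point operations, so:

flops = 2 * 4439^3          # ≈ 1.75e11 floating point operations
flops / 1.859    / 1e9      # mymult_avx2, 1 thread   ≈   94 GFLOPS
flops / 1.513    / 1e9      # OpenBLAS,    1 thread   ≈  116 GFLOPS
flops / 0.153235 / 1e9      # mymult_avx2_threaded    ≈ 1142 GFLOPS
flops / 0.151265 / 1e9      # OpenBLAS,   18 threads  ≈ 1157 GFLOPS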
Something has gone wrong with my cooling over the past week. The CPU idles 10-15C hotter than normal, and under load 40C+ hotter than it used to.
I have no idea what is wrong, but this explains why OpenBLAS and the threaded version above performed similarly: the CPU throttled to prevent overheating, and OpenBLAS took around 50% longer than normal.
If things were operating normally, mymult_avx2_threaded wouldn't have compared nearly as favorably.
I edited the above post, bumping this thread. So it seemed like I should add something topical:
Moore’s Law is Not Dead, according to Jim Keller - co-author of the x86-64 specification and lead developer of Zen, with stints at Apple and Tesla, and now at Intel.