If a CPU runs at 4 GHz (i.e., 4 clock cycles per nanosecond), then 2 ns corresponds to 8 clock cycles.
On Skylake-X, an `fma` (fused multiply-add) instruction has a latency of 4 clock cycles, and the CPU can begin executing 2 of them per clock cycle. Theoretically, that means it could start 8 `fma`s within the first nanosecond (4 cycles, 2 per cycle), and all 8 of these would have finished executing by the end of the 2nd nanosecond. Given AVX512, each `fma` operates on 8 `Float64`s and performs 2 floating point operations per element (a multiply and an add), so that works out to 128 `Float64` floating point operations completed in 2 nanoseconds, or 64 `Float64` operations on average per nanosecond.
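To make that arithmetic explicit, here is the same peak-rate calculation in the REPL (the variable names are mine, purely for illustration):
julia> cycles_per_ns = 4;         # 4 GHz clock
julia> fmas_per_cycle = 2;        # Skylake-X can start two fmas per cycle
julia> flops_per_fma = 2 * 8;     # a multiply and an add on each of 8 Float64 lanes
julia> cycles_per_ns * fmas_per_cycle * flops_per_fma   # sustained peak flops per nanosecond
128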
Yet that scenario spends half its time without issuing a single new instruction, just waiting on the latency of the `fma`s already in flight. Running the benchmark for longer amortizes that ramp-up and gets you closer to the sustained peak of 128 floating point operations per nanosecond:
julia> using LoopVectorization, BenchmarkTools
julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(C,1), n ∈ axes(C,2)
               Cₘₙ = zero(eltype(C))
               for k ∈ axes(B,1)
                   Cₘₙ += A[m,k] * B[k,n]
               end
               C[m,n] = Cₘₙ
           end
           C
       end
AmulB! (generic function with 1 method)
julia> M = K = N = 72;
julia> A = rand(M, K); B = rand(K, N); C = Matrix{Float64}(undef, M, N);
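julia> AmulB!(C, A, B) ≈ A * B   # quick sanity check before timing: kernel should match the built-in product
true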
julia> @benchmark AmulB!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.117 μs (0.00% GC)
  median time:      6.151 μs (0.00% GC)
  mean time:        6.161 μs (0.00% GC)
  maximum time:     18.320 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5
julia> 2M*K*N / 6117
122.03629230014712
That is an average of 122 floating point operations completed per nanosecond (6.117 μs = 6117 ns).
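If you would rather not copy timings by hand, BenchmarkTools can produce the same figure directly: `@belapsed` returns the minimum run time in seconds, so dividing the flop count by it (scaled to nanoseconds) gives the rate. A minimal sketch:
julia> t = @belapsed AmulB!($C, $A, $B);   # minimum time in seconds
julia> 2M*K*N / (t * 1e9);                 # flops per nanosecond, ≈ 122 on the machine above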
A computer can get a lot done per nanosecond.
But any measured time under 1/(clock speed in GHz) nanoseconds, i.e., less than a single clock cycle, is unbelievable.
(Full disclosure: the computer I benchmarked on runs AVX512 code at 4.1 GHz rather than 4.0 GHz; 4 is a cleaner number for the explanation.)
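A classic way to get such an unbelievable number is benchmarking an expression the compiler can constant-fold, so that no work is actually measured. The exact figure varies by machine, but something like the following will typically report a small fraction of a nanosecond:
julia> @btime 0.5 * 2.0;   # constant-folded at compile time; the multiply never runs
Interpolating runtime values into the benchmarked expression (as with `$C`, `$A`, `$B` above) helps ensure you are measuring real work.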