Performance of creating a tuple with a for loop

If a CPU is running at 4 GHz (i.e., 4 clock cycles / ns), then 2 ns means 8 clock cycles.
On Skylake-X, an fma instruction has a latency of 4 clock cycles, and the CPU can execute 2 of them per clock cycle.
Theoretically, that means it could start 8 fmas within the first nanosecond (4 cycles, 2 per cycle), and, given the 4-cycle latency, all 8 of these would have finished executing by the end of the 2nd nanosecond.
Given AVX512 (8 Float64 per register) and counting each fma as two floating point operations, that means it could complete 8 × 8 × 2 = 128 Float64 floating point operations in those 2 nanoseconds, or 64 Float64 operations on average per nanosecond.
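Here is that arithmetic spelled out in Julia, using the Skylake-X figures above:

cycles_per_ns  = 4        # 4 GHz clock
fmas_per_cycle = 2        # two fma units on Skylake-X
lanes          = 8        # Float64 lanes per 512-bit AVX512 register
flops_per_fma  = 2        # each fma is a multiply and an add

fmas_started_first_ns  = cycles_per_ns * fmas_per_cycle                          # 8
flops_in_2ns           = fmas_started_first_ns * lanes * flops_per_fma           # 128
sustained_flops_per_ns = fmas_per_cycle * lanes * flops_per_fma * cycles_per_ns  # 128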
That is while spending half of that time not issuing any new instructions at all. Running the computation for longer amortizes the startup and gets you closer to that 128/nanosecond:

julia> using LoopVectorization, BenchmarkTools

julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(C,1), n ∈ axes(C,2)
               Cₘₙ = zero(eltype(C))
               for k ∈ axes(B,1)
                   Cₘₙ += A[m,k] * B[k,n]
               end
               C[m,n] = Cₘₙ
           end
           C
       end
AmulB! (generic function with 1 method)

julia> M = K = N = 72;

julia> A = rand(M, K); B = rand(K, N); C = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.117 μs (0.00% GC)
  median time:      6.151 μs (0.00% GC)
  mean time:        6.161 μs (0.00% GC)
  maximum time:     18.320 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> 2M*K*N / 6117
122.03629230014712

Average of 122 floating point operations completed per nanosecond.
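If you want that number directly, the same calculation can be scripted; a small sketch using BenchmarkTools' @belapsed, which returns the minimum elapsed time in seconds:

t = @belapsed AmulB!($C, $A, $B)    # minimum time in seconds
flops_per_ns = 2M*K*N / (t * 1e9)   # ≈ 122 on this machine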

A computer can get a lot done per nanosecond. :wink:
But any reported time under 1/(clock speed in GHz) nanoseconds, i.e. less than a single clock cycle, is unbelievable.

(Full disclosure: the computer I benchmarked on runs AVX512 code at 4.1 GHz rather than 4.0 GHz; 4 is cleaner for explanations.)
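For reference, the single-clock-cycle floor those numbers imply:

1 / 4.0   # one clock cycle at 4.0 GHz: 0.25 ns
1 / 4.1   # one clock cycle at 4.1 GHz: ≈ 0.244 ns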
