Aside from AVX512, the biggest improvement in Rocket Lake over the previous generation is its much better integrated graphics.
Still worth pointing out that my Tiger Lake laptop has an i7-1165G7, which comes with 96 execution units vs 32 in Rocket Lake. Other than meaning the laptop is 3x faster for graphics, I'm not sure what the ramifications are.
Does it matter if you don't play games or use oneAPI.jl? Will you have problems streaming videos?
Looking online, if FP32 throughput is an indicator:
96 EU of Xe graphics (Tiger Lake): 1_690 GFLOPS
GTX 650, which seems to sell for $60-$100: 812.5 GFLOPS
Presumably the 32 EUs in Rocket Lake will get you around 560 GFLOPS.
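A quick back-of-the-envelope check on that last number, assuming FP32 throughput scales linearly with EU count at equal clocks:

julia> 1690 * 32 / 96 # GFLOPS, scaled from 96 EUs down to 32
563.3333333333334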
Also on AVX512, it's worth pointing out that there is segmentation in AVX512 throughput between Intel's HEDT/high-end server chips and their low-end server and consumer chips.
The chips can see huge benefits from AVX512: e.g., in AnandTech's test at 290 W power draw, the 11700K was more than 5x faster than the competition.
For example:
Matmul benchmarks on the 10980XE-HEDT (compilation, 32x32, 48x48, 72x72)
julia> using LoopVectorization, BenchmarkTools
julia> M = K = N = 32; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
0.492570 seconds (2.48 M allocations: 132.454 MiB, 9.66% gc time, 99.92% compilation time)
julia> function AmulB!(C,A,B)
           @avxt for n in indices((B,C),2), m in indices((A,C),1)
               Cmn = zero(eltype(C))
               for k in indices((A,B),(2,1))
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)
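(Here indices((A,B),(2,1)) is LoopVectorization's helper that returns the index range shared by axis 2 of A and axis 1 of B, and @avxt is the threaded variant of the @avx macro.)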
julia> @time(AmulB!(C0,A,B)); C0 ≈ C1
10.613965 seconds (19.79 M allocations: 1.114 GiB, 4.63% gc time)
true
julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 488.267 ns (0.00% GC)
median time: 513.287 ns (0.00% GC)
mean time: 513.106 ns (0.00% GC)
maximum time: 955.810 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 195
julia> 2e-9M*K*N/488.267e-9 # GFLOPS: matmul does 2*M*K*N flops (a multiply and an add per term), divided by time in seconds
134.22164512449132
julia> M = K = N = 48; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
0.000032 seconds (2 allocations: 18.078 KiB)
julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.107 μs (0.00% GC)
median time: 1.158 μs (0.00% GC)
mean time: 1.161 μs (0.00% GC)
maximum time: 6.008 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> 2e-9M*K*N/1.107e-6
199.80487804878052
julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
0.000070 seconds (2 allocations: 40.578 KiB)
julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.894 μs (0.00% GC)
median time: 1.982 μs (0.00% GC)
mean time: 1.984 μs (0.00% GC)
maximum time: 9.586 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> 2e-9M*K*N/1.894e-6
394.13727560718064
Matmul benchmarks on the 1165G7-laptop (compilation, 32x32, 48x48, 72x72)
julia> @time using LoopVectorization
1.555713 seconds (2.88 M allocations: 169.122 MiB, 4.68% gc time)
julia> M = K = N = 32; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
0.411346 seconds (2.47 M allocations: 131.872 MiB, 10.45% gc time, 99.94% compilation time)
julia> function AmulB!(C,A,B)
           @avxt for n in indices((B,C),2), m in indices((A,C),1)
               Cmn = zero(eltype(C))
               for k in indices((A,B),(2,1))
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)
julia> @time(AmulB!(C0,A,B)); C0 ≈ C1
8.947281 seconds (19.31 M allocations: 1.087 GiB, 4.63% gc time)
true
julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 657.172 ns (0.00% GC)
median time: 708.396 ns (0.00% GC)
mean time: 713.941 ns (0.00% GC)
maximum time: 5.318 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 169
julia> 2e-9M*K*N/657.172e-9
99.72427309745395
julia> M = K = N = 48; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
0.000047 seconds (2 allocations: 18.078 KiB)
julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.742 μs (0.00% GC)
median time: 1.931 μs (0.00% GC)
mean time: 1.932 μs (0.00% GC)
maximum time: 5.711 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> 2e-9M*K*N/1.742e-6
126.97129735935708
julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
0.003418 seconds (2 allocations: 40.578 KiB)
julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.222 μs (0.00% GC)
median time: 3.493 μs (0.00% GC)
mean time: 3.491 μs (0.00% GC)
maximum time: 9.560 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> 2e-9M*K*N/3.222e-6
231.6871508379889
Apple M1 Native
julia> using LoopVectorization, BenchmarkTools
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]
julia> M = K = N = 32; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
0.292382 seconds (2.47 M allocations: 131.780 MiB, 8.28% gc time, 99.92% compilation time)
julia> function AmulB!(C,A,B)
           @avxt for n in indices((B,C),2), m in indices((A,C),1)
               Cmn = zero(eltype(C))
               for k in indices((A,B),(2,1))
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)
julia> @time(AmulB!(C0,A,B)); C0 ≈ C1
5.170382 seconds (16.66 M allocations: 948.861 MiB, 5.60% gc time)
true
julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.421 μs (0.00% GC)
median time: 1.429 μs (0.00% GC)
mean time: 1.434 μs (0.00% GC)
maximum time: 2.754 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
julia> 2e-9M*K*N/1.421e-6
46.11963406052076
julia> M = K = N = 48; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
0.000045 seconds (2 allocations: 18.078 KiB)
julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.833 μs (0.00% GC)
median time: 3.990 μs (0.00% GC)
mean time: 3.998 μs (0.00% GC)
maximum time: 7.911 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> 2e-9M*K*N/3.833e-6
57.705191755804854
julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
0.000278 seconds (2 allocations: 40.578 KiB)
julia> @benchmark AmulB!($C0,$A,$B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.100 μs (0.00% GC)
median time: 6.650 μs (0.00% GC)
mean time: 6.662 μs (0.00% GC)
maximum time: 14.750 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 5
julia> 2e-9M*K*N/6.1e-6
122.37639344262297
julia> versioninfo()
Julia Version 1.7.0-DEV.763
Commit 2ec75d65ce (2021-03-29 14:29 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin20.3.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
JULIA_NUM_THREADS = 8
The laptop compiles faster (8.94 vs 10.6 seconds "time to first matmul"), but runs markedly slower, e.g. 231.7 GFLOPS vs 394.1 GFLOPS for 72x72 matrices.
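Collecting the GFLOPS numbers computed above, by matrix size:

Size    10980XE (HEDT)   1165G7 (laptop)   Apple M1
32x32       134.2             99.7            46.1
48x48       199.8            127.0            57.7
72x72       394.1            231.7           122.4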
Using the master branch of LinuxPerf; I omitted a warmup run. 10980XE (HEDT, Cascade Lake):
julia> using LinuxPerf
julia> foreachf(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N)) # call f(args...) N times
foreachf (generic function with 1 method)
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
           foreachf(AmulB!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.34e+09   50.0%  #  4.1 cycles per ns
┌ instructions             8.32e+09   75.0%  #  2.5 insns per cycle
│ branch-instructions      2.13e+08   75.0%  #  2.6% of instructions
└ branch-misses            1.90e+06   75.0%  #  0.9% of branch instructions
┌ task-clock               8.18e+08  100.0%  # 817.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    6.97e+08   25.0%  # 26.3% of dcache loads
│ L1-dcache-loads          2.65e+09   25.0%
└ L1-icache-load-misses    3.87e+04   25.0%
┌ dTLB-load-misses         0.00e+00   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               2.65e+09   25.0%
                 aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1165G7 (laptop, Tiger Lake):
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
           foreachf(AmulB!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               5.36e+09   50.1%  #  4.0 cycles per ns
┌ instructions             8.25e+09   75.1%  #  1.5 insns per cycle
│ branch-instructions      1.98e+08   75.1%  #  2.4% of instructions
└ branch-misses            2.92e+06   75.1%  #  1.5% of branch instructions
┌ task-clock               1.35e+09  100.0%  #  1.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    5.85e+08   75.1%  # 22.2% of dcache loads
│ L1-dcache-loads          2.63e+09   75.1%
└ L1-icache-load-misses    2.77e+04   75.1%
┌ dTLB-load-misses         1.01e+03   24.9%  #  0.0% of dTLB loads
└ dTLB-loads               2.67e+09   24.9%
                 aggregated from 4 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
While the clock speeds are similar, the 10980XE hits 2.5 instructions per clock, vs just 1.5 on the laptop.
Tiger Lake (and Rocket Lake) actually have larger reorder buffers and probably better branch predictors, etc., than Cascade Lake. However, they have just a single port capable of executing many common 512-bit instructions, like fused multiply-adds, while Cascade Lake has two.
The link shows Cascade Lake can use ports 0 or 5, while Ice Lake can only use port 0. As a result, the reciprocal throughput is 0.5 for Cascade Lake and 1 for Ice Lake.
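As a rough sanity check (just a sketch, assuming all 4 threads sustain the clock speeds the perf output above reports, with 8 Float64 lanes per 512-bit FMA and 2 flops per FMA), the FMA port counts alone predict most of the matmul gap:

julia> peakgflops(ghz, fma_ports; lanes = 8, threads = 4) = ghz * fma_ports * lanes * 2 * threads
peakgflops (generic function with 1 method)

julia> peakgflops(4.1, 2) # Cascade Lake 10980XE; measured 394 GFLOPS at 72x72
524.8

julia> peakgflops(4.0, 1) # Tiger Lake 1165G7; measured 232 GFLOPS at 72x72
256.0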
But linear algebra/matmul tends to be an extreme case. The laptop fares much better in these special function benchmarks, for example:
Cascade Lake (HEDT)
julia> using VectorizationBase, SLEEFPirates
julia> vu = Vec(10 .* rand(16)...)
2 x Vec{8, Float64}
Vec{8, Float64}<8.798383134546386, 9.022828046665666, 7.595386605047971, 8.903364923350454, 1.439724621424312, 8.799483255120942, 7.529824692778755, 9.398678780573114>
Vec{8, Float64}<1.0919972116624876, 8.5262997763817, 1.3898563445399836, 3.1224598343675214, 5.264211844189135, 4.618134635075415, 0.09844041961554195, 8.096211429945946>
julia> @benchmark log($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 8.034 ns (0.00% GC)
median time: 8.296 ns (0.00% GC)
mean time: 8.263 ns (0.00% GC)
maximum time: 24.097 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @benchmark exp($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.918 ns (0.00% GC)
median time: 5.999 ns (0.00% GC)
mean time: 6.023 ns (0.00% GC)
maximum time: 22.572 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.931 ns (0.00% GC)
median time: 4.958 ns (0.00% GC)
mean time: 4.974 ns (0.00% GC)
maximum time: 20.548 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 17.426 ns (0.00% GC)
median time: 17.515 ns (0.00% GC)
mean time: 17.544 ns (0.00% GC)
maximum time: 34.147 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 998
Tiger Lake (laptop)
julia> using VectorizationBase, SLEEFPirates
julia> vu = Vec(10 .* rand(16)...)
2 x Vec{8, Float64}
Vec{8, Float64}<3.3090933998106964, 3.2408725486003176, 9.336803901394287, 0.920913844633362, 8.254820718984012, 5.840193768054263, 4.970561992905824, 3.856892363553288>
Vec{8, Float64}<2.4728363240217877, 4.159144687195548, 4.057230309919828, 2.8915047878074995, 3.2782916658210492, 3.5649015726035516, 5.126827865591639, 4.177712129409485>
julia> @benchmark log($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 8.281 ns (0.00% GC)
median time: 8.680 ns (0.00% GC)
mean time: 8.731 ns (0.00% GC)
maximum time: 87.352 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @benchmark exp($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.542 ns (0.00% GC)
median time: 6.999 ns (0.00% GC)
mean time: 6.925 ns (0.00% GC)
maximum time: 22.851 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.542 ns (0.00% GC)
median time: 6.860 ns (0.00% GC)
mean time: 6.903 ns (0.00% GC)
maximum time: 22.935 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 15.802 ns (0.00% GC)
median time: 16.948 ns (0.00% GC)
mean time: 16.948 ns (0.00% GC)
maximum time: 32.268 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 998
Threadripper 1950X (Zen1 Ryzen)
julia> using VectorizationBase, SLEEFPirates, BenchmarkTools
julia> vu = Vec(10 .* rand(16)...)
4 x Vec{4, Float64}
Vec{4, Float64}<2.9854020761951205, 0.5850049491494591, 8.048794496707682, 2.4447908274683128>
Vec{4, Float64}<8.989862723010972, 3.432180061119565, 5.218481943492845, 1.2908407939196165>
Vec{4, Float64}<7.147163281833258, 3.1638484401188416, 1.7880777724675267, 8.886571714674718>
Vec{4, Float64}<8.27759419030089, 3.093939597569817, 7.528383369444482, 1.3747962690374105>
julia> @benchmark log($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 55.613 ns (0.00% GC)
median time: 62.459 ns (0.00% GC)
mean time: 62.445 ns (0.00% GC)
maximum time: 123.043 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 982
julia> @benchmark exp($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 27.950 ns (0.00% GC)
median time: 31.336 ns (0.00% GC)
mean time: 31.293 ns (0.00% GC)
maximum time: 60.194 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 994
julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 29.472 ns (0.00% GC)
median time: 32.920 ns (0.00% GC)
mean time: 32.892 ns (0.00% GC)
maximum time: 85.069 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 994
julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 53.570 ns (0.00% GC)
median time: 58.309 ns (0.00% GC)
mean time: 58.315 ns (0.00% GC)
maximum time: 98.537 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 983
Apple M1 Native
julia> using VectorizationBase, SLEEFPirates
julia> vu = Vec(10 .* rand(16)...)
8 x Vec{2, Float64}
Vec{2, Float64}<8.349313127641388, 7.039690217372789>
Vec{2, Float64}<6.433077972087, 2.863664951775906>
Vec{2, Float64}<1.4755302144925642, 8.827766753134048>
Vec{2, Float64}<6.467495205043918, 7.489247759558242>
Vec{2, Float64}<1.755162644186694, 2.895612011356581>
Vec{2, Float64}<8.843342380245545, 9.290792214755223>
Vec{2, Float64}<6.224625954236844, 1.4039314108511536>
Vec{2, Float64}<5.011570176713553, 1.2622980565493647>
julia> @benchmark log($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 38.096 ns (0.00% GC)
median time: 38.223 ns (0.00% GC)
mean time: 38.389 ns (0.00% GC)
maximum time: 56.409 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 992
julia> @benchmark exp($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 20.040 ns (0.00% GC)
median time: 20.248 ns (0.00% GC)
mean time: 20.322 ns (0.00% GC)
maximum time: 32.899 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 998
julia> @benchmark exp2($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 19.790 ns (0.00% GC)
median time: 19.956 ns (0.00% GC)
mean time: 19.975 ns (0.00% GC)
maximum time: 42.877 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 998
julia> @benchmark sincos($vu)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 34.631 ns (0.00% GC)
median time: 34.799 ns (0.00% GC)
mean time: 34.865 ns (0.00% GC)
maximum time: 47.530 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 995
julia> versioninfo()
Julia Version 1.7.0-DEV.763
Commit 2ec75d65ce (2021-03-29 14:29 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin20.3.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
JULIA_NUM_THREADS = 8
You could run these benchmarks on your own system to get an idea of performance changes. Rocket Lake should of course be faster than Tiger Lake (similar arch, faster clock speed).
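If you want to try it, the special-function benchmark in copy-paste form (the same code as in the sessions above):

using VectorizationBase, SLEEFPirates, BenchmarkTools
vu = Vec(10 .* rand(16)...) # 16 random Float64s in [0, 10), split into hardware-width Vecs
@benchmark log($vu)
@benchmark exp($vu)
@benchmark exp2($vu)
@benchmark sincos($vu)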
E.g., sincos, Tiger Lake:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
           foreachf(sincos, 10_000_000, vu)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.98e+09   50.0%  #  4.4 cycles per ns
┌ instructions             8.26e+09   75.2%  #  2.8 insns per cycle
│ branch-instructions      1.24e+09   75.2%  # 15.1% of instructions
└ branch-misses            1.60e+06   75.2%  #  0.1% of branch instructions
┌ task-clock               6.79e+08  100.0%  # 678.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    7.62e+07   75.3%  #  3.2% of dcache loads
│ L1-dcache-loads          2.39e+09   75.3%
└ L1-icache-load-misses    4.88e+04   75.3%
┌ dTLB-load-misses         1.62e+06   24.7%  #  0.1% of dTLB loads
└ dTLB-loads               2.35e+09   24.7%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Cascade Lake:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
           foreachf(sincos, 10_000_000, vu)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.18e+09   50.0%  #  4.1 cycles per ns
┌ instructions             8.27e+09   75.0%  #  2.6 insns per cycle
│ branch-instructions      1.24e+09   75.0%  # 15.0% of instructions
└ branch-misses            1.49e+06   75.0%  #  0.1% of branch instructions
┌ task-clock               7.83e+08  100.0%  # 783.4 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    7.56e+07   25.0%  #  3.1% of dcache loads
│ L1-dcache-loads          2.45e+09   25.0%
└ L1-icache-load-misses    3.03e+05   25.0%
┌ dTLB-load-misses         2.41e+05   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               2.45e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Tiger Lake had both higher IPC and a higher clock speed here. This is despite the fact that these special functions are also full of fma instructions (polynomial evaluation), just not quite as full of them as matmul is.
Looking at the generated code of sincos, I think I could do a better job optimizing it for AVX512.
log and exp2 are both probably much faster than what you can get without AVX512.
But the best I can do benchmark-wise for comparison is Haswell, which came out in 2013. I'd be interested in comparisons with Zen2 and Zen3.
Note that the Ryzen (3/5)900X and (3/5)950X have 64 MiB of L3 cache. Even though they have only two memory channels, hopefully you could work around this by chunking your data and making use of the L3 (see the sketch below).
You can fit a lot of data in 64 MiB; many compute-heavy workloads would fit in it entirely.
When talking BLAS:
julia> sqrt(64 * (1<<20) / 8 / 3) # 64 MiB of L3 / 8 bytes per Float64 / 3 matrices, then the square side length
1672.184997739983
You can fit three 1600x1600 Float64 matrices inside 64 MiB.
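And a minimal sketch of the chunking idea from above, assuming a hypothetical reduction that makes several passes over the data; each block is sized to stay resident in a 64 MiB L3, so repeated passes hit cache instead of re-streaming the whole array through the two memory channels:

# Hypothetical example: make npasses reductions over x, blocked to fit in L3.
function blocked_sum(x::Vector{Float64}; l3_bytes = 64 * (1 << 20), npasses = 4)
    blocklen = (l3_bytes ÷ 2) ÷ sizeof(Float64) # use half the L3 per block
    s = 0.0
    for start in 1:blocklen:length(x)
        block = view(x, start:min(start + blocklen - 1, length(x)))
        for _ in 1:npasses # every pass after the first finds the block in cache
            s += sum(block)
        end
    end
    return s
end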
I hope one day we'll be able to buy something like the A64FX, which has 1 TB/second of memory bandwidth, many times more than any x64 CPU.
Many models fit with MCMC, or that have ODEs, can be extremely compute intensive without requiring that much memory.